SlideShare ist ein Scribd-Unternehmen logo
1 von 15
Downloaden Sie, um offline zu lesen
What’s new and awesome
       in pandas
pandas?
In [13]: foo
Out[13]:
    methyl1    age      edu         something   indic
0   38.36      30to39   geCollege   1           False
1   37.85      lt30     geCollege   1           False
2   38.57      30to39   geCollege   1           False
3   39.75      30to39   geCollege   1           True
4   43.83      30to39   geCollege   1           True
5   39.08      30to39   ltHS        1           True


  Size-mutable “labeled arrays” that
     can handle heterogeneous data
Kinda like a structured array??

•  Automatic data alignment with lots of
   reshaping and indexing methods

•  Implicit and explicit handling of missing
   data

•  Easy time series functionality
    –  Far less fuss than scikits.timeseries

•  Lots of in-memory SQL-like operations
   (group by, join, etc.)
pandas?
•  Extremely good for financial data
  –  StackOverflow: “this is a beast of a financial
     analysis tool”



•  One of the better relational data
   munging tools in any language?

•  But also has maybe 60+% of what R
   users expect when they come to
   Python
1. Heavily redesigned
         internals
•  Merged old DataFrame and DataMatrix
   into a single DataFrame: retain
   optimal performance where possible

•  Internal BlockManager class manages
   homogeneous ndarrays for optimal
   performance and reshaping
1. Heavily redesigned
         internals
•  Better handling of missing data for
   non-floating point dtypes

•  Soon: DataFrame variant with N-dim
   “hyperslabs”
2. Fancier indexing
Mix boolean / integer / label /
slice-based indexing

df.ix[0]
df.ix[date1:date2]
df.ix[:5, ‘A’:’F’]


Setting works too

df.ix[df[‘A’] > 0, [‘B’, ‘C’, ‘D’]] = nan
3. More robust IO
data_frame = read_csv(‘mydata.csv’)

data_frame2 = read_table(‘mydata.txt’, sep=‘t’,
                         skiprows=[1,2],
                         na_values=[‘#N/A NA’])



store = HDFStore(‘pytables.h5’)
store[‘a’] = data_frame
store[‘b’] = data_frame2
4. Better pivoting / reshaping

    foo   bar    A         B         C
0   one   a     -0.0524    1.664     1.171
1   one   a      0.2514    0.8306   -1.396
2   one   b      0.1256    0.3897    0.5227
3   one   b     -0.9301    0.6513   -0.2313
4   one   c      2.037     1.938    -0.3454
5   two   a      0.2073    0.7857    0.9051
6   two   a     -1.032    -0.8615    1.028
7   two   b     -0.7319   -1.846     0.9294
8   two   b      0.1004   -1.19      0.6043
9   two   c     -1.008    -0.3339    0.09522
4. Better pivoting / reshaping

In [29]: pivoted = df.pivot('bar', 'foo')

In [30]: pivoted['B']
Out[30]:
    one      two
a   1.664    0.7857
b   0.8306 -0.8615
c   0.3897 -1.846
d   0.6513 -1.19
e   1.938   -0.3339
4. Better pivoting / reshaping

In [31]: pivoted.major_xs('a')
Out[31]:
      A        B        C
one -0.0524    1.664    1.171
two   0.2073   0.7857   0.9051


In [32]: pivoted.minor_xs('one')
Out[32]:
    A        B        C
a -0.0524    1.664    1.171
b   0.2514   0.8306 -1.396
c   0.1256   0.3897   0.5227
d -0.9301    0.6513 -0.2313
e   2.037    1.938   -0.3454
4. Better pivoting / reshaping

In [30]: pivoted['B']
Out[30]:
    one      two
a   1.664    0.7857
b   0.8306 -0.8615
c   0.3897 -1.846
d   0.6513 -1.19
e   1.938   -0.3339
4. Some other things
•  “Sparse” (mostly NA) versions of
   data structures
•  Time zone support in DateRange
•  Generic moving window function
   rolling_apply
Near future
•  More powerful Group By

•  Flexible, fast frequency (time series) conversions

•  More integration with statsmodels
Thanks!
•  Hack: github.com/wesm/pandas

•  Twitter: @wesmckinn

•  Blog: blog.wesmckinney.com

Weitere ähnliche Inhalte

Was ist angesagt?

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and developmentWes McKinney
 
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkSpark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkDatabricks
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...Modern Data Stack France
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and RDatabricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkDatabricks
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...Andrew Lamb
 
Large Scale Data Analysis Tools
Large Scale Data Analysis ToolsLarge Scale Data Analysis Tools
Large Scale Data Analysis Toolsboorad
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Databricks
 
New Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit EastNew Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit EastDatabricks
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache SparkDatabricks
 

Was ist angesagt? (20)

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
 
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkSpark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
 
Overview of the Hive Stinger Initiative
Overview of the Hive Stinger InitiativeOverview of the Hive Stinger Initiative
Overview of the Hive Stinger Initiative
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache Spark
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
 
Large Scale Data Analysis Tools
Large Scale Data Analysis ToolsLarge Scale Data Analysis Tools
Large Scale Data Analysis Tools
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
 
New Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit EastNew Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit East
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Spark - Philly JUG
Spark  - Philly JUGSpark  - Philly JUG
Spark - Philly JUG
 
Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache Spark
 

Ähnlich wie SciPy 2011 pandas lightning talk

Using the python_data_toolkit_timbers_slides
Using the python_data_toolkit_timbers_slidesUsing the python_data_toolkit_timbers_slides
Using the python_data_toolkit_timbers_slidesTiffany Timbers
 
Time Series Analysis:Basic Stochastic Signal Recovery
Time Series Analysis:Basic Stochastic Signal RecoveryTime Series Analysis:Basic Stochastic Signal Recovery
Time Series Analysis:Basic Stochastic Signal RecoveryDaniel Cuneo
 
Graphs for AI & ML, Jim Webber, Neo4j
Graphs for AI & ML, Jim Webber, Neo4jGraphs for AI & ML, Jim Webber, Neo4j
Graphs for AI & ML, Jim Webber, Neo4jNeo4j
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Julian Hyde
 
In Search of Plan Stability - Part 1
In Search of Plan Stability - Part 1In Search of Plan Stability - Part 1
In Search of Plan Stability - Part 1Enkitec
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and FastJulian Hyde
 
Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...
Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...
Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...Databricks
 
30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature SelectionJames Huang
 
Getting started with R when analysing GitHub commits
Getting started with R when analysing GitHub commitsGetting started with R when analysing GitHub commits
Getting started with R when analysing GitHub commitsBarbara Fusinska
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfcookie1969
 
Golang in TiDB (GopherChina 2017)
Golang in TiDB  (GopherChina 2017)Golang in TiDB  (GopherChina 2017)
Golang in TiDB (GopherChina 2017)PingCAP
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache CalciteJulian Hyde
 
python高级内存管理
python高级内存管理python高级内存管理
python高级内存管理rfyiamcool
 
30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature SelectionJames Huang
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkDatabricks
 
[Www.pkbulk.blogspot.com]file and indexing
[Www.pkbulk.blogspot.com]file and indexing[Www.pkbulk.blogspot.com]file and indexing
[Www.pkbulk.blogspot.com]file and indexingAnusAhmad
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 

Ähnlich wie SciPy 2011 pandas lightning talk (20)

Using the python_data_toolkit_timbers_slides
Using the python_data_toolkit_timbers_slidesUsing the python_data_toolkit_timbers_slides
Using the python_data_toolkit_timbers_slides
 
Time Series Analysis:Basic Stochastic Signal Recovery
Time Series Analysis:Basic Stochastic Signal RecoveryTime Series Analysis:Basic Stochastic Signal Recovery
Time Series Analysis:Basic Stochastic Signal Recovery
 
Graphs for AI & ML, Jim Webber, Neo4j
Graphs for AI & ML, Jim Webber, Neo4jGraphs for AI & ML, Jim Webber, Neo4j
Graphs for AI & ML, Jim Webber, Neo4j
 
R_Proficiency.pptx
R_Proficiency.pptxR_Proficiency.pptx
R_Proficiency.pptx
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
 
In Search of Plan Stability - Part 1
In Search of Plan Stability - Part 1In Search of Plan Stability - Part 1
In Search of Plan Stability - Part 1
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
 
Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...
Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...
Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...
 
Pandas
PandasPandas
Pandas
 
30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection
 
Getting started with R when analysing GitHub commits
Getting started with R when analysing GitHub commitsGetting started with R when analysing GitHub commits
Getting started with R when analysing GitHub commits
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
 
Golang in TiDB (GopherChina 2017)
Golang in TiDB  (GopherChina 2017)Golang in TiDB  (GopherChina 2017)
Golang in TiDB (GopherChina 2017)
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache Calcite
 
python高级内存管理
python高级内存管理python高级内存管理
python高级内存管理
 
30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache Spark
 
[Www.pkbulk.blogspot.com]file and indexing
[Www.pkbulk.blogspot.com]file and indexing[Www.pkbulk.blogspot.com]file and indexing
[Www.pkbulk.blogspot.com]file and indexing
 
Quick Wins
Quick WinsQuick Wins
Quick Wins
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 

Mehr von Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
 

Mehr von Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 

Kürzlich hochgeladen

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Kürzlich hochgeladen (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

SciPy 2011 pandas lightning talk

  • 1. What’s new and awesome in pandas
  • 2. pandas? In [13]: foo Out[13]: methyl1 age edu something indic 0 38.36 30to39 geCollege 1 False 1 37.85 lt30 geCollege 1 False 2 38.57 30to39 geCollege 1 False 3 39.75 30to39 geCollege 1 True 4 43.83 30to39 geCollege 1 True 5 39.08 30to39 ltHS 1 True Size-mutable “labeled arrays” that can handle heterogeneous data
  • 3. Kinda like a structured array?? •  Automatic data alignment with lots of reshaping and indexing methods •  Implicit and explicit handling of missing data •  Easy time series functionality –  Far less fuss than scikits.timeseries •  Lots of in-memory SQL-like operations (group by, join, etc.)
  • 4. pandas? •  Extremely good for financial data –  StackOverflow: “this is a beast of a financial analysis tool” •  One of the better relational data munging tools in any language? •  But also has maybe 60+% of what R users expect when they come to Python
  • 5. 1. Heavily redesigned internals •  Merged old DataFrame and DataMatrix into a single DataFrame: retain optimal performance where possible •  Internal BlockManager class manages homogeneous ndarrays for optimal performance and reshaping
  • 6. 1. Heavily redesigned internals •  Better handling of missing data for non-floating point dtypes •  Soon: DataFrame variant with N-dim “hyperslabs”
  • 7. 2. Fancier indexing Mix boolean / integer / label / slice-based indexing df.ix[0] df.ix[date1:date2] df.ix[:5, ‘A’:’F’] Setting works too df.ix[df[‘A’] > 0, [‘B’, ‘C’, ‘D’]] = nan
  • 8. 3. More robust IO data_frame = read_csv(‘mydata.csv’) data_frame2 = read_table(‘mydata.txt’, sep=‘t’, skiprows=[1,2], na_values=[‘#N/A NA’]) store = HDFStore(‘pytables.h5’) store[‘a’] = data_frame store[‘b’] = data_frame2
  • 9. 4. Better pivoting / reshaping foo bar A B C 0 one a -0.0524 1.664 1.171 1 one a 0.2514 0.8306 -1.396 2 one b 0.1256 0.3897 0.5227 3 one b -0.9301 0.6513 -0.2313 4 one c 2.037 1.938 -0.3454 5 two a 0.2073 0.7857 0.9051 6 two a -1.032 -0.8615 1.028 7 two b -0.7319 -1.846 0.9294 8 two b 0.1004 -1.19 0.6043 9 two c -1.008 -0.3339 0.09522
  • 10. 4. Better pivoting / reshaping In [29]: pivoted = df.pivot('bar', 'foo') In [30]: pivoted['B'] Out[30]: one two a 1.664 0.7857 b 0.8306 -0.8615 c 0.3897 -1.846 d 0.6513 -1.19 e 1.938 -0.3339
  • 11. 4. Better pivoting / reshaping In [31]: pivoted.major_xs('a') Out[31]: A B C one -0.0524 1.664 1.171 two 0.2073 0.7857 0.9051 In [32]: pivoted.minor_xs('one') Out[32]: A B C a -0.0524 1.664 1.171 b 0.2514 0.8306 -1.396 c 0.1256 0.3897 0.5227 d -0.9301 0.6513 -0.2313 e 2.037 1.938 -0.3454
  • 12. 4. Better pivoting / reshaping In [30]: pivoted['B'] Out[30]: one two a 1.664 0.7857 b 0.8306 -0.8615 c 0.3897 -1.846 d 0.6513 -1.19 e 1.938 -0.3339
  • 13. 4. Some other things •  “Sparse” (mostly NA) versions of data structures •  Time zone support in DateRange •  Generic moving window function rolling_apply
  • 14. Near future •  More powerful Group By •  Flexible, fast frequency (time series) conversions •  More integration with statsmodels
  • 15. Thanks! •  Hack: github.com/wesm/pandas •  Twitter: @wesmckinn •  Blog: blog.wesmckinney.com