What's new in pandas and the SciPy stack for financial users
1. What’s new in pandas and the SciPy stack for financial users
Wes McKinney
2. Me
• AQR: August 2007 - July 2010
• Duke Statistics: 2010 - present (now on leave)
• My plans
• Improving Python libs for statistics and finance
• Building a financial software + consulting business
based on said tools
4. General sentiments
• Scientific Python growing solidly in finance and
in many other fields
• Though good sci-pythonistas are still scarce
• Important work happening in many of the core
projects
• Growing consensus: a new computational
model is needed to better cope with “big data”
5. NumPy
• Significantly refactored C internals
• Great progress on native datetime64 type
• Will significantly improve date-handling
performance and usability
• Extensible business day / holiday logic
planned / in progress
• Addition of low-level missing data (NA)
support in the works
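A minimal sketch of what native datetime64 arrays and business-day logic look like, assuming a later NumPy release in which `np.is_busday` and `np.busday_count` are available. The dates and the single-entry holiday list are illustrative only.

```python
import numpy as np

# Daily dates stored as a native datetime64 array (end date is exclusive).
days = np.arange("2011-07-01", "2011-07-09", dtype="datetime64[D]")

# Vectorized date arithmetic: shift every date forward by one week.
next_week = days + np.timedelta64(7, "D")

# Business-day logic with a user-supplied holiday list (illustrative).
holidays = np.array(["2011-07-04"], dtype="datetime64[D]")
is_open = np.is_busday(days, holidays=holidays)

# Count business days in [2011-07-01, 2011-07-09), skipping the holiday.
n_sessions = int(
    np.busday_count(
        np.datetime64("2011-07-01"),
        np.datetime64("2011-07-09"),
        holidays=holidays,
    )
)
```

Weekends and the listed holiday are excluded from `is_open` and `n_sessions`; the default week mask treats Monday through Friday as business days.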
6. IPython
• One of Python’s killer apps gets even better
• Rich Qt GUI console with inline plotting
• New and improved architecture for high perf
parallel / distributed computing
• See Fernando Pérez’s SciPy 2011 talk / video
7. Cython
• Still the first tool you should reach for to get
better performance
• New: OpenMP integration (for multi-core)
from cython.parallel import prange

with nogil:
    for i in prange(n):
        # do something in parallel
• Supports (almost) all of standard Python now
(some things, like closures, previously did not work)
8. statsmodels
• Statistics and econometrics in Python
• Major work in time series models over last year+
• VAR, SVAR models, eventually (V)ECM models
for cointegrated time series
• AR/ARMA, Kalman Filter, various macro filters
(e.g. Hodrick-Prescott) implemented
• Soon: Bayesian state space models (DLMs),
ARCH/GARCH models, etc.
9. statsmodels
• Major criticism: weak user interface
• No R-style formula framework
• pandas not integrated (need to pass raw
NumPy arrays)
• I have begun work on pandas integration;
formulas have been implemented and will
hopefully arrive within the next few months
10. pandas
• Still the Python data hacker’s best friend?
• Most recent release: 0.3.0 on 2/20/2011
• However, last 4 months have been the most
active development period in the library’s
history
• ~375 commits since 0.3.0 release (more than
the entire prior open source history)
12. Ambitious big picture
• I want to make pandas the cornerstone of the
“next generation” statistical computing
environment
• Ease-of-use, performance, flexibility all equally
important
13. Ambitious big picture
• Taking the best features of other languages (R
and friends) and making them better and
easier to use
• See my recent blog article “A Roadmap for
Rich Scientific Data Structures in Python”
14. pandas: under the hood
• Complete redesign of DataFrame internals
• Now a single class for 2D data retaining
optimal performance of old DataFrame and
DataMatrix classes
• Significantly improved mixed-type and missing
data handling
• Plan to use internal data structure to
implement “NDFrame” for n-dimensional data
15. Fancy indexing
• Index a Series / DataFrame in a matrix-like
way via the special .ix attribute, using:
• Slices with integers or labels
• Lists of integers, labels, or boolean vecs
• Integer or label locations
df.ix[0]
df.ix[date1:date2]
df.ix[:5, 'A':'F']
df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
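In later pandas releases `.ix` was deprecated and then removed in favor of `.loc` (labels) and `.iloc` (integer positions). A rough sketch of the same four indexing patterns in that later style, with a small made-up frame standing in for `df`:

```python
import numpy as np
import pandas as pd

# Illustrative data: six daily rows, four columns A..D.
dates = pd.date_range("2011-01-03", periods=6, freq="D")
df = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4) - 10.0,
    index=dates,
    columns=list("ABCD"),
)

# Integer location of a single row, like df.ix[0].
first_row = df.iloc[0]

# Label slice on the index; both endpoints are included.
window = df.loc[dates[1]:dates[3]]

# Row labels plus a label slice on the columns.
block = df.loc[dates[:5], "A":"C"]

# Boolean row mask plus a list of column labels, with assignment.
df.loc[df["A"] > 0, ["B", "C", "D"]] = np.nan
```

Label slices include the end label, unlike integer slices, which is worth keeping in mind when mixing the two styles.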
16. Misc new features
• “Sparse” (mostly NA) versions of Series,
DataFrame, WidePanel
• Many new functions on Series/DataFrame
• describe, quantile, select, drop, dropna,
corrwith, ...
• New moving window methods: rolling_quantile
and rolling_apply
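A short sketch of a few of these functions on a toy Series. It assumes a later pandas release, where the top-level `rolling_quantile` and `rolling_apply` functions were replaced by the `.rolling(...)` method; the data is made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])

# Drop missing values, then summarize what is left.
clean = s.dropna()
summary = clean.describe()
median = clean.quantile(0.5)

# Moving-window median over 3 observations (rolling_quantile in the old API).
rolling_med = clean.rolling(window=3).quantile(0.5)

# Moving-window custom function: the range of each window
# (rolling_apply in the old API).
rolling_rng = clean.rolling(window=3).apply(lambda w: w.max() - w.min(), raw=True)
```

The first two entries of each rolling result are NaN because a full window of three observations is not yet available.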
17. Improved IO
• read_csv, read_table functions more
flexible and robust, better type inferencing
df = read_table('foo.txt', skiprows=[0,1],
                na_values=['#N/A'])
• ExcelFile class for reading multiple sheets
out of .xls files
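To make the `skiprows` / `na_values` options concrete, here is a self-contained sketch that reads a tab-delimited table from an in-memory string instead of `foo.txt`. The file contents and the `index_col` / `parse_dates` arguments are assumptions added for illustration.

```python
import io
import pandas as pd

# Hypothetical export: two junk header lines, then a tab-delimited table.
raw = (
    "exported by some system\n"
    "run 42\n"
    "date\tprice\tvolume\n"
    "2011-07-01\t10.5\t100\n"
    "2011-07-05\t#N/A\t250\n"
)

df = pd.read_table(
    io.StringIO(raw),
    skiprows=[0, 1],        # skip the two junk lines
    na_values=["#N/A"],     # treat this marker as missing
    index_col=0,            # use the date column as the index
    parse_dates=True,       # parse the index as dates
)
```

Type inference gives a float column for `price` (because of the missing value) and an integer column for `volume`.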
18. Improved IO
• HDFStore class provides a complete, tested
dict-like PyTables storage container
store = HDFStore('mydata.h5')
store['x'] = x
store['y'] = y
y = store['y']
• Experimental: store as Table and query
store.put('df', df, table=True)
piece = store.select('df',
                     [{'field' : 'index', 'op' : '>=',
                       'value' : date}])
19. Group by enhancements
• Can group by multiple columns or key
functions, SQL-like but more general
• Syntactic sugar to invoke aggregation
functions on groups
• Automatic exclusion of “nuisance”
columns of DataFrames
• Various other usability enhancements
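A minimal sketch of the group-by patterns listed above, on made-up data (the tickers and returns are hypothetical). It assumes a later pandas release, where dropping non-numeric "nuisance" columns is requested explicitly with `numeric_only=True` rather than happening automatically.

```python
import pandas as pd

df = pd.DataFrame({
    "sector": ["tech", "tech", "energy", "energy", "tech"],
    "country": ["US", "UK", "US", "US", "US"],
    "ticker": ["AAA", "BBB", "CCC", "DDD", "EEE"],  # hypothetical tickers
    "ret": [0.01, 0.03, -0.02, 0.04, 0.05],
})

# Group by multiple columns, then aggregate with a method call.
by_two = df.groupby(["sector", "country"])["ret"].mean()

# Exclude non-numeric columns (country, ticker) from the aggregation.
numeric_means = df.groupby("sector").mean(numeric_only=True)

# Group by a key function applied to each index label (here: the year).
s = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime(["2010-06-30", "2010-12-31", "2011-06-30", "2011-12-30"]),
)
by_year = s.groupby(lambda ts: ts.year).sum()
```

Grouping by a function is the part that goes beyond SQL: any callable that maps an index label to a group key will do.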
20. Very soon: hierarchical indexing
• Enable axis ticks to be identified by multiple
labels instead of a single label
• Easily select subsets of data by “level”
• Create Excel-style pivot tables / cross-tabulations
in a sensible way
• Will integrate naturally with groupby
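The feature was still upcoming at the time of this talk. As a rough sketch of how these ideas look in later pandas releases, assuming the `MultiIndex` API as it eventually shipped and using made-up labels and values:

```python
import pandas as pd

# Each observation is identified by two labels: (quarter, ticker).
index = pd.MultiIndex.from_product(
    [["2011-Q1", "2011-Q2"], ["AAA", "BBB"]],  # hypothetical labels
    names=["quarter", "ticker"],
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Select a subset of the data by level.
q1 = s.xs("2011-Q1", level="quarter")

# Pivot one level into columns to get a cross-tabulation.
table = s.unstack("ticker")

# Group by a level of the index.
per_ticker = s.groupby(level="ticker").sum()
```

Here `table` has quarters as rows and tickers as columns, which is the pivot-table shape described above.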
21. Other misc things
• Flexible binary operators
• a.add(b, fill_value=0.)
• Some timezone support in DateRange
• Numerous performance optimizations
• See the (long) release notes =)
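For the `a.add(b, fill_value=0.)` bullet, a small sketch of the difference between the plain operator and the flexible method when the two indexes only partly overlap (toy values):

```python
import pandas as pd

a = pd.Series([1.0, 2.0], index=["x", "y"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Plain operator: labels present in only one operand give NaN.
plain = a + b

# Flexible method: a label missing from one side is treated as 0.
filled = a.add(b, fill_value=0.0)
```

With `fill_value`, only labels missing from both operands would still produce NaN.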
22. Planned work
• Fast time series up/downsampling
• Improved support and performance for high-frequency (HF) / tick data
• Even more sophisticated group by tools
• Better documentation, online screencast
tutorials / examples