1. Financial data analysis in Python with pandas
Wes McKinney
@wesmckinn
10/17/2011
@wesmckinn () Data analysis with pandas 10/17/2011 1 / 22
2. My background
3 years as a quant hacker at AQR, now consultant / entrepreneur
Math and statistics background with the zest of computer science
Active in scientific Python community
My blog: http://blog.wesmckinney.com
Twitter: @wesmckinn
3. Bare essentials for financial research
Fast time series functionality
Easy data alignment
Date/time handling
Moving window statistics
Resampling / frequency conversion
Fast data access (SQL databases, flat files, etc.)
Data visualization (plotting)
Statistical models
Linear regression
Time series models: ARMA, VAR, ...
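A rough illustration of three items from this list (date/time handling, moving window statistics, and frequency conversion), written against the current pandas API rather than whatever was demoed in 2011. The dates and prices are made up for the sketch.

```python
import pandas as pd

# Hypothetical daily prices on six consecutive calendar days.
dates = pd.date_range("2011-10-03", periods=6, freq="D")
prices = pd.Series([100.0, 101.0, 103.0, 102.0, 105.0, 107.0], index=dates)

# Moving window statistic: 3-day rolling mean (the first two values are NaN).
rolling_mean = prices.rolling(window=3).mean()

# Frequency conversion: downsample to 2-day bins, keeping the last price per bin.
two_day_last = prices.resample("2D").last()

# Simple daily returns from a shifted copy; the two series line up on dates.
returns = prices / prices.shift(1) - 1
```

The shift-based return calculation works without any manual bookkeeping because both operands carry the same date index.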
4. Would be nice to have
Portfolio and risk analytics, backtesting
Easy enough to write yourself, though most people do a bad job of it
Portfolio optimization
Most financial firms use a 3rd party library anyway
Derivative pricing
Can use QuantLib in most languages
5. What are financial firms using?
HFT: a C++ and hardware arms race, a different topic
Research
Mainstream: R, MATLAB, Python, ...
Econometrics: Stata, EViews, RATS, etc.
Non-programmatic environments: ClariFI, Palantir, ...
Production
Popular: Java, C#, C++
Less popular, but growing: Python
Fringe: Functional languages (OCaml, Haskell, F#)
6. What are financial firms using?
Many hybrid-language environments (e.g. Java/R, C++/R,
C++/MATLAB, Python/C++)
Which is the main implementation language?
If the main language is Java/C++, the result is lower productivity and a
higher cost of prototyping new functionality
Trends
Banks and hedge funds are realizing that Java-based production
systems can be replaced with 20% as much Python code (or less)
MATLAB is increasingly being ditched in favor of Python. R and
Python use for research is generally growing
7. Python language
Simple, expressive syntax
Designed for readability, like “runnable pseudocode”
Easy-to-use, powerful built-in types and data structures:
Lists and tuples (fixed-size, immutable lists)
Dicts (hash maps / associative arrays) and sets
Everything’s an object, including functions
“There should be one, and preferably only one, obvious way to do it”
“Batteries included”: great general purpose standard library
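A small sketch of the built-in types listed above, and of functions being ordinary objects. All tickers, prices, and names here are invented for illustration.

```python
# Lists are mutable sequences; tuples are fixed-size and immutable.
prices = [101.5, 99.8, 100.2]
prices.append(102.1)
trade = ("AAPL", 100, 101.5)  # hypothetical (ticker, shares, price) record

# Dicts map keys to values; sets hold unique items.
positions = {"AAPL": 100, "MSFT": 250}
positions["GOOG"] = 50
tickers = set(positions) | {"AAPL", "IBM"}

# Functions are objects: they can be stored in a dict and passed around.
def spread(values):
    return max(values) - min(values)

stats = {"min": min, "max": max, "spread": spread}
summary = {name: func(prices) for name, func in stats.items()}
```

Here `summary` ends up holding one number per statistic, computed by looking each function up in a dict and calling it.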
8. A simple example: quicksort
Pseudocode from Wikipedia:
function qsort(array)
    if length(array) < 2
        return array
    var list less, greater
    select and remove a pivot value pivot from array
    for each x in array
        if x < pivot then append x to less
        else append x to greater
    return concat(qsort(less), pivot, qsort(greater))
9. A simple example: quicksort
First try Python implementation:
def qsort(array):
    if len(array) < 2:
        return array
    less, greater = [], []
    pivot, rest = array[0], array[1:]
    for x in rest:
        if x < pivot:
            less.append(x)
        else:
            greater.append(x)
    return qsort(less) + [pivot] + qsort(greater)
10. A simple example: quicksort
Use list comprehensions:
def qsort(array):
    if len(array) < 2:
        return array
    pivot, rest = array[0], array[1:]
    less = [x for x in rest if x < pivot]
    greater = [x for x in rest if x >= pivot]
    return qsort(less) + [pivot] + qsort(greater)
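A quick sanity check of the list-comprehension version against the built-in sort. This is only a sketch: the seed, list size, and value range are arbitrary choices.

```python
import random

def qsort(array):
    """List-comprehension quicksort with the first element as pivot."""
    if len(array) < 2:
        return array
    pivot, rest = array[0], array[1:]
    less = [x for x in rest if x < pivot]
    greater = [x for x in rest if x >= pivot]
    return qsort(less) + [pivot] + qsort(greater)

random.seed(0)  # fixed seed so the check is repeatable
data = [random.randint(0, 99) for _ in range(50)]

# The result should match the built-in sort, duplicates included.
assert qsort(data) == sorted(data)
# The input list is left untouched: each call builds new lists.
assert len(data) == 50
```

Because the pivot is always the first element, already-sorted input recurses once per element, so long sorted lists can hit Python's recursion limit.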
11. A simple example: quicksort
Heck, fit it onto one line!
qs = lambda r: (r if len(r) < 2
                else (qs([x for x in r[1:] if x < r[0]])
                      + [r[0]]
                      + qs([x for x in r[1:] if x >= r[0]])))
Though that’s starting to look like Lisp code...
12. A simple example: quicksort
A quicksort using NumPy arrays
import numpy as np

def qsort(array):
    if len(array) < 2:
        return array
    pivot, rest = array[0], array[1:]
    # Boolean masks select the elements on each side of the pivot
    less = rest[rest < pivot]
    greater = rest[rest >= pivot]
    return np.r_[qsort(less), [pivot], qsort(greater)]
Of course no need for this when you can just do:
sorted_array = np.sort(array)
13. Python: drunk with power
This comic has had way too much airtime, but: [comic image not reproduced]
14. Staples of Python for science: MINS
(M) matplotlib: plotting and data visualization
(I) IPython: rich interactive computing and development environment
(N) NumPy: multi-dimensional arrays, linear algebra, FFTs, random
number generation, etc.
(S) SciPy: optimization, probability distributions, signal processing,
ODEs, sparse matrices, ...
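To give a flavor of the NumPy entry above (arrays, linear algebra, random number generation), here is a minimal least-squares fit. It also touches the linear regression item from the essentials list earlier in the talk. The data is synthetic and the coefficients 2 and 3 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded random number generator

# Synthetic data: y = 2 + 3x plus a little Gaussian noise (made-up numbers).
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=200)

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef
```

With this little noise the fitted intercept and slope should land close to the values used to generate the data.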
15. Why did Python become popular in science?
NumPy traces its roots to 1995
Extremely easy to integrate C/C++/Fortran code
Access fast low level algorithms in a high level, interpreted language
The language itself
“It fits in your head”
“It [Python] doesn’t get in my way” - Robert Kern
Python is good at all the things other scientific programming
languages are not good at (e.g. networking, string processing, OOP)
Liberal BSD-style licenses: can use Python and its scientific libraries for commercial applications
16. Some exciting stuff in the last few years
Cython
“Augmented” Python language with type declarations, for generating
compiled extensions
C-like speedups with Python-like development time
IPython: enhanced interactive Python interpreter
The best research and software development environment for Python
An integrated parallel / distributed computing backend
GUI console with inline plotting and a rich HTML notebook (more on
this later)
PyCUDA / PyOpenCL: GPU computing in Python
Transformed Python overnight into one of the best languages for doing
GPU computing
17. Where has Python historically been weak?
Rich data structures for data analysis and statistics
NumPy arrays, while powerful, feel distinctly “lower level” if you’re
used to R’s data.frame
pandas has filled this gap over the last 2 years
Statistics libraries
Nowhere near the depth of R’s CRAN repository
statsmodels provides tested implementations of many standard
regression and time series models
Turns out that most financial data analysis requires only fairly
elementary statistical models
18. pandas library
Began building at AQR in 2008, open-sourced late 2009
Why
R / MATLAB, while good for research / data analysis, are not suitable
implementation languages for large-scale production systems
(I personally don’t care for them for data analysis)
Existing data structures for time series in R / MATLAB were too
limited / not flexible enough for my needs
Core idea: indexed data structures capable of storing heterogeneous
data
Etymology: panel data structures
19. pandas in a nutshell
A clean axis indexing design to support fast data alignment, lookups,
hierarchical indexing, and more
High-performance data structures
Series/TimeSeries: 1D labeled vector
DataFrame: 2D spreadsheet-like structure
Panel: 3D labeled array, collection of DataFrames
SQL-like functionality: GroupBy, joining/merging, etc.
Missing data handling
Time series functionality
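A brief sketch of several of these pieces together, using the current pandas API. Note that recent pandas releases no longer ship Panel, so only Series and DataFrame appear here. The tickers, prices, and sector labels are invented.

```python
import numpy as np
import pandas as pd

# Series: a 1D labeled vector (hypothetical closing prices, one missing).
close = pd.Series([378.0, 27.0, np.nan], index=["AAPL", "MSFT", "IBM"])

# DataFrame: a 2D table whose columns may have different types.
trades = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "AAPL", "IBM"],
    "shares": [100, 200, 50, 75],
})
sectors = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "IBM"],
    "sector": ["Tech", "Tech", "Services"],
})

# Missing data handling: drop the NaN, or fill it with a placeholder value.
known = close.dropna()
filled = close.fillna(0.0)

# SQL-like operations: join on a key, then aggregate by group.
merged = trades.merge(sectors, on="ticker", how="left")
shares_by_sector = merged.groupby("sector")["shares"].sum()
```

The merge plays the role of a SQL left join on `ticker`, and the groupby that follows corresponds to `GROUP BY sector` with a `SUM` aggregate.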
20. pandas design philosophy
“Think outside the matrix”: stop thinking about shape and start
thinking about indexes
Indexing and data alignment are essential
Fault-tolerance: save you from common blunders caused by coding
errors (specifically misaligned data)
Lift the best features of other data analysis environments (R,
MATLAB, Stata, etc.) and make them better, faster
Performance and usability equally important
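The "indexes, not shape" point can be made concrete by adding two same-length series whose labels differ. This is a sketch with invented tickers and returns.

```python
import numpy as np
import pandas as pd

# Two hypothetical return series covering overlapping but different tickers,
# stored in different orders.
a = pd.Series([0.01, 0.02, 0.03], index=["AAPL", "MSFT", "IBM"])
b = pd.Series([0.10, 0.20, 0.30], index=["IBM", "AAPL", "GOOG"])

# Raw arrays combine by position, silently pairing unrelated tickers.
by_position = a.to_numpy() + b.to_numpy()

# Series combine by label; tickers missing from either side become NaN.
by_label = a + b
```

The positional sum has the right shape but adds AAPL to IBM. The labeled sum pairs AAPL with AAPL and IBM with IBM, and marks MSFT and GOOG as missing instead of inventing a value.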
21. The pandas killer feature: indexing
Each axis has an index
Automatic alignment between differently-indexed objects: makes it
nearly impossible to accidentally combine misaligned data
Hierarchical indexing provides an intuitive way of structuring and
working with higher-dimensional data
Natural way of expressing “group by” and join-type operations
Better integrated and more flexible indexing than anything available
in R or MATLAB
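Hierarchical indexing and index-based "group by" might look like the following in current pandas. The dates, tickers, and volumes are made up, and the level names `date` and `ticker` are choices made for this sketch.

```python
import pandas as pd

# Hypothetical volumes indexed by two levels: (date, ticker).
index = pd.MultiIndex.from_product(
    [["2011-10-14", "2011-10-17"], ["AAPL", "MSFT"]],
    names=["date", "ticker"],
)
volume = pd.Series([10, 20, 30, 40], index=index)

# Select everything for one outer-level label.
one_day = volume.loc["2011-10-17"]

# "Group by" an index level without adding extra columns.
total_by_ticker = volume.groupby(level="ticker").sum()

# Pivot the inner level into columns to get a date-by-ticker table.
table = volume.unstack("ticker")
```

A single 1D Series here carries what is logically 2D data, and `unstack` converts it to the equivalent DataFrame when a tabular view is more convenient.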
22. Tutorial time
To the IPython console!