SlideShare ist ein Scribd-Unternehmen logo
1 von 22
Downloaden Sie, um offline zu lesen
Financial data analysis in Python with pandas

                  Wes McKinney
                   @wesmckinn


                    10/17/2011




@wesmckinn ()     Data analysis with pandas   10/17/2011   1 / 22
My background




   3 years as a quant hacker at AQR, now consultant / entrepreneur
   Math and statistics background with the zest of computer science
   Active in scientific Python community
   My blog: http://blog.wesmckinney.com
   Twitter: @wesmckinn




    @wesmckinn ()          Data analysis with pandas      10/17/2011   2 / 22
Bare essentials for financial research



    Fast time series functionality
         Easy data alignment
         Date/time handling
         Moving window statistics
         Resamping / frequency conversion
    Fast data access (SQL databases, flat files, etc.)
    Data visualization (plotting)
    Statistical models
         Linear regression
         Time series models: ARMA, VAR, ...




     @wesmckinn ()            Data analysis with pandas   10/17/2011   3 / 22
Would be nice to have




   Portfolio and risk analytics, backtesting
        Easy enough to write yourself, though most people do a bad job of it
   Portfolio optimization
        Most financial firms use a 3rd party library anyway
   Derivative pricing
        Can use QuantLib in most languages




    @wesmckinn ()            Data analysis with pandas          10/17/2011   4 / 22
What are financial firms using?



   HFT: a C++ and hardware arms race, a different topic
   Research
        Mainstream: R, MATLAB, Python, ...
        Econometrics: Stata, eViews, RATS, etc.
        Non-programmatic environments: ClariFI, Palantir, ...
   Production
        Popular: Java, C#, C++
        Less popular, but growing: Python
        Fringe: Functional languages (Ocaml, Haskell, F#)




    @wesmckinn ()            Data analysis with pandas          10/17/2011   5 / 22
What are financial firms using?



   Many hybrid languages environments (e.g. Java/R, C++/R,
   C++/MATLAB, Python/C++)
        Which is the main implementation language?
        If main language is Java/C++, result is lower productivity and higher
        cost to prototyping new functionality
   Trends
        Banks and hedge funds are realizing that Java-based production
        systems can be replaced with 20% as much Python code (or less)
        MATLAB is being increasingly ditched in favor of Python. R and
        Python use for research generally growing




    @wesmckinn ()            Data analysis with pandas          10/17/2011   6 / 22
Python language



   Simple, expressive syntax
   Designed for readability, like “runnable pseudocode”
   Easy-to-use, powerful built-in types and data structures:
        Lists and tuples (fixed-size, immutable lists)
        Dicts (hash maps / associative arrays) and sets
   Everything’s an object, including functions
   “There should be one, and preferably only one way to do it”
   “Batteries included”: great general purpose standard library




    @wesmckinn ()            Data analysis with pandas         10/17/2011   7 / 22
A simple example: quicksort


Pseudocode from Wikipedia:

function qsort(array)
    if length(array) < 2
        return array
    var list less, greater
    select and remove a pivot value pivot from array
    for each x in array
        if x < pivot then append x to less
        else append x to greater
    return concat(qsort(less), pivot, qsort(greater))




     @wesmckinn ()           Data analysis with pandas   10/17/2011   8 / 22
A simple example: quicksort

First try Python implementation:
def qsort ( array ):
    if len ( array ) < 2:
        return array

    less , greater = [] , []

    pivot , rest = array [0] , array [1:]

    for x in rest :
        if x < pivot :
             less . append ( x )
        else :
             greater . append ( x )

    return qsort ( less ) + [ pivot ] + qsort ( greater )



      @wesmckinn ()         Data analysis with pandas       10/17/2011   9 / 22
A simple example: quicksort



Use list comprehensions:
def qsort ( array ):
    if len ( array ) < 2:
        return array

     pivot , rest = array [0] , array [1:]
     less = [ x for x in rest if x < pivot ]
     greater = [ x for x in rest if x >= pivot ]

     return qsort ( less ) + [ pivot ] + qsort ( greater )




      @wesmckinn ()         Data analysis with pandas        10/17/2011   10 / 22
A simple example: quicksort




Heck, fit it onto one line!
qs = lambda r : ( r if len ( r ) < 2
                  else ( qs ([ x for x in r [1:] if x < r [0]])
                         + [ r [0]]
                         + qs ([ x for x in r [1:] if x >= r [0]])))

Though that’s starting to look like Lisp code...




      @wesmckinn ()           Data analysis with pandas   10/17/2011   11 / 22
A simple example: quicksort


A quicksort using NumPy arrays
def qsort ( array ):
    if len ( array ) < 2:
        return array
    pivot , rest = array [0] , array [1:]
    less = rest [ rest < pivot ]
    greater = rest [ rest >= pivot ]
    return np . r_ [ qsort ( less ) , [ pivot ] , qsort ( greater )]

Of course no need for this when you can just do:

sorted_array = np.sort(array)




      @wesmckinn ()           Data analysis with pandas       10/17/2011   12 / 22
Python: drunk with power
This comic has way too much airtime but:




     @wesmckinn ()          Data analysis with pandas   10/17/2011   13 / 22
Staples of Python for science: MINS




   (M) matplotlib: plotting and data visualization
   (I) IPython: rich interactive computing and development environment
   (N) NumPy: multi-dimensional arrays, linear algebra, FFTs, random
   number generation, etc.
   (S) SciPy: optimization, probability distributions, signal processing,
   ODEs, sparse matrices, ...




     @wesmckinn ()          Data analysis with pandas        10/17/2011   14 / 22
Why did Python become popular in science?



   NumPy traces its roots to 1995
   Extremely easy to integrate C/C++/Fortran code
   Access fast low level algorithms in a high level, interpreted language
   The language itself
        “It fits in your head”
        “It [Python] doesn’t get in my way” - Robert Kern
   Python is good at all the things other scientific programming
   languages are not good at (e.g. networking, string processing, OOP)
   Liberal BSD license: can use Python for commercial applications




    @wesmckinn ()            Data analysis with pandas       10/17/2011   15 / 22
Some exciting stuff in the last few years


    Cython
         “Augmented” Python language with type declarations, for generating
         compiled extensions
         C-like speedups with Python-like development time
    IPython: enhanced interactive Python interpreter
         The best research and software development env for Python
         An integrated parallel / distributed computing backend
         GUI console with inline plotting and a rich HTML notebook (more on
         this later)
    PyCUDA / PyOpenCL: GPU computing in Python
         Transformed Python overnight into one of the best languages for doing
         GPU computing




     @wesmckinn ()            Data analysis with pandas        10/17/2011   16 / 22
Where has Python historically been weak?



   Rich data structures for data analysis and statistics
         NumPy arrays, while powerful, feel distinctly “lower level” if you’re
         used to R’s data.frame
         pandas has filled this gap over the last 2 years
   Statistics libraries
         Nowhere near the depth of R’s CRAN repository
         statsmodels provides tested implementations a lot of standard
         regression and time series models
         Turns out that most financial data analysis requires only fairly
         elementary statistical models




     @wesmckinn ()             Data analysis with pandas           10/17/2011    17 / 22
pandas library



    Began building at AQR in 2008, open-sourced late 2009
    Why
         R / MATLAB, while good for research / data analysis, are not suitable
         implementation languages for large-scale production systems
                (I personally don’t care for them for data analysis)
         Existing data structures for time series in R / MATLAB were too
         limited / not flexible enough my needs
    Core idea: indexed data structures capable of storing heterogeneous
    data
    Etymology: panel data structures




     @wesmckinn ()                Data analysis with pandas            10/17/2011   18 / 22
pandas in a nutshell



    A clean axis indexing design to support fast data alignment, lookups,
    hierarchical indexing, and more
    High-performance data structures
         Series/TimeSeries: 1D labeled vector
         DataFrame: 2D spreadsheet-like structure
         Panel: 3D labeled array, collection of DataFrames
    SQL-like functionality: GroupBy, joining/merging, etc.
    Missing data handling
    Time series functionality




     @wesmckinn ()              Data analysis with pandas    10/17/2011   19 / 22
pandas design philosophy



   “Think outside the matrix”: stop thinking about shape and start
   thinking about indexes
   Indexing and data alignment are essential
   Fault-tolerance: save you from common blunders caused by coding
   errors (specifically misaligned data)
   Lift the best features of other data analysis environments (R,
   MATLAB, Stata, etc.) and make them better, faster
   Performance and usability equally important




     @wesmckinn ()          Data analysis with pandas       10/17/2011   20 / 22
The pandas killer feature: indexing



    Each axis has an index
    Automatic alignment between differently-indexed objects: makes it
    nearly impossible to accidentally combine misaligned data
    Hierarchical indexing provides an intuitive way of structuring and
    working with higher-dimensional data
    Natural way of expressing “group by” and join-type operations
    Better integrated and more flexible indexing than anything available
    in R or MATLAB




     @wesmckinn ()           Data analysis with pandas       10/17/2011   21 / 22
Tutorial time




                     To the IPython console!




     @wesmckinn ()          Data analysis with pandas   10/17/2011   22 / 22

Weitere ähnliche Inhalte

Was ist angesagt?

Turtle graphics
Turtle graphicsTurtle graphics
Turtle graphics
grahamwell
 
02 c++ Array Pointer
02 c++ Array Pointer02 c++ Array Pointer
02 c++ Array Pointer
Tareq Hasan
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ankur bhalla
 

Was ist angesagt? (20)

Python Pandas
Python PandasPython Pandas
Python Pandas
 
Python Finance
Python FinancePython Finance
Python Finance
 
Python pandas Library
Python pandas LibraryPython pandas Library
Python pandas Library
 
Turtle graphics
Turtle graphicsTurtle graphics
Turtle graphics
 
Python for Data Science
Python for Data SciencePython for Data Science
Python for Data Science
 
Python Web Development Tutorial | Web Development Using Django | Edureka
Python Web Development Tutorial | Web Development Using Django | EdurekaPython Web Development Tutorial | Web Development Using Django | Edureka
Python Web Development Tutorial | Web Development Using Django | Edureka
 
Data Analysis and Visualization using Python
Data Analysis and Visualization using PythonData Analysis and Visualization using Python
Data Analysis and Visualization using Python
 
Statistics For Data Science | Statistics Using R Programming Language | Hypot...
Statistics For Data Science | Statistics Using R Programming Language | Hypot...Statistics For Data Science | Statistics Using R Programming Language | Hypot...
Statistics For Data Science | Statistics Using R Programming Language | Hypot...
 
Machine Learning: Bias and Variance Trade-off
Machine Learning: Bias and Variance Trade-offMachine Learning: Bias and Variance Trade-off
Machine Learning: Bias and Variance Trade-off
 
Big Data Analytics Lab File
Big Data Analytics Lab FileBig Data Analytics Lab File
Big Data Analytics Lab File
 
Linear Regression With R
Linear Regression With RLinear Regression With R
Linear Regression With R
 
Variables & Data Types In Python | Edureka
Variables & Data Types In Python | EdurekaVariables & Data Types In Python | Edureka
Variables & Data Types In Python | Edureka
 
Data warehousing unit 1
Data warehousing unit 1Data warehousing unit 1
Data warehousing unit 1
 
Python Anaconda Tutorial | Edureka
Python Anaconda Tutorial | EdurekaPython Anaconda Tutorial | Edureka
Python Anaconda Tutorial | Edureka
 
02 c++ Array Pointer
02 c++ Array Pointer02 c++ Array Pointer
02 c++ Array Pointer
 
Python Programming Tutorial | Edureka
Python Programming Tutorial | EdurekaPython Programming Tutorial | Edureka
Python Programming Tutorial | Edureka
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data science
Data scienceData science
Data science
 
Python PPT
Python PPTPython PPT
Python PPT
 
Python NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaPython NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | Edureka
 

Ähnlich wie Python for Financial Data Analysis with pandas

Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Andrey Vykhodtsev
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
Paco Nathan
 

Ähnlich wie Python for Financial Data Analysis with pandas (20)

SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly Detection
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
Data science-toolchain
Data science-toolchainData science-toolchain
Data science-toolchain
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Rattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageRattle Graphical Interface for R Language
Rattle Graphical Interface for R Language
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
About "Apache Cassandra"
About "Apache Cassandra"About "Apache Cassandra"
About "Apache Cassandra"
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
 

Mehr von Wes McKinney

Mehr von Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 

Kürzlich hochgeladen

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Kürzlich hochgeladen (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Python for Financial Data Analysis with pandas

  • 1. Financial data analysis in Python with pandas Wes McKinney @wesmckinn 10/17/2011 @wesmckinn () Data analysis with pandas 10/17/2011 1 / 22
  • 2. My background 3 years as a quant hacker at AQR, now consultant / entrepreneur Math and statistics background with the zest of computer science Active in scientific Python community My blog: http://blog.wesmckinney.com Twitter: @wesmckinn @wesmckinn () Data analysis with pandas 10/17/2011 2 / 22
  • 3. Bare essentials for financial research Fast time series functionality Easy data alignment Date/time handling Moving window statistics Resamping / frequency conversion Fast data access (SQL databases, flat files, etc.) Data visualization (plotting) Statistical models Linear regression Time series models: ARMA, VAR, ... @wesmckinn () Data analysis with pandas 10/17/2011 3 / 22
  • 4. Would be nice to have Portfolio and risk analytics, backtesting Easy enough to write yourself, though most people do a bad job of it Portfolio optimization Most financial firms use a 3rd party library anyway Derivative pricing Can use QuantLib in most languages @wesmckinn () Data analysis with pandas 10/17/2011 4 / 22
  • 5. What are financial firms using? HFT: a C++ and hardware arms race, a different topic Research Mainstream: R, MATLAB, Python, ... Econometrics: Stata, eViews, RATS, etc. Non-programmatic environments: ClariFI, Palantir, ... Production Popular: Java, C#, C++ Less popular, but growing: Python Fringe: Functional languages (Ocaml, Haskell, F#) @wesmckinn () Data analysis with pandas 10/17/2011 5 / 22
  • 6. What are financial firms using? Many hybrid languages environments (e.g. Java/R, C++/R, C++/MATLAB, Python/C++) Which is the main implementation language? If main language is Java/C++, result is lower productivity and higher cost to prototyping new functionality Trends Banks and hedge funds are realizing that Java-based production systems can be replaced with 20% as much Python code (or less) MATLAB is being increasingly ditched in favor of Python. R and Python use for research generally growing @wesmckinn () Data analysis with pandas 10/17/2011 6 / 22
  • 7. Python language Simple, expressive syntax Designed for readability, like “runnable pseudocode” Easy-to-use, powerful built-in types and data structures: Lists and tuples (fixed-size, immutable lists) Dicts (hash maps / associative arrays) and sets Everything’s an object, including functions “There should be one, and preferably only one way to do it” “Batteries included”: great general purpose standard library @wesmckinn () Data analysis with pandas 10/17/2011 7 / 22
  • 8. A simple example: quicksort Pseudocode from Wikipedia: function qsort(array) if length(array) < 2 return array var list less, greater select and remove a pivot value pivot from array for each x in array if x < pivot then append x to less else append x to greater return concat(qsort(less), pivot, qsort(greater)) @wesmckinn () Data analysis with pandas 10/17/2011 8 / 22
  • 9. A simple example: quicksort First try Python implementation: def qsort ( array ): if len ( array ) < 2: return array less , greater = [] , [] pivot , rest = array [0] , array [1:] for x in rest : if x < pivot : less . append ( x ) else : greater . append ( x ) return qsort ( less ) + [ pivot ] + qsort ( greater ) @wesmckinn () Data analysis with pandas 10/17/2011 9 / 22
  • 10. A simple example: quicksort Use list comprehensions: def qsort ( array ): if len ( array ) < 2: return array pivot , rest = array [0] , array [1:] less = [ x for x in rest if x < pivot ] greater = [ x for x in rest if x >= pivot ] return qsort ( less ) + [ pivot ] + qsort ( greater ) @wesmckinn () Data analysis with pandas 10/17/2011 10 / 22
  • 11. A simple example: quicksort Heck, fit it onto one line! qs = lambda r : ( r if len ( r ) < 2 else ( qs ([ x for x in r [1:] if x < r [0]]) + [ r [0]] + qs ([ x for x in r [1:] if x >= r [0]]))) Though that’s starting to look like Lisp code... @wesmckinn () Data analysis with pandas 10/17/2011 11 / 22
  • 12. A simple example: quicksort A quicksort using NumPy arrays def qsort ( array ): if len ( array ) < 2: return array pivot , rest = array [0] , array [1:] less = rest [ rest < pivot ] greater = rest [ rest >= pivot ] return np . r_ [ qsort ( less ) , [ pivot ] , qsort ( greater )] Of course no need for this when you can just do: sorted_array = np.sort(array) @wesmckinn () Data analysis with pandas 10/17/2011 12 / 22
  • 13. Python: drunk with power This comic has way too much airtime but: @wesmckinn () Data analysis with pandas 10/17/2011 13 / 22
  • 14. Staples of Python for science: MINS (M) matplotlib: plotting and data visualization (I) IPython: rich interactive computing and development environment (N) NumPy: multi-dimensional arrays, linear algebra, FFTs, random number generation, etc. (S) SciPy: optimization, probability distributions, signal processing, ODEs, sparse matrices, ... @wesmckinn () Data analysis with pandas 10/17/2011 14 / 22
  • 15. Why did Python become popular in science? NumPy traces its roots to 1995 Extremely easy to integrate C/C++/Fortran code Access fast low level algorithms in a high level, interpreted language The language itself “It fits in your head” “It [Python] doesn’t get in my way” - Robert Kern Python is good at all the things other scientific programming languages are not good at (e.g. networking, string processing, OOP) Liberal BSD license: can use Python for commercial applications @wesmckinn () Data analysis with pandas 10/17/2011 15 / 22
  • 16. Some exciting stuff in the last few years Cython “Augmented” Python language with type declarations, for generating compiled extensions C-like speedups with Python-like development time IPython: enhanced interactive Python interpreter The best research and software development env for Python An integrated parallel / distributed computing backend GUI console with inline plotting and a rich HTML notebook (more on this later) PyCUDA / PyOpenCL: GPU computing in Python Transformed Python overnight into one of the best languages for doing GPU computing @wesmckinn () Data analysis with pandas 10/17/2011 16 / 22
  • 17. Where has Python historically been weak? Rich data structures for data analysis and statistics NumPy arrays, while powerful, feel distinctly “lower level” if you’re used to R’s data.frame pandas has filled this gap over the last 2 years Statistics libraries Nowhere near the depth of R’s CRAN repository statsmodels provides tested implementations a lot of standard regression and time series models Turns out that most financial data analysis requires only fairly elementary statistical models @wesmckinn () Data analysis with pandas 10/17/2011 17 / 22
  • 18. pandas library Began building at AQR in 2008, open-sourced late 2009 Why R / MATLAB, while good for research / data analysis, are not suitable implementation languages for large-scale production systems (I personally don’t care for them for data analysis) Existing data structures for time series in R / MATLAB were too limited / not flexible enough my needs Core idea: indexed data structures capable of storing heterogeneous data Etymology: panel data structures @wesmckinn () Data analysis with pandas 10/17/2011 18 / 22
  • 19. pandas in a nutshell A clean axis indexing design to support fast data alignment, lookups, hierarchical indexing, and more High-performance data structures Series/TimeSeries: 1D labeled vector DataFrame: 2D spreadsheet-like structure Panel: 3D labeled array, collection of DataFrames SQL-like functionality: GroupBy, joining/merging, etc. Missing data handling Time series functionality @wesmckinn () Data analysis with pandas 10/17/2011 19 / 22
  • 20. pandas design philosophy “Think outside the matrix”: stop thinking about shape and start thinking about indexes Indexing and data alignment are essential Fault-tolerance: save you from common blunders caused by coding errors (specifically misaligned data) Lift the best features of other data analysis environments (R, MATLAB, Stata, etc.) and make them better, faster Performance and usability equally important @wesmckinn () Data analysis with pandas 10/17/2011 20 / 22
  • 21. The pandas killer feature: indexing Each axis has an index Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data Natural way of expressing “group by” and join-type operations Better integrated and more flexible indexing than anything available in R or MATLAB @wesmckinn () Data analysis with pandas 10/17/2011 21 / 22
  • 22. Tutorial time To the IPython console! @wesmckinn () Data analysis with pandas 10/17/2011 22 / 22