5. Started my career in computational science
Satellites measure backscatter; computer algorithms produce estimates of Earth features:
• Wind Speed
• Ice Cover
• Vegetation
• (and more)
6. More Science led to Python
Raja Muthupillai, Armando Manduca, Richard Ehman, Jim Greenleaf (1997)
7. First Project (1998 — )
Started as Multipack in 1998 and became
SciPy in 2001 with the help of other
colleagues
115 releases, 815 contributors
Used by: 156,525
8. SciPy
“Distribution of Python Numerical Tools masquerading as one Library”
Name         Description
cluster      K-means and Vector Quantization
fftpack      Discrete Fourier Transform
integrate    Numerical Integration
interpolate  Interpolation routines
io           Data Input and Output
linalg       Fast Linear Algebra
misc         Utilities
ndimage      N-dimensional Image Processing
odr          Orthogonal Distance Regression
optimize     Constrained and Unconstrained Optimization
signal       Signal Processing Tools
sparse       Sparse Matrices and Algebra
spatial      Spatial Data Structures and Algorithms
special      Special Functions (e.g. Bessel)
stats        Statistical Functions and Distributions
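As a quick illustration of how these subpackages fit together, here is a minimal sketch (my example, not from the slides) that uses `scipy.integrate` for numerical integration and `scipy.optimize` for root finding:

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: integrate sin(x) from 0 to pi.
# The exact answer is 2.
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Root finding: locate the zero of cos(x) in [0, 2].
# The exact answer is pi/2.
root = optimize.brentq(np.cos, 0, 2)

print(area)   # close to 2.0
print(root)   # close to 1.5707963
```

Each subpackage follows the same pattern: plain Python functions operating on NumPy arrays or callables.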
10. My Open Source addiction continued…
Gave up my chance at a tenured academic position in 2005-2006 to unify the diverging array community in Python by bringing Numeric and Numarray together.
166 releases, 866 contributors
Used by: 314,759
11. NumPy: an Array Extension of Python
• Data: the array object
– slicing and shaping
– data-type map to bytes (dtype)
• Fast Math (ufuncs):
– vectorization
– broadcasting
– aggregations
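The bullets above can be shown in a few lines. A minimal sketch, assuming only NumPy:

```python
import numpy as np

# Data: the array object, with a dtype that maps values to bytes.
a = np.arange(12, dtype=np.int32).reshape(3, 4)

# Slicing and shaping produce views into the same memory.
col = a[:, 1]                       # second column: [1, 5, 9]

# Fast math (ufuncs): vectorization -- no explicit Python loop.
doubled = a * 2

# Broadcasting: a (3, 4) array combined with a (4,) array.
shifted = a + np.array([10, 20, 30, 40])

# Aggregations: reductions along an axis.
row_sums = a.sum(axis=1)            # [6, 22, 38]
```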
12. Brief History of NumPy
Person                                     Package        Year
Jim Fulton                                 Matrix Object  1994
Jim Hugunin                                Numeric        1995
Perry Greenfield, Rick White, Todd Miller  Numarray       2001
Travis Oliphant                            NumPy          2005
13. NumPy was created to unify array objects in Python and unify the PyData community
Numeric + Numarray -> NumPy
I started this unification project and ended up sacrificing my tenure at a university to write and release NumPy.
21. Not all open-source is the same!
Community-Driven Open Source Software (CDOSS)
• Anyone can become the leader.
• Multiple stakeholders.
• Community size is a useful gauge of health.
• Users become contributors more often.
• Examples: Jupyter, NumPy, SciPy, Pandas
Company-Backed Open Source Software (CBOSS)
• You need to work at the company to be the leader.
• Many users, fewer developers.
• To judge health, you need to understand the company's incentives.
• Examples: TensorFlow, PyTorch, Conda
Both governance models can be valuable, but they have different implications!
22. Huge Impact (from diverse efforts of 1000s)
• LIGO: Gravitational Waves
• Higgs Boson Discovery
• Black Hole Imaging
24. Example — Amazon Photo
Automatic facial recognition; user feedback on face names updates the model.
25. A neural network with several layers, trained with ~130,000 images, matched trained dermatologists with 91% area under the sensitivity-specificity curve.
Keys:
• Access to Data
• Access to Software
• Access to Compute
27. (Three foundational projects, identified by logo on the slide)
• Downloads: 49 Million · Estimated Cost: $7.57 Million · Contributors: 866 · Estimated Effort: 76 person-years · Current Maintainers: 3
• Downloads: 27.7 Million · Estimated Cost: $7 Million · Contributors: 1,666 · Estimated Effort: 70 person-years · Current Maintainers: 3
• Downloads: 13.8 Million · Estimated Cost: $6.63 Million · Contributors: 860 · Estimated Effort: 64 person-years · Current Maintainers: 2
Development of these projects began in 2003, 2005, and 2008.
The original developers were not paid to work on or improve these libraries!
28. OSS Sustainability
• Developers get “burned-out” when many
people use their tools but there is no
money to maintain or improve them.
• Developers can live unbalanced lives.
• Multi-billion dollar companies are
benefiting from volunteer labor and not
giving back.
• Foundational libraries are not maintained
and key insights from creators don’t get
back into the code.
29. For example: Here was my list for NumPy in 2012
• NDArray improvements
– Indexes (esp. for structured arrays)
– SQL front-end
– Multi-level, hierarchical labels
– Selection via mappings (labeled arrays)
– Memory spaces (array made up of regions)
– Distributed arrays (global array)
– Compressed arrays
– Standard distributed persistence
– Fancy indexing as view, and optimizations
– Streaming arrays
• Dtype improvements
– Enumerated types (including dynamic enumeration)
– Derived fields
– Specification as a class (or JSON)
– Pointer dtype (i.e. C++ object, or varchar)
– Finishing datetime
– Missing data with both bit-patterns and masks
– Parameterized field names
• Ufunc improvements
– Generalized ufuncs supporting more than just contiguous arrays
– Specification of ufuncs in Python
– Move most dtype “array functions” to ufuncs
– Unify error-handling for all computations
– Allow lazy evaluation and remote computation (streaming and generator data)
– Structured and string dtype ufuncs
– Multi-core and GPU-optimized ufuncs
– Group-by reduction
30. Multiple other unrealized epiphanies…
• In 2014, I finally realized how I should have built dtypes: inheriting from a new “meta-type”, so that all NumPy “dtypes” are actually real Python types. This would have eliminated the need for the “ugly” array-scalars (which are semantically necessary in the current system).
• NumPy should have a smaller interface API that other array libraries could implement, instead of the entire API becoming a de facto array API.
• GPU and parallel-executing ufuncs should be built in.
• Apply-by and reduce-by should be NumPy functions.
I’ve never received budget to work on NumPy or SciPy (until this year, with a CZI grant from Facebook). Part of this is because I pursued other entrepreneurial mechanisms to generate resources, but part of this is because granting mechanisms are not set up to “maintain” community-driven open-source software.
31. Major Conclusions:
1. Python is the “Lingua Franca” for technical computing and machine learning / AI.
2. Python reached this status because it embraced array-oriented computing (NumPy and Pandas).
3. “Emergent” community-driven open source has a sustainability problem.
We (basically) realized our ultimate goal when we started SciPy in 1999!
But we are still searching for the means to sustain it.
32. What is array-oriented computing?
• Organize data together logically (and in memory)
• Operate on “chunks” at a time with high-level operations: map, join, reduce, transform, apply, filter
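The chunk-at-a-time style can be sketched in a few lines of NumPy (my example, not from the slides): the "map", "filter", and "reduce" steps each touch the whole array in one high-level call, with no explicit Python loop.

```python
import numpy as np

# One million samples, organized together in memory.
x = np.linspace(0.0, 1.0, 1_000_000)

# "map"/"transform": apply an operation to the whole chunk at once.
y = np.sin(2 * np.pi * x)

# "filter": a boolean mask selects elements without a Python loop.
positive = y[y > 0]

# "reduce": aggregate the chunk in a single call.
mean_positive = positive.mean()   # ~2/pi for the positive half of a sine
```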
35. Benefits of Array-oriented
• Many technical problems are naturally array-oriented (easy to vectorize)
• Algorithms can be expressed at a high level
• These algorithms can be parallelized more simply (quite often much information is lost in the translation to typical “compiled” languages)
• Array-oriented algorithms map well to modern hardware caches and pipelines.
• The software stack is now starting to re-focus as ML frameworks emerge.
• There is a reason Fortran remains popular.
39. Quick History
life←{↑1 ⍵∨.∧3 4=+/,¯1 0 1∘.⊖¯1 0 1∘.⌽⊂⍵}

Year  Language/Package   Person
1966  APL                Ken Iverson (IBM)
1984  APL2               IBM
1990  J                  Ken Iverson
1993  K -> Q             Arthur Whitney (used by KDB)
2019  K (new version)    Arthur Whitney
1996  Numeric (Python)   Jim Hugunin
2006  NumPy              Travis Oliphant
2012  Numba              Siu Kwan Lam

Lineage: APL -> J, K, Matlab -> Numeric -> NumPy
40. Putting Science back in Comp Sci
• Much of the software stack is for systems programming: C++, Java, .NET, ObjC, web
• This has been great for desktop computing but terrible for science:
– Complex numbers?
– Vectorized primitives?
– Multidimensional arrays?
• Array-oriented programming was supplanted by object-oriented programming
• The software stack for scientists was not as helpful as it should have been
• Fortran is still where many scientists ended up
• In the past 5 years this has been changing with the emergence of Python, Jupyter, Pandas, PyTorch (we still have a long way to go).
45. Conway’s game of Life
• Dead cell with exactly 3 live neighbors
will come to life
• A live cell with 2 or 3 neighbors will
survive
• With too few or too many neighbors, the
cell dies
47. Conway’s Game of Life: APL vs. NumPy
Initialization and update step in one line of APL:
life←{↑1 ⍵∨.∧3 4=+/,¯1 0 1∘.⊖¯1 0 1∘.⌽⊂⍵}
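The slide showed the NumPy side only as an image. Here is a sketch of the update step using the same rotate-and-sum idea as the APL one-liner (`np.roll` plays the role of APL's ⊖ and ⌽ rotations); the rules encoded are the ones from slide 45:

```python
import numpy as np

def life_step(grid):
    """One update of Conway's Game of Life on a toroidal boolean grid.

    Counts the 8 neighbors of every cell at once by summing shifted
    copies of the grid, then applies the birth/survival rules as
    whole-array boolean expressions.
    """
    neighbors = sum(np.roll(np.roll(grid, i, axis=0), j, axis=1)
                    for i in (-1, 0, 1) for j in (-1, 0, 1)
                    if (i, j) != (0, 0))
    # Birth: a dead cell with exactly 3 live neighbors comes to life.
    # Survival: a live cell with 2 or 3 neighbors survives.
    return (neighbors == 3) | (grid & (neighbors == 2))

# A "blinker" oscillates with period 2.
blinker = np.zeros((5, 5), dtype=bool)
blinker[2, 1:4] = True
assert np.array_equal(life_step(life_step(blinker)), blinker)
```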
48. Zen of NumPy
• strided is better than scattered
• contiguous is better than strided
• descriptive is better than imperative
• array-oriented is better than object-oriented
• broadcasting is a great idea
• vectorized is better than an explicit loop
• unless it’s too complicated or uses too much memory ---
then use Numba
• think in higher dimensions
Inspired by Tim Peters and “import this”
49. What is good about NumPy?
• Array-oriented
• Extensive Dtype System (including structures)
• C-API
• Simple to understand data-structure
• Memory mapping
• Syntax support from Python
• Large community of users
• Broadcasting
• Easy to interface C/C++/Fortran code
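Memory mapping, one of the strengths listed above, deserves a concrete look: the array lives in a file and the OS pages data in on demand, so arrays larger than RAM remain usable. A minimal sketch using a small temporary file (my example, not from the slides):

```python
import os
import tempfile

import numpy as np

# Create a file-backed array and fill it.
path = os.path.join(tempfile.mkdtemp(), "big.dat")
m = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000, 1000))
m[:] = 1.0
m.flush()   # make sure the data reaches the file

# Reopen read-only; slicing touches only the pages it needs.
m2 = np.memmap(path, dtype=np.float64, mode="r", shape=(1000, 1000))
chunk_sum = m2[500, :5].sum()   # 5.0 -- five ones
```

The same interface scales to multi-gigabyte files without changing the code.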
50. What is wrong with NumPy
• Dtype system is difficult to extend
• Immediate mode creates huge temporaries
• “Almost” an in-memory database comparable to SQLite (missing indexes)
• Weak integration with sparse arrays
• Lots of un-optimized parts
• Minimal support for multi-core / GPU
• Code-base is organic and hard to extend
• Tied to the CPython run-time (doesn’t work on other Python implementations)
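The "huge temporaries" complaint can be seen directly. A minimal sketch (my example, not from the slides) of the same computation written both ways:

```python
import numpy as np

a = np.ones(1_000_000)
b = np.ones(1_000_000)

# Immediate mode: every operator allocates a full-size temporary,
# so this one expression materializes (2*a), then (2*a + b), and
# then the final result -- three million-element buffers.
r1 = (2 * a + b) / 2

# The temporaries can be avoided by reusing one buffer via the
# `out=` argument of ufuncs, at some cost in readability.
r2 = np.multiply(a, 2)
np.add(r2, b, out=r2)
np.divide(r2, 2, out=r2)

assert np.array_equal(r1, r2)   # both are all 1.5
```

Projects like Numba and numexpr attack this problem by fusing whole expressions into a single loop.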
52. How I got involved…
Getting data into memory — fast!
• TableIO, Michael A. Miller, April 1998
• Reference Counting Essay, Guido van Rossum, May 1998 (http://www.python.org/doc/essays/refcnt/)
• NumPyIO, June 1998
53. How SciPy started…
Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999.
In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.

Milestone                                   Date
Gaussian quadrature                         5 Jan 1999
cephes 1.0                                  30 Jan 1999
sigtools 0.40                               23 Feb 1999
Numeric docs                                March 1999
cephes 1.1                                  9 Mar 1999
multipack 0.3                               13 Apr 1999
Helper routines                             14 Apr 1999
multipack 0.6 (leastsq, ode, fsolve, quad)  29 Apr 1999
sparse plan described                       30 May 1999
multipack 0.7                               14 Jun 1999
SparsePy 0.1                                5 Nov 1999
cephes 1.2 (vectorize)                      29 Dec 1999
54. Joined with others…
Started as Multipack in 1998 and became
SciPy in 2001 with the help of other
colleagues
115 releases, 815 contributors
Used by: 156,525
55. Don’t underestimate the importance of Team!
Anaconda success also depended on going from individual to a team
>700 contributors
56. Other People Matter
Use your brain to adapt to other people — this is why your brain is so big!
(The Social Brain Hypothesis and Human Evolution, Robin I. M. Dunbar)
Hypothesis: You carry and update “models of people” in your head, from very detailed to approximate. Dunbar numbers!
Know your model is incomplete:
• See people as “ends”, not your “means”.
• Believe in, love, and trust other people.
57. Keep an Open Mind: Be open to critique
The PEP 3118 debate over how to describe memory: dtype vs. ctypes.
Current me disagrees with past me! I am glad there were others in the debate.
64. What is next? What am I working on for the next 20 years…
65. Technology and Economic problems
1. General interoperability — low-level libraries that reduce silos of data and analysis
2. Better High-level APIs (more interfaces in Python supported by multiple
implementations)
3. Data Management — in particular Data Catalogues
4. Fixing Python’s Extension problem (the ecosystem helped Python grow but is also an anchor on its progress)
5. How to connect the trillions of dollars of market capital to the innovation available in
global, emergent, open-source communities.
66. High Level APIs for Arrays (Tensors),
DataFrames, and DataTypes
68. What will work!
• Create a statically typed subset
of Python that is then used to
extend Python — EPython
• Port NumPy, SciPy, Scikits to
EPython (borrow heavily from
Cython ideas but use mypy-style
typing instead of new syntax).
69. LABS
Sustaining the Future
Open-source innovation and maintenance around the entire data-science and AI workflow.
• NumPy ecosystem maintenance (PyData Core Team)
• Improve connection of NumPy to ML Frameworks
• GPU Support for NumPy Ecosystem
• Improve foundations of Array computing
• JupyterLab and JupyterHub
• Data Catalog standards
• Packaging (conda-forge, PyPA, etc.)
Projects:
• PySparse — sparse n-d arrays
• Ibis — Pandas-like front-end to SQL
• uarray — unified array interface for SciPy refactor
• xnd — re-factored NumPy (low-level cross-language libraries for N-D (tensor) computing)
• Bokeh
Collaborating with NumFOCUS!
Adapted from Jake Vanderplas, PyCon 2017 Keynote
70. Build and Connect Companies and Communities to Solve Challenging Problems with Data
Enables me to keep working on array-computing problems *and* the meta-problem of open-source funding.
71. Three Activities with One Mission
Services: Complete open-source service consulting in the PyData / NumFOCUS ecosystem, including data-science and ML. We provide part-time CTO work, custom software, staff augmentation, support, training, staffing, and mentoring.
Open Source Lab: Open Source Research Lab supporting the NumFOCUS and PyData community. Hiring developers, evangelists, tech writers, designers, and product managers for open-source projects.
Venture Fund: Early-stage funding to companies that provide return to investors and support open-source ecosystems with industry-disrupting products and services.
72. Some of the projects we support (identified by logo on the slide):
• Fast foundational N-D array (tensor) object for Python
• Extensive library of functions for NumPy
• GPU-enabled compiler for NumPy/Python
• Parallel and scaled Pandas and NumPy
• DataFrames for general data manipulation and statistics
• Notebook environments for rapid development and data analysis
• Desktop IDE for data science and ML
• Rapid development of dashboards for the Python/PyData ecosystem
• Easy and fast web-based interactive plots using Python
• Turn even very large datasets into images, accurately
• General sparse arrays for Python
• Cross-language libraries for array computing
• General and powerful symbolic mathematical library
• Very popular and powerful machine-learning library
73. An early-stage venture capital firm investing in startups that build on open-source technology and support the communities they depend on (11 companies).
A $20m fund, supporting FairOSS.
74. Problem
Open Source Teams:
• Burned out
• Underrepresented
• Underpaid
Organizations:
• Disconnected from the community
• Lack support and maintenance
There’s no easy way to connect the community with organizations.
75. Open Source Marketplace
Managing Partners:
• Provide open-source services
• Training / support
• Feature development / fixes
Funding Partners:
• Hire from the community
• Collectively fund
• Get the support they need to build effectively on open source
Open-source contributors create profiles for themselves and their projects and participate as actors in the market.
77. FairOSS
A Public Benefit Company (goal: growing the amount of freely available software)
• Owned by open-source contributors (will be doing a public fund-raise later this year)
• Those shareholders govern the organization (elect the board)
• The board appoints management and decides what is “fair”
Holds companies accountable:
• Allows usage of its trademarks only for companies that contribute back “fairly”
• Think “Kosher” or “Organic” labeling
• Companies give back via equity, revenue, and “in-kind” agreements with FairOSS
FairOSS is custodian of Revenue and Equity Agreements:
• Equity agreements mean that FairOSS holds shares, options, or warrants of the company (most companies are missing the open-source community from their ‘cap table’)
• Revenue agreements mean that companies pay FairOSS a portion of their revenue
• FairOSS distributes almost all of the proceeds from these agreements to the open-source communities
If successful, this would make open source investable and make available >$23 trillion of investment capital to open-source communities.
78. You can really change the world…
With Open Source Communities…
Let’s do more of that!