5. Started my career in computational science
Satellites measure backscatter; computer algorithms produce estimates of Earth features:
• Wind Speed
• Ice Cover
• Vegetation
• (and more)
6. More Science led to Python
Raja Muthupillai, Armando Manduca, Richard Ehman, Jim Greenleaf (1997)
7. First Project (1998 — )
Started as Multipack in 1998 and became
SciPy in 2001 with the help of other
colleagues
115 releases, 815 contributors
Used by: 156,525
8. SciPy
“Distribution of Python Numerical Tools masquerading as one Library”
Name         Description
cluster      K-means and Vector Quantization
fftpack      Discrete Fourier Transform
integrate    Numerical Integration
interpolate  Interpolation routines
io           Data Input and Output
linalg       Fast Linear Algebra
misc         Utilities
ndimage      N-dimensional Image Processing
odr          Orthogonal Distance Regression
optimize     Constrained and Unconstrained Optimization
signal       Signal Processing Tools
sparse       Sparse Matrices and Algebra
spatial      Spatial Data Structures and Algorithms
special      Special Functions (e.g. Bessel)
stats        Statistical Functions and Distributions
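As a quick illustration of how these subpackages fit together, here is a minimal sketch (my example, not from the slides) that uses `scipy.integrate` for numerical integration and `scipy.optimize` for root finding:

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: integrate sin(x) from 0 to pi.
# The exact answer is 2.
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Root finding: locate the zero of cos(x) in [0, 2].
# The exact answer is pi/2.
root = optimize.brentq(np.cos, 0, 2)

print(area)   # close to 2.0
print(root)   # close to 1.5707963
```

Each subpackage follows the same pattern: plain Python functions operating on NumPy arrays or callables.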
10. My Open Source addiction continued…
Gave up my chance at a tenured academic position in 2005-2006 to unify the diverging array community in Python by bringing Numeric and Numarray together.
166 releases, 866 contributors
Used by: 314,759
11. NumPy: an Array Extension of Python
• Data: the array object
– slicing and shaping
– data-type map to bytes (dtype)
• Fast Math (ufuncs):
– vectorization
– broadcasting
– aggregations
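The bullets above can be shown in a few lines. A minimal sketch, assuming only NumPy:

```python
import numpy as np

# Data: the array object, with a dtype that maps values to bytes.
a = np.arange(12, dtype=np.int32).reshape(3, 4)

# Slicing and shaping produce views into the same memory.
col = a[:, 1]                       # second column: [1, 5, 9]

# Fast math (ufuncs): vectorization -- no explicit Python loop.
doubled = a * 2

# Broadcasting: a (3, 4) array combined with a (4,) array.
shifted = a + np.array([10, 20, 30, 40])

# Aggregations: reductions along an axis.
row_sums = a.sum(axis=1)            # [6, 22, 38]
```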
12. Brief History of NumPy
Person                                     Package        Year
Jim Fulton                                 Matrix Object  1994
Jim Hugunin                                Numeric        1995
Perry Greenfield, Rick White, Todd Miller  Numarray       2001
Travis Oliphant                            NumPy          2005
13. NumPy was created to unify array objects in Python and unify the PyData community
Numeric + Numarray -> NumPy
I started this unification project and ended up sacrificing my tenure at a university to write and release NumPy.
21. Not all open-source is the same!
Community-Driven Open Source Software (CDOSS)
• Anyone can become the leader.
• Multiple stakeholders.
• Community size is a useful gauge of health.
• Users become contributors more often.
• Examples: Jupyter, NumPy, SciPy, Pandas
Company-Backed Open Source Software (CBOSS)
• You need to work at the company to be the leader.
• Many users, fewer developers.
• To judge health, you need to understand the company's incentives.
• Examples: TensorFlow, PyTorch, Conda
Both governance models can be valuable, but they have different implications!
22. Huge Impact (from diverse efforts of 1000s)
• LIGO: Gravitational Waves
• Higgs Boson Discovery
• Black Hole Imaging
24. Example — Amazon Photo
Automatic facial recognition; user feedback on face names updates the model.
25. A neural network with several layers, trained with ~130,000 images, matched trained dermatologists with 91% area under the sensitivity-specificity curve.
Keys:
• Access to Data
• Access to Software
• Access to Compute
27. (Three foundational projects, identified by logo on the slide)
• Downloads: 49 Million · Estimated Cost: $7.57 Million · Contributors: 866 · Estimated Effort: 76 person-years · Current Maintainers: 3
• Downloads: 27.7 Million · Estimated Cost: $7 Million · Contributors: 1,666 · Estimated Effort: 70 person-years · Current Maintainers: 3
• Downloads: 13.8 Million · Estimated Cost: $6.63 Million · Contributors: 860 · Estimated Effort: 64 person-years · Current Maintainers: 2
Development of these projects began in 2003, 2005, and 2008.
The original developers were not paid to work on or improve these libraries!
28. OSS Sustainability
• Developers get “burned-out” when many
people use their tools but there is no
money to maintain or improve them.
• Developers can live unbalanced lives.
• Multi-billion dollar companies are
benefiting from volunteer labor and not
giving back.
• Foundational libraries are not maintained
and key insights from creators don’t get
back into the code.
29. For example: Here was my list for NumPy in 2012
• NDArray improvements
– Indexes (esp. for structured arrays)
– SQL front-end
– Multi-level, hierarchical labels
– Selection via mappings (labeled arrays)
– Memory spaces (array made up of regions)
– Distributed arrays (global array)
– Compressed arrays
– Standard distributed persistence
– Fancy indexing as view, and optimizations
– Streaming arrays
• Dtype improvements
– Enumerated types (including dynamic enumeration)
– Derived fields
– Specification as a class (or JSON)
– Pointer dtype (i.e. C++ object, or varchar)
– Finishing datetime
– Missing data with both bit-patterns and masks
– Parameterized field names
• Ufunc improvements
– Generalized ufuncs supporting more than just contiguous arrays
– Specification of ufuncs in Python
– Move most dtype “array functions” to ufuncs
– Unify error-handling for all computations
– Allow lazy evaluation and remote computation (streaming and generator data)
– Structured and string dtype ufuncs
– Multi-core and GPU-optimized ufuncs
– Group-by reduction
30. Multiple other unrealized epiphanies…
• In 2014, I finally realized how I should have built dtypes: inheriting from a new “meta-type”, so that all NumPy “dtypes” are actually real Python types. This would have eliminated the need for the “ugly” array-scalars (which are semantically necessary in the current system).
• NumPy should have a smaller interface API that other array libraries could implement, instead of the entire API becoming a de facto array API.
• GPU and parallel-executing ufuncs should be built in.
• Apply-by and reduce-by should be NumPy functions.
I’ve never received budget to work on NumPy or SciPy (until this year, with a CZI grant from Facebook). Part of this is because I pursued other entrepreneurial mechanisms to generate resources, but part of this is because granting mechanisms are not set up to “maintain” community-driven open-source software.
31. Major Conclusions:
1. Python is the “Lingua Franca” for technical computing and machine learning / AI.
2. Python reached this status because it embraced array-oriented computing (NumPy and Pandas).
3. “Emergent” community-driven open source has a sustainability problem.
We (basically) realized our ultimate goal when we started SciPy in 1999!
But we are still searching for the means to sustain it.
32. What is array-oriented computing?
• Organize data together logically (and in memory)
• Operate on “chunks” at a time with high-level operations: map, join, reduce, transform, apply, filter
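The chunk-at-a-time style can be sketched in a few lines of NumPy (my example, not from the slides): the "map", "filter", and "reduce" steps each touch the whole array in one high-level call, with no explicit Python loop.

```python
import numpy as np

# One million samples, organized together in memory.
x = np.linspace(0.0, 1.0, 1_000_000)

# "map"/"transform": apply an operation to the whole chunk at once.
y = np.sin(2 * np.pi * x)

# "filter": a boolean mask selects elements without a Python loop.
positive = y[y > 0]

# "reduce": aggregate the chunk in a single call.
mean_positive = positive.mean()   # ~2/pi for the positive half of a sine
```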
35. Benefits of Array-oriented
• Many technical problems are naturally array-oriented (easy to vectorize)
• Algorithms can be expressed at a high level
• These algorithms can be parallelized more simply (quite often much information is lost in the translation to typical “compiled” languages)
• Array-oriented algorithms map well to modern hardware caches and pipelines.
• The software stack is now starting to re-focus as ML frameworks emerge.
• There is a reason Fortran remains popular.
39. Quick History
life←{↑1 ⍵∨.∧3 4=+/,¯1 0 1∘.⊖¯1 0 1∘.⌽⊂⍵}

Year  Language/Package   Person
1966  APL                Ken Iverson (IBM)
1984  APL2               IBM
1990  J                  Ken Iverson
1993  K -> Q             Arthur Whitney (used by KDB)
2019  K (new version)    Arthur Whitney
1996  Numeric (Python)   Jim Hugunin
2006  NumPy              Travis Oliphant
2012  Numba              Siu Kwan Lam

Lineage: APL -> J, K, Matlab -> Numeric -> NumPy
40. Putting Science back in Comp Sci
• Much of the software stack is for systems programming: C++, Java, .NET, ObjC, web
• This has been great for desktop computing but terrible for science:
– Complex numbers?
– Vectorized primitives?
– Multidimensional arrays?
• Array-oriented programming was supplanted by object-oriented programming
• The software stack for scientists was not as helpful as it should have been
• Fortran is still where many scientists ended up
• In the past 5 years this has been changing with the emergence of Python, Jupyter, Pandas, PyTorch (we still have a long way to go).
45. Conway’s game of Life
• Dead cell with exactly 3 live neighbors
will come to life
• A live cell with 2 or 3 neighbors will
survive
• With too few or too many neighbors, the
cell dies
47. Conway’s Game of Life: APL vs. NumPy
Initialization and update step in one line of APL:
life←{↑1 ⍵∨.∧3 4=+/,¯1 0 1∘.⊖¯1 0 1∘.⌽⊂⍵}
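The slide showed the NumPy side only as an image. Here is a sketch of the update step using the same rotate-and-sum idea as the APL one-liner (`np.roll` plays the role of APL's ⊖ and ⌽ rotations); the rules encoded are the ones from slide 45:

```python
import numpy as np

def life_step(grid):
    """One update of Conway's Game of Life on a toroidal boolean grid.

    Counts the 8 neighbors of every cell at once by summing shifted
    copies of the grid, then applies the birth/survival rules as
    whole-array boolean expressions.
    """
    neighbors = sum(np.roll(np.roll(grid, i, axis=0), j, axis=1)
                    for i in (-1, 0, 1) for j in (-1, 0, 1)
                    if (i, j) != (0, 0))
    # Birth: a dead cell with exactly 3 live neighbors comes to life.
    # Survival: a live cell with 2 or 3 neighbors survives.
    return (neighbors == 3) | (grid & (neighbors == 2))

# A "blinker" oscillates with period 2.
blinker = np.zeros((5, 5), dtype=bool)
blinker[2, 1:4] = True
assert np.array_equal(life_step(life_step(blinker)), blinker)
```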
48. Zen of NumPy
• strided is better than scattered
• contiguous is better than strided
• descriptive is better than imperative
• array-oriented is better than object-oriented
• broadcasting is a great idea
• vectorized is better than an explicit loop
• unless it’s too complicated or uses too much memory ---
then use Numba
• think in higher dimensions
Inspired by Tim Peters and “import this”
49. What is good about NumPy?
• Array-oriented
• Extensive Dtype System (including structures)
• C-API
• Simple to understand data-structure
• Memory mapping
• Syntax support from Python
• Large community of users
• Broadcasting
• Easy to interface C/C++/Fortran code
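Memory mapping, one of the strengths listed above, deserves a concrete look: the array lives in a file and the OS pages data in on demand, so arrays larger than RAM remain usable. A minimal sketch using a small temporary file (my example, not from the slides):

```python
import os
import tempfile

import numpy as np

# Create a file-backed array and fill it.
path = os.path.join(tempfile.mkdtemp(), "big.dat")
m = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000, 1000))
m[:] = 1.0
m.flush()   # make sure the data reaches the file

# Reopen read-only; slicing touches only the pages it needs.
m2 = np.memmap(path, dtype=np.float64, mode="r", shape=(1000, 1000))
chunk_sum = m2[500, :5].sum()   # 5.0 -- five ones
```

The same interface scales to multi-gigabyte files without changing the code.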
50. What is wrong with NumPy
• Dtype system is difficult to extend
• Immediate mode creates huge temporaries
• “Almost” an in-memory database comparable to SQLite (missing indexes)
• Weak integration with sparse arrays
• Lots of un-optimized parts
• Minimal support for multi-core / GPU
• Code-base is organic and hard to extend
• Tied to the CPython run-time (doesn’t work on other Python implementations)
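The "huge temporaries" complaint can be seen directly. A minimal sketch (my example, not from the slides) of the same computation written both ways:

```python
import numpy as np

a = np.ones(1_000_000)
b = np.ones(1_000_000)

# Immediate mode: every operator allocates a full-size temporary,
# so this one expression materializes (2*a), then (2*a + b), and
# then the final result -- three million-element buffers.
r1 = (2 * a + b) / 2

# The temporaries can be avoided by reusing one buffer via the
# `out=` argument of ufuncs, at some cost in readability.
r2 = np.multiply(a, 2)
np.add(r2, b, out=r2)
np.divide(r2, 2, out=r2)

assert np.array_equal(r1, r2)   # both are all 1.5
```

Projects like Numba and numexpr attack this problem by fusing whole expressions into a single loop.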
52. How I got involved…
Getting data into memory — fast!
• TableIO, Michael A. Miller, April 1998
• Reference Counting Essay, Guido van Rossum, May 1998 (http://www.python.org/doc/essays/refcnt/)
• NumPyIO, June 1998
53. How SciPy started…
Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999.
In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.

Milestone                                   Date
Gaussian quadrature                         5 Jan 1999
cephes 1.0                                  30 Jan 1999
sigtools 0.40                               23 Feb 1999
Numeric docs                                March 1999
cephes 1.1                                  9 Mar 1999
multipack 0.3                               13 Apr 1999
Helper routines                             14 Apr 1999
multipack 0.6 (leastsq, ode, fsolve, quad)  29 Apr 1999
sparse plan described                       30 May 1999
multipack 0.7                               14 Jun 1999
SparsePy 0.1                                5 Nov 1999
cephes 1.2 (vectorize)                      29 Dec 1999
54. Joined with others…
Started as Multipack in 1998 and became
SciPy in 2001 with the help of other
colleagues
115 releases, 815 contributors
Used by: 156,525
55. Don’t underestimate the importance of Team!
Anaconda success also depended on going from individual to a team
>700 contributors
56. Other People Matter
Use your brain to adapt to other people — this is why your brain is so big!
(The Social Brain Hypothesis and Human Evolution, Robin I. M. Dunbar)
Hypothesis: You carry and update “models of people” in your head, from very detailed to approximate. Dunbar numbers!
Know your model is incomplete:
• See people as “ends”, not your “means”.
• Believe in, love, and trust other people.
57. Keep an Open Mind: Be open to critique
The PEP 3118 debate over how to describe memory: dtype vs. ctypes.
Current me disagrees with past me! I am glad there were others in the debate.
64. What is next? What am I working on for the next 20 years…
65. Technology and Economic problems
1. General interoperability — low-level libraries that reduce silos of data and analysis
2. Better High-level APIs (more interfaces in Python supported by multiple
implementations)
3. Data Management — in particular Data Catalogues
4. Fixing Python’s Extension problem (the ecosystem helped Python grow but is also an anchor on its progress)
5. How to connect the trillions of dollars of market capital to the innovation available in
global, emergent, open-source communities.
66. High Level APIs for Arrays (Tensors),
DataFrames, and DataTypes
68. What will work!
• Create a statically typed subset
of Python that is then used to
extend Python — EPython
• Port NumPy, SciPy, Scikits to
EPython (borrow heavily from
Cython ideas but use mypy-style
typing instead of new syntax).
69. LABS
Sustaining the Future
Open-source innovation and maintenance around the entire data-science and AI workflow.
• NumPy ecosystem maintenance (PyData Core Team)
• Improve connection of NumPy to ML Frameworks
• GPU Support for NumPy Ecosystem
• Improve foundations of Array computing
• JupyterLab and JupyterHub
• Data Catalog standards
• Packaging (conda-forge, PyPA, etc.)
Projects:
• PySparse — sparse n-d arrays
• Ibis — Pandas-like front-end to SQL
• uarray — unified array interface for SciPy refactor
• xnd — re-factored NumPy (low-level cross-language libraries for N-D (tensor) computing)
• Bokeh
Collaborating with NumFOCUS!
Adapted from Jake Vanderplas, PyCon 2017 Keynote
70. Build and Connect Companies and Communities to Solve Challenging Problems with Data
Enables me to keep working on array-computing problems *and* the meta-problem of open-source funding.
71. Three Activities with One Mission
Services: Complete open-source service consulting in the PyData / NumFOCUS ecosystem, including data-science and ML. We provide part-time CTO work, custom software, staff augmentation, support, training, staffing, and mentoring.
Open Source Lab: Open Source Research Lab supporting the NumFOCUS and PyData community. Hiring developers, evangelists, tech writers, designers, and product managers for open-source projects.
Venture Fund: Early-stage funding to companies that provide return to investors and support open-source ecosystems with industry-disrupting products and services.
72. Some of the projects we support (identified by logo on the slide):
• Fast foundational N-D array (tensor) object for Python
• Extensive library of functions for NumPy
• GPU-enabled compiler for NumPy/Python
• Parallel and scaled Pandas and NumPy
• DataFrames for general data manipulation and statistics
• Notebook environments for rapid development and data analysis
• Desktop IDE for data science and ML
• Rapid development of dashboards for the Python/PyData ecosystem
• Easy and fast web-based interactive plots using Python
• Turn even very large datasets into images, accurately
• General sparse arrays for Python
• Cross-language libraries for array computing
• General and powerful symbolic mathematical library
• Very popular and powerful machine-learning library
73. An early-stage venture capital firm investing in startups that build on open-source technology and support the communities they depend on (11 companies).
A $20m fund, supporting FairOSS.
74. Problem
Open Source Teams:
• Burned out
• Underrepresented
• Underpaid
Organizations:
• Disconnected from the community
• Lack support and maintenance
There’s no easy way to connect the community with organizations.
75. Open Source Marketplace
Managing Partners:
• Provide open-source services
• Training / support
• Feature development / fixes
Funding Partners:
• Hire from the community
• Collectively fund
• Get the support they need to build effectively on open source
Open-source contributors create profiles for themselves and their projects and participate as actors in the market.
77. FairOSS
A Public Benefit Company (goal: growing the amount of freely available software)
• Owned by open-source contributors (will be doing a public fund-raise later this year)
• Those shareholders govern the organization (elect the board)
• The board appoints management and decides what is “fair”
Holds companies accountable:
• Allows usage of its trademarks only for companies that contribute back “fairly”
• Think “Kosher” or “Organic” labeling
• Companies give back via equity, revenue, and “in-kind” agreements with FairOSS
FairOSS is custodian of Revenue and Equity Agreements:
• Equity agreements mean that FairOSS holds shares, options, or warrants of the company (most companies are missing the open-source community from their ‘cap table’)
• Revenue agreements mean that companies pay FairOSS a portion of their revenue
• FairOSS distributes almost all of the proceeds from these agreements to the open-source communities
If successful, this would make open source investable and make available >$23 trillion of investment capital to open-source communities.
78. You can really change the world…
With Open Source Communities…
Let’s do more of that!