3. Maintenance Problem: Funding for Community Devs
[Figure: project logos labeled with full-time developer counts: 2, 0, 1/2]
Open source is too important to be left to volunteer time alone. The current situation is not working to sustain millions of users:
• No funding for creators of these libraries to continue their work
• GPU support could have been added to NumPy years ago
• SciPy took 17 years to hit 1.0
• NumPy should already be at 2.0 — but not without full-time guidance and leadership
[Figure, continued: full-time developer counts: 2, 0]
4. Company
2012 - Created Two Orgs for Sustainability
Community
Enterprise software company initially built on services and supporting open source.
Became
5. Quansight — continuing Continuum momentum
Replaced by
Spin Out
Incubate
2012
2018
?
?
Key: members of the management team were at Continuum Analytics, which became Anaconda, our first (spin-out) company.
2015
2019 and beyond…
6. Build and Connect Companies and Communities to Solve Challenging Problems with Data
Continuing my quest to find more ways to pay developers to work on open source!
7. Open Source Directions
Webinar series to promote and give accessible publicity to what community developers are thinking about.
8. LABS
Sustaining the Future
Open-source innovation and maintenance around the entire data-science and AI workflow.
• Hire and fund a “PyData Core Team”
• GPU Support for NumPy Ecosystem
• Improve foundations of Array computing
• JupyterLab development and plugins
• Data Catalog standards and demos
• Packaging (conda-forge, PyPA, etc.)
• Cross Language Integration
uarray: unified array interface and “symbolic” NumPy
xnd: refactored NumPy (low-level cross-language libraries for N-D (tensor) computing)
Partnered with NumFOCUS and Ursa Labs (supporting Apache Arrow)
Bokeh
Adapted from Jake VanderPlas, PyCon 2017 Keynote
http://quansight.com/labs
16. We have a “divided” community again!
Numeric
Numarray
NumPy
17. Real problem is packages have little re-use
FastAI
skorch
Pyro
Eduard
anyrl
Braid
PyMC4
MLFlow
torchdiffeq
18. Two additional efforts in 2006
Buffer Protocol (PEP 3118)
Way for all Python objects to share memory using NumPy-like data structures (strided memory layout with a shape). “memoryview”
Type system not solved at the time (punted to the struct module syntax extended with character codes)
(“I 2s f”) == dtype(‘u4, 2S, f’)
__array_interface__
Protocol approach. Any object can define this attribute to explain how it could be interpreted as an array. Still tied to NumPy structure (strided layout)
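The memory-sharing idea behind the buffer protocol can be sketched with the Python standard library alone. This is a minimal illustration, not code from the talk: the variable names are made up, and the "=" prefix on the struct format is an assumption added here to avoid platform-dependent padding.

```python
import struct

# A flat, mutable block of 12 bytes holding 0, 1, ..., 11.
buf = bytearray(range(12))

# memoryview goes through the buffer protocol (PEP 3118): no bytes are copied.
flat = memoryview(buf)
grid = flat.cast("B", shape=[3, 4])   # same memory, viewed as a 3x4 array

shape = grid.shape        # logical shape of the view
strides = grid.strides    # bytes to step in each dimension
before = grid[1, 2]       # element at row 1, column 2

buf[6] = 99               # write through the original object...
after = grid[1, 2]        # ...and the view sees the new value

# struct-style codes describe one record: uint32, 2-byte string, float32.
# "=" requests standard sizes without alignment padding (assumption for portability).
record_format = "=I 2s f"
record_size = struct.calcsize(record_format)
packed = struct.pack(record_format, 7, b"ab", 1.5)
unpacked = struct.unpack(record_format, packed)
```

The view and the bytearray share one block of memory, which is why the write through `buf` is visible through `grid`.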
19. What if we revisit these earlier efforts
Buffer Protocol (PEP 3118)
__array_interface__
Cross-language buffer protocol plus NumPy-like math libraries
uarray
New project to formalize and generalize the array protocol for Python so that downstream projects can depend on an interface (rather than on a single array implementation)
20. NumPy’s Key Parts
dtype
Description of what is “in the array”: a data-description language, but missing key primitives (pointer, missing-data types, categoricals, new float types, etc.). Strictly extensible, but not easily.
ndarray
Pointer to data described by “dtype”, with shape and strides information and powerful “indexing” capabilities. The innovation was the ability to map to any memory pointer that you could describe via the dtype “language” and then “slice and dice”.
umath
Math and functions for arrays. Started as “scalar” kernels (ufuncs) that are applied over the array. D. E. Shaw added “generalized ufuncs”, which allow the kernel applied over the array to involve “inner dimensions” (e.g. dot, cholesky, svd, or argmax can be a kernel).
Mapping pointers to the start of a data structure you can describe with dtype, and then applying (generalized) ufuncs, is the essence of array-oriented computing.
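To make the "pointer plus shape and strides, then apply a kernel" idea concrete, here is a toy pure-Python sketch. It is not NumPy's implementation: `StridedView` and `apply_scalar_kernel` are hypothetical names, strides are counted in elements rather than bytes, and only 2-D views are handled.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class StridedView:
    """Hypothetical toy array: flat storage plus offset, shape, and strides."""
    data: List[float]
    shape: Tuple[int, int]
    strides: Tuple[int, int]   # counted in elements, not bytes
    offset: int = 0

    def item(self, index: Sequence[int]) -> float:
        # Position in flat storage = offset + sum(index[k] * strides[k]).
        pos = self.offset + sum(i * s for i, s in zip(index, self.strides))
        return self.data[pos]

    def transpose(self) -> "StridedView":
        # "Slice and dice" without copying: only the metadata changes.
        return StridedView(self.data, self.shape[::-1], self.strides[::-1], self.offset)

    def tolist(self) -> List[List[float]]:
        rows, cols = self.shape
        return [[self.item((r, c)) for c in range(cols)] for r in range(rows)]


def apply_scalar_kernel(kernel: Callable[[float], float], view: StridedView) -> StridedView:
    """Apply a scalar kernel to every element, in the style of a ufunc."""
    rows, cols = view.shape
    flat = [kernel(view.item((r, c))) for r in range(rows) for c in range(cols)]
    return StridedView(flat, (rows, cols), (cols, 1))


a = StridedView([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], shape=(2, 3), strides=(3, 1))
a_t = a.transpose()                                   # shares storage with a
doubled = apply_scalar_kernel(lambda x: 2 * x, a_t)   # new storage, new values
```

The transpose reuses the same list object and only swaps shape and strides, while the kernel is written once for scalars and applied over whatever layout the view describes.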
21. libndtypes libgumath libxnd
C libraries with defined API/ABI
Language bindings (Python, Ruby, …): ndtypes, gumath, xnd
Generalization of dtype. Description of “any” container
Generalization of NumPy array container and universal functions (arbitrary kernels applied over the data)
Need: C++, Scala, Node, F#, C#, Go, Java
Not a NumPy replacement — but could be used by NumPy!
22. XND is a generalization of Arrow: you could describe an Arrow container with XND
Much as pandas columns are NumPy arrays.
23. Unified (or Universal) Array Interface
Need to fix the “string / bytes” problem of the array world!
Logical array vs. strided-pointer of numpy
“uarray” interface
……
CuPy
24. Big Hairy Audacious Goal (BHAG)
Enhance the Array ecosystem (initially for Python) with an abstract interface
that downstream libraries can use (with a concrete interface based on xnd).
• Reuse as much of the existing ecosystem as possible.
• Easily allow multiple implementations of an array (sparse,
hardware-backed, delayed) with a common interface.
• Libraries (e.g. SciPy and PyData) that depend only on the
interface could be compiled down to hardware or use a backend
runtime.
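One way to picture the goal above is a library function written against a small interface, with dense, sparse, and delayed arrays all satisfying it. The sketch below is only an illustration: `Array2D`, `Dense`, `Sparse`, `Delayed`, and `trace` are hypothetical names and are not part of uarray or xnd.

```python
from typing import Callable, Dict, List, Protocol, Tuple


class Array2D(Protocol):
    """Hypothetical minimal interface: a shape plus element lookup."""
    shape: Tuple[int, int]

    def get(self, row: int, col: int) -> float: ...


class Dense:
    """Stores every element."""
    def __init__(self, rows: List[List[float]]):
        self.rows = [list(r) for r in rows]
        self.shape = (len(self.rows), len(self.rows[0]))

    def get(self, row: int, col: int) -> float:
        return self.rows[row][col]


class Sparse:
    """Stores only the non-zero entries."""
    def __init__(self, shape: Tuple[int, int], entries: Dict[Tuple[int, int], float]):
        self.shape = shape
        self.entries = dict(entries)

    def get(self, row: int, col: int) -> float:
        return self.entries.get((row, col), 0.0)


class Delayed:
    """Computes each element on demand."""
    def __init__(self, shape: Tuple[int, int], fn: Callable[[int, int], float]):
        self.shape = shape
        self.fn = fn

    def get(self, row: int, col: int) -> float:
        return self.fn(row, col)


def trace(a: Array2D) -> float:
    """Library code that depends only on the interface, not on a concrete array."""
    return sum(a.get(i, i) for i in range(min(a.shape)))


dense = Dense([[1.0, 2.0], [3.0, 4.0]])
sparse = Sparse((2, 2), {(0, 0): 1.0, (1, 1): 4.0})
delayed = Delayed((2, 2), lambda r, c: float(2 * r + c + 1))
traces = [trace(dense), trace(sparse), trace(delayed)]
```

`trace` never inspects how the elements are stored, so adding another backend (for example a hardware-backed one) would not require changing it.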
25. Collaboration with Mathematics
Apply reduction rules from the “Mathematics of Arrays” to code that uses the array_interface.
Lenore Mullin worked with Ken Iverson on APL and has since developed a formal mathematics of arrays that shows how arbitrary array-based calculations (based on the Psi function) can be consistently defined, simplified, and formalized so that they can be implemented optimally on arbitrary hardware.
https://www.researchgate.net/profile/Lenore_Mullin
Tensors and n-d Arrays: A Mathematics of Arrays (MoA), psi-Calculus and the Composition of Tensor and Array Operations
https://arxiv.org/abs/0907.0796
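The following sketch is only loosely in the spirit of such reduction rules and does not use MoA notation. It assumes a 1-D array can be modeled as a length plus an index function, so that composed operations reduce to index arithmetic with no intermediate arrays. All names (`from_list`, `drop`, `reverse`, `take`, `materialize`) are hypothetical.

```python
from typing import Callable, List, Tuple

# A 1-D "array" modeled as (length, psi), where psi maps an index to an element.
Lazy = Tuple[int, Callable[[int], int]]


def from_list(values: List[int], reads: List[int]) -> Lazy:
    def psi(i: int) -> int:
        reads.append(i)          # record which source elements are touched
        return values[i]
    return (len(values), psi)


def drop(k: int, a: Lazy) -> Lazy:
    n, psi = a
    # Index rule: drop(k, A)[i] == A[i + k]
    return (max(n - k, 0), lambda i: psi(i + k))


def reverse(a: Lazy) -> Lazy:
    n, psi = a
    # Index rule: reverse(A)[i] == A[n - 1 - i]
    return (n, lambda i: psi(n - 1 - i))


def take(k: int, a: Lazy) -> Lazy:
    n, psi = a
    # Index rule: take(k, A)[i] == A[i] for 0 <= i < min(k, n)
    return (min(k, n), psi)


def materialize(a: Lazy) -> List[int]:
    n, psi = a
    return [psi(i) for i in range(n)]


source = [10, 20, 30, 40, 50]
reads: List[int] = []
expr = take(2, reverse(drop(1, from_list(source, reads))))
result = materialize(expr)       # only now are source elements read
```

Building `expr` touches no data. Materializing it reads just the two source elements the composed index rules ask for, whereas the eager version `source[1:][::-1][:2]` builds three intermediate lists.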
27. Current NumPy (API is huge…)
• Generalized ufuncs on top of this including Segmentation (Grouping) and
reduction
• Input/Output Rules for reducing and simplifying functions
• Method for defining pipelines of functions (with automatic differentiation)
NumPy API
            Compute/Transform  Creation/Reading  Reporting/Output  Indexing/Subsetting  MetaData/Attributes  Other  Total
Functions   33                 7                 6                 12                   11                   2      71
Methods     226                170               22                38                   21                   68     545
28. What is an array (or tensor)?
Fundamental concepts:
• shape (a named tuple)
• a function that takes a tuple of indexes and returns another array (Psi function)
• a “dtype” or “memory-type” (what the elements are)
• Math that works with arrays.
Other important concepts:
• for each dimension an “index” mapping from index space to 0..N-1 (labels)
• Data pointer (including device ID)
• Slicing, sub-selection, and indexing capability
• conversion from (0-d array) to Python scalar type
• Optional bit-array for masking missing data
• Functions for concatenation
• Functions for creating and filling the array (from a file, from a string, from Python objects, from ODBC)
Core API that might be necessary
29. First part of the general Idea
__uarray__ -> returns an object that implements the array interface
uarray interface (strawperson phase…):
required
__u_psi__ : function mapping from a sequence of integers to an mtype
__u_shape__: a named tuple showing the shape of the uarray (or None if unknown)
__u_mtype__ : What this array contains: The Python type object in each element of the array
__u_attr__ : named tuple of attributes (version, ndim, jagged, strided, c-like, f-like, …)
optional
__u_llvm__ : return named tuple of llvm snippets for psi function
__u_llfuncs__ : return named tuple of low-level function pointers
__u_psi_dim__: function mapping from a sequence of integers to a __uarray__ one dimension smaller
__u_setelement__ : a function that sets an element of the array with an object of type mtype
__u_getelement__ : a function that gets
__u_fromiter__ : function to build an array from an iterator
__u_frombuffer__ : function to build a “gamma-based” uarray from a buffer
__u_concat__ : concatenate a sequence of __uarray__ objects along an axis
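As a rough illustration of the required part of this strawperson interface, the sketch below implements `__uarray__`, `__u_psi__`, `__u_shape__`, `__u_mtype__`, and `__u_attr__` on a toy object. The slide does not say whether these are methods or attributes, so that choice, the `Identity` class, the named-tuple fields, and the `to_nested_lists` consumer are all assumptions made here.

```python
from collections import namedtuple

# Hypothetical named tuples; the field names are illustrative only.
Shape = namedtuple("Shape", ["rows", "cols"])
Attr = namedtuple("Attr", ["version", "ndim", "jagged", "strided"])


class Identity:
    """Hypothetical array-like: an n x n identity matrix that stores no elements."""

    __u_mtype__ = int   # what each element of the array is

    def __init__(self, n):
        self.n = n

    def __uarray__(self):
        # Return the object that implements the interface (here, the object itself).
        return self

    def __u_psi__(self, index):
        # Map a sequence of integers to an element of type __u_mtype__.
        row, col = index
        return 1 if row == col else 0

    @property
    def __u_shape__(self):
        return Shape(rows=self.n, cols=self.n)

    @property
    def __u_attr__(self):
        return Attr(version=0, ndim=2, jagged=False, strided=False)


def to_nested_lists(obj):
    """Consumer that relies only on the required parts of the sketch interface."""
    ua = obj.__uarray__()
    rows, cols = ua.__u_shape__
    return [[ua.__u_psi__((r, c)) for c in range(cols)] for r in range(rows)]


eye = to_nested_lists(Identity(3))
```

The consumer never asks for a data pointer or strides: the shape and the psi function are enough to enumerate the array, which is why an object with no stored elements can still take part.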
30. Core C-API from NumPy
Core Array Container
PyArray_FromAny
PyArray_Shape
PyArray_New
PyArray_Fill
PyArray_Copy
PyArray_Take
PyArray_Put
PyArray_NDIM
PyArray_GETITEM
PyArray_SETITEM
…
DType
* EquivArrTypes
* GetItem
* SetItem
* CopySwapN
* CopySwap
* ScanFunc
* FromStr
* FillFunc
* meta-data
Basic Idea: Provide a place for these function-pointers in Python TypeObject
31. Start of a Proposal
Core Array Container
dtype Mtype
tp_as_ndarray
PyNDArrayMethods
Analogous to PySequenceMethods
Standardized function pointers for “bits” in an “element” of a data structure.
Inherit from PyHeapTypeObject