Using Anaconda to light up dark data. My talk given to the Berkeley Institute for Data Science describing Anaconda and the Blaze ecosystem for bringing a virtual analytical database to your data.
2. Started as a Scientist / Engineer
Images from BYU CERS Lab
3. Science led to Python
Raja Muthupillai
Armando Manduca
Richard Ehman
Jim Greenleaf
1997
$\rho_0 (2\pi f)^2\, U_i(a, f) = \left[ C_{ijkl}(a, f)\, U_{k,l}(a, f) \right]_{,j}$
$\Xi = \nabla \times U$
34. Blaze
interface to query data on different storage systems
http://blaze.pydata.org/en/latest/

from blaze import Data

iris = Data('iris.csv')                        # CSV
iris = Data('sqlite:///flowers.db::iris')      # SQL
iris = Data('mongodb://localhost/mydb::iris')  # MongoDB
iris = Data('iris.json')                       # JSON
iris = Data('s3://blaze-data/iris.csv')        # S3
…
Current focus is "dark data" and the PyData run-time stack (dask, dynd, numpy, pandas, x-ray, etc.) + customer needs (e.g. kdb, mongo).
36. blaze + datashape
Blaze uses datashape as its type system (like DyND)
>>> iris = Data('iris.json')
>>> iris.dshape
dshape("""var * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
}""")
37. Data Shape
datashape: a structured data description language
http://datashape.pydata.org/

A datashape combines dimensions (fixed sizes like 3 or 4, or the variable-length dimension var) with a dtype of unit types (string, int32, float64), joined by *:

    var * { x : int32, y : string, z : float64 }

This is a tabular datashape: a var dimension of records. A record is an ordered struct dtype, a collection of types keyed by labels:

    { x : int32, y : string, z : float64 }
39. Blaze Server — Lights up your Dark Data

Builds off of the Blaze uniform interface to host data remotely through a JSON web API.

server.yaml (YAML):

iriscsv:
  source: iris.csv
irisdb:
  source: sqlite:///flowers.db::iris
irisjson:
  source: iris.json
  dshape: "var * {name: string, amount: float64}"
irismongo:
  source: mongodb://localhost/mydb::iris

$ blaze-server server.yaml -e
localhost:6363/compute.json
55. dask bag
semi-structured data, like JSON blobs or log files
>>> import dask.bag as db
>>> import json
# Get tweets as a dask.bag from compressed json files
>>> b = db.from_filenames('*.json.gz').map(json.loads)
# Take two items in dask.bag
>>> b.take(2)
({u'contributors': None,
u'coordinates': None,
u'created_at': u'Fri Oct 10 17:19:35 +0000 2014',
u'entities': {u'hashtags': [],
u'symbols': [],
u'trends': [],
u'urls': [],
u'user_mentions': []},
u'favorite_count': 0,
u'favorited': False,
u'filter_level': u'medium',
u'geo': None …
# Count the frequencies of user locations
>>> freq = b.pluck('user').pluck('location').frequencies()
# Get the result as a dataframe
>>> df = freq.to_dataframe()
>>> df.compute()
0 1
0 20916
1 Natal 2
2 Planet earth. Sheffield. 1
3 Mad, USERA 1
4 Brasilia DF - Brazil 2
5 Rondonia Cacoal 1
6 msftsrep || 4/5. 1
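For readers without dask installed, what the pluck/frequencies pipeline above computes can be mimicked with the standard library. The tweet-like records below are hypothetical sample data, not the talk's dataset:

```python
# Pure-Python sketch of the dask.bag pipeline's semantics (no dask needed).
from collections import Counter

# Hypothetical tweet-like records standing in for the real JSON data.
tweets = [
    {"user": {"location": "Natal"}, "favorite_count": 0},
    {"user": {"location": ""}, "favorite_count": 2},
    {"user": {"location": "Natal"}, "favorite_count": 1},
]

# b.pluck('user').pluck('location') extracts record["user"]["location"];
# .frequencies() counts how often each distinct value occurs.
locations = [t["user"]["location"] for t in tweets]
freq = Counter(locations)

print(freq.most_common())  # [('Natal', 2), ('', 1)]
```

dask.bag evaluates the same logic lazily and in parallel across partitions, which is why `.compute()` (or `.to_dataframe()`) is needed to materialize a result.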
56. dask distributed
>>> import dask
>>> from dask.distributed import Client
# client connected to 50 nodes, 2 workers per node.
>>> dc = Client('tcp://localhost:9000')
# or
>>> dc = Client('tcp://ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:9000')
>>> b = db.from_s3('githubarchive-data', '2015-*.json.gz').map(json.loads)
# use default single node scheduler (top_commits is built from b; its definition is not shown here)
>>> top_commits.compute()
# use client with distributed cluster
>>> top_commits.compute(get=dc.get)
[(u'mirror-updates', 1463019),
(u'KenanSulayman', 235300),
(u'greatfirebot', 167558),
(u'rydnr', 133323),
(u'markkcc', 127625)]
57. blaze + dask
e.g. we can drive dask arrays with blaze.
>>> import dask.array as da
>>> x = da.from_array(...)            # Make a dask array
>>> from blaze import Data, log, compute
>>> d = Data(x)                       # Wrap with Blaze
>>> y = log(d + 1)[:5].sum(axis=1)    # Do work as usual
>>> result = compute(y)               # Fall back to dask
dask can be a backend/engine for blaze
61. Space of Python Compilation

                               Ahead Of Time              Just In Time
Relies on CPython / libpython  Cython, Shedskin,          Numba, HOPE,
                               Nuitka (today), Pythran    Theano, Pyjion
Replaces CPython / libpython   Nuitka (future)            Pyston, PyPy
66. Numba Features
• Numba supports:
  Windows, OS X, and Linux
  32- and 64-bit x86 CPUs and NVIDIA GPUs
  Python 2 and 3
  NumPy versions 1.6 through 1.9
• Does not require a C/C++ compiler on the user's system.
• < 70 MB to install.
• Does not replace the standard Python interpreter (all of your existing Python libraries are still available)
67. Numba Modes
• object mode: Compiled code operates on Python objects. The only significant performance improvement is compilation of loops that can be compiled in nopython mode (see below).
• nopython mode: Compiled code operates on "machine native" data. Usually within 25% of the performance of equivalent C or FORTRAN.
68. How to Use Numba
1. Create a realistic benchmark test case. (Do not use your unit tests as a benchmark!)
2. Run a profiler on your benchmark. (cProfile is a good choice.)
3. Identify hotspots that could potentially be compiled by Numba with a little refactoring. (See the rest of this talk and the online documentation.)
4. Apply @numba.jit and @numba.vectorize as needed to critical functions. (Small rewrites may be needed to work around Numba limitations.)
5. Re-run the benchmark to check if there was a performance improvement.
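Step 2 can be sketched with the standard-library cProfile. The `hotspot` function below is a hypothetical stand-in for a real workload:

```python
# Profile a benchmark to find hotspots worth handing to Numba.
import cProfile
import io
import pstats

def hotspot(n):
    # A tight numerical loop: a typical candidate for @numba.jit.
    total = 0.0
    for i in range(n):
        total += (i % 7) * 0.5
    return total

def benchmark():
    return hotspot(100_000)

profiler = cProfile.Profile()
profiler.enable()
result = benchmark()
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue().splitlines()[0])
```

Functions dominating cumulative time in the report are the ones worth refactoring for step 3.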
69. A Whirlwind Tour of Numba Features
• Sometimes you can't create a simple or efficient array expression or ufunc. Use Numba to work with array elements directly.
• Example: Suppose you have a boolean grid and you want to find the maximum number of neighbors a cell has in the grid:
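The slide's code for this example is not reproduced in the transcript. A sketch of what such an element-wise kernel might look like (function name and test grid are my own, not the talk's):

```python
# Count, for each cell, how many of its 8 neighbors are True; return the max.
try:
    from numba import jit
except ImportError:
    def jit(**kwargs):                 # no-op fallback when Numba is absent
        def decorator(func):
            return func
        return decorator
import numpy as np

@jit(nopython=True)
def max_neighbors(grid):
    rows, cols = grid.shape
    best = 0
    for i in range(rows):
        for j in range(cols):
            count = 0
            for di in range(-1, 2):
                for dj in range(-1, 2):
                    if di == 0 and dj == 0:
                        continue       # skip the cell itself
                    ni = i + di
                    nj = j + dj
                    if 0 <= ni < rows and 0 <= nj < cols and grid[ni, nj]:
                        count += 1
            if count > best:
                best = count
    return best

grid = np.array([[True, False, True],
                 [False, True, False],
                 [True, False, True]])
print(max_neighbors(grid))  # 4 (the centre cell touches all four corners)
```

Written as plain NumPy this would need an awkward stencil expression; as an explicit loop it is both readable and fast once compiled.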
71. The Basics
[Code figure not reproduced in this transcript; it demonstrated:]
• Array allocation
• Looping over ndarray x as an iterator
• Using numpy math functions
• Returning a slice of the array
• Numba decorator (nopython=True not required)
2.7x speedup!
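Since the slide's code image is missing, here is a hedged sketch touching the same bullets (function name and data are illustrative, not the original):

```python
# Array allocation, iteration, a NumPy math function, and a returned slice.
try:
    from numba import jit
except ImportError:
    def jit(func):                     # no-op fallback when Numba is absent
        return func
import numpy as np

@jit                                   # bare decorator: nopython=True not required
def trimmed_sqrt(x):
    out = np.empty(len(x))             # array allocation
    i = 0
    for xi in x:                       # looping over ndarray x as an iterator
        out[i] = np.sqrt(xi)           # using a numpy math function
        i += 1
    return out[1:-1]                   # returning a slice of the array

x = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
print(trimmed_sqrt(x))                 # [1. 2. 3.]
```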
76. Case study: j0 from scipy.special
• scipy.special was one of the first libraries I wrote (in 1999)
• It extended the "umath" module by adding new "universal functions" to compute many scientific functions by wrapping C and Fortran libs.
• Bessel functions are solutions to a differential equation:

$x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2)\, y = 0, \qquad y = J_\alpha(x)$

$J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(n\tau - x \sin \tau)\, d\tau$
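A sketch (not the talk's actual vj0) of evaluating this integral numerically with a midpoint rule; the function name and step count are my own choices:

```python
import math
try:
    from numba import jit
except ImportError:
    def jit(**kwargs):                 # no-op fallback when Numba is absent
        def decorator(func):
            return func
        return decorator

@jit(nopython=True)
def bessel_j(n, x, steps=1000):
    # J_n(x) = (1/pi) * integral over [0, pi] of cos(n*tau - x*sin(tau)) d tau
    h = math.pi / steps
    total = 0.0
    for k in range(steps):
        tau = (k + 0.5) * h            # midpoint of each subinterval
        total += math.cos(n * tau - x * math.sin(tau))
    return total * h / math.pi

print(round(bessel_j(0, 0.0, 1000), 6))  # 1.0, since J_0(0) = 1
```

This is the sense in which the pure-Python version is "easy to experiment with": the integrand is a few lines you can edit and recompile on the fly.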
78. Result: equivalent to compiled code
In [6]: %timeit vj0(x)
10000 loops, best of 3: 75 us per loop
In [7]: from scipy.special import j0
In [8]: %timeit j0(x)
10000 loops, best of 3: 75.3 us per loop
But! Now code is in Python and can be experimented with
more easily (and moved to the GPU / accelerator more easily)!
79. Word starting to get out!

Recent numba mailing list reports describe experiments of a SciPy author who got a 2x speed-up by removing their Cython type annotations and surrounding the function with numba.jit (with a few minor changes needed to the code).

As soon as Numba's ahead-of-time compilation moves beyond the experimental stage, one can legitimately use Numba to create a library that you ship to others (who then don't need to have Numba installed, or just need a Numba run-time installed).

SciPy (and NumPy) would look very different if Numba had existed 16 years ago when SciPy was getting started… and you would all be happier.
81. Releasing the GIL

Many fret about the GIL in Python. With the PyData stack you often have multi-threaded code, and in the PyData stack we quite often release the GIL:
• NumPy does it
• SciPy does it (quite often)
• Scikit-learn (now) does it
• Pandas (now) does it when possible
• Cython makes it easy
• Numba makes it easy
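A sketch of what "Numba makes it easy" looks like with nogil=True. Without Numba installed the fallback still runs correctly, it just holds the GIL:

```python
import threading
import numpy as np
try:
    from numba import jit
except ImportError:
    def jit(**kwargs):                 # no-op fallback when Numba is absent
        def decorator(func):
            return func
        return decorator

@jit(nopython=True, nogil=True)        # compiled body runs without the GIL
def count_below(values, threshold):
    n = 0
    for v in values:
        if v < threshold:
            n += 1
    return n

data = np.arange(1_000_000, dtype=np.float64)
results = []
# Once the GIL is released, the two threads can execute count_below in parallel.
threads = [threading.Thread(target=lambda: results.append(count_below(data, 500_000.0)))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                         # [500000, 500000]
```

nogil only applies in nopython mode: the compiled body touches no Python objects, so it is safe to drop the lock.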
84. CUDA Python (in open-source Numba!)
CUDA development using Python syntax for optimal performance! You have to understand CUDA at least a little: writing kernels that launch in parallel on the GPU.
87. Other interesting things
• CUDA Simulator to debug your code in the Python interpreter
• Generalized ufuncs (@guvectorize)
• Call ctypes and cffi functions directly and pass them as arguments
• Preliminary support for types that understand the buffer protocol
• Pickle Numba functions to run on remote execution engines
• "numba annotate" to dump an HTML annotated version of compiled code
• See: http://numba.pydata.org/numba-doc/0.20.0/
88. What Doesn’t Work?
(A non-comprehensive list)
• Sets, lists, dictionaries, user defined classes (tuples do work!)
• List, set and dictionary comprehensions
• Recursion
• Exceptions with non-constant parameters
• Most string operations (buffer support is very preliminary!)
• yield from
• closures inside a JIT function (compiling JIT functions inside a closure works…)
• Modifying globals
• Passing an axis argument to numpy array reduction functions
• Easy debugging (you have to debug in Python mode).
89. The (Near) Future
(Also a non-comprehensive list)
• "JIT Classes"
• Better support for strings/bytes, buffers, and parsing use-cases
• More coverage of the NumPy API (advanced indexing, etc.)
• Documented extension API for adding your own types, low-level function implementations, and targets.
• Better debug workflows
90. Recently Added Numba Features
• A new GPU target: the Heterogeneous System Architecture, supported by AMD APUs
• Support for named tuples in nopython mode
• Limited support for lists in nopython mode
• On-disk caching of compiled functions (opt-in)
• A simulator for debugging GPU functions with the Python debugger on the CPU
• Can choose to release the GIL in nopython functions
• Many speed improvements
92. Conclusion
• Lots of progress in the past year!
• Try out Numba on your numerical and NumPy-related projects:
conda install numba
• Your feedback helps us make Numba better!
Tell us what you would like to see:
https://github.com/numba/numba
• Stay tuned for more exciting stuff this year…