Using Anaconda to light up dark data. My talk given to the Berkeley Institute for Data Science describing Anaconda and the Blaze ecosystem for bringing a virtual analytical database to your data.
2. Started as a Scientist / Engineer
Images from BYU CERS Lab
3. Science led to Python
Raja Muthupillai
Armando Manduca
Richard Ehman
Jim Greenleaf
1997
$\rho_0 (2\pi f)^2\, U_i(a, f) = \left[ C_{ijkl}(a, f)\, U_{k,l}(a, f) \right]_{,j}$
$\Xi = \nabla \times U$
34. Blaze
interface to query data on different storage systems
http://blaze.pydata.org/en/latest/

from blaze import Data

iris = Data('iris.csv')                        # CSV
iris = Data('sqlite:///flowers.db::iris')      # SQL
iris = Data('mongodb://localhost/mydb::iris')  # MongoDB
iris = Data('iris.json')                       # JSON
iris = Data('s3://blaze-data/iris.csv')        # S3
…
Current focus is "dark data" and the PyData run-time stack (dask, dynd, numpy, pandas, x-ray, etc.) + customer needs (e.g. kdb, mongo).
36. blaze + datashape
Blaze uses datashape as its type system (like DyND)
>>> iris = Data('iris.json')
>>> iris.dshape
dshape("""var * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
}""")
37. Data Shape
datashape: a structured data description language
http://datashape.pydata.org/

A datashape combines dimensions (fixed sizes like 3 or 4, or the variable-length dimension var) with a dtype of unit types (string, int32, float64), joined by *:

    var * { x : int32, y : string, z : float64 }

This is a tabular datashape: a var dimension of records. A record is an ordered struct dtype, a collection of types keyed by labels:

    { x : int32, y : string, z : float64 }
39. Blaze Server — Lights up your Dark Data

Builds off of the Blaze uniform interface to host data remotely through a JSON web API.

server.yaml (YAML):

iriscsv:
  source: iris.csv
irisdb:
  source: sqlite:///flowers.db::iris
irisjson:
  source: iris.json
  dshape: "var * {name: string, amount: float64}"
irismongo:
  source: mongodb://localhost/mydb::iris

$ blaze-server server.yaml -e
localhost:6363/compute.json
55. dask bag
semi-structured data, like JSON blobs or log files
>>> import dask.bag as db
>>> import json
# Get tweets as a dask.bag from compressed json files
>>> b = db.from_filenames('*.json.gz').map(json.loads)
# Take two items in dask.bag
>>> b.take(2)
({u'contributors': None,
u'coordinates': None,
u'created_at': u'Fri Oct 10 17:19:35 +0000 2014',
u'entities': {u'hashtags': [],
u'symbols': [],
u'trends': [],
u'urls': [],
u'user_mentions': []},
u'favorite_count': 0,
u'favorited': False,
u'filter_level': u'medium',
u'geo': None …
# Count the frequencies of user locations
>>> freq = b.pluck('user').pluck('location').frequencies()
# Get the result as a dataframe
>>> df = freq.to_dataframe()
>>> df.compute()
0 1
0 20916
1 Natal 2
2 Planet earth. Sheffield. 1
3 Mad, USERA 1
4 Brasilia DF - Brazil 2
5 Rondonia Cacoal 1
6 msftsrep || 4/5. 1
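For readers without dask installed, what the pluck/frequencies pipeline above computes can be mimicked with the standard library. The tweet-like records below are hypothetical sample data, not the talk's dataset:

```python
# Pure-Python sketch of the dask.bag pipeline's semantics (no dask needed).
from collections import Counter

# Hypothetical tweet-like records standing in for the real JSON data.
tweets = [
    {"user": {"location": "Natal"}, "favorite_count": 0},
    {"user": {"location": ""}, "favorite_count": 2},
    {"user": {"location": "Natal"}, "favorite_count": 1},
]

# b.pluck('user').pluck('location') extracts record["user"]["location"];
# .frequencies() counts how often each distinct value occurs.
locations = [t["user"]["location"] for t in tweets]
freq = Counter(locations)

print(freq.most_common())  # [('Natal', 2), ('', 1)]
```

dask.bag evaluates the same logic lazily and in parallel across partitions, which is why `.compute()` (or `.to_dataframe()`) is needed to materialize a result.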
56. dask distributed
>>> import dask
>>> from dask.distributed import Client
# client connected to 50 nodes, 2 workers per node.
>>> dc = Client('tcp://localhost:9000')
# or
>>> dc = Client('tcp://ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:9000')
>>> b = db.from_s3('githubarchive-data', '2015-*.json.gz').map(json.loads)
# use default single node scheduler (top_commits is built from b; its definition is not shown here)
>>> top_commits.compute()
# use client with distributed cluster
>>> top_commits.compute(get=dc.get)
[(u'mirror-updates', 1463019),
(u'KenanSulayman', 235300),
(u'greatfirebot', 167558),
(u'rydnr', 133323),
(u'markkcc', 127625)]
57. blaze + dask
e.g. we can drive dask arrays with blaze.
>>> import dask.array as da
>>> x = da.from_array(...)            # Make a dask array
>>> from blaze import Data, log, compute
>>> d = Data(x)                       # Wrap with Blaze
>>> y = log(d + 1)[:5].sum(axis=1)    # Do work as usual
>>> result = compute(y)               # Fall back to dask
dask can be a backend/engine for blaze
61. Space of Python Compilation

                               Ahead Of Time              Just In Time
Relies on CPython / libpython  Cython, Shedskin,          Numba, HOPE,
                               Nuitka (today), Pythran    Theano, Pyjion
Replaces CPython / libpython   Nuitka (future)            Pyston, PyPy
66. Numba Features
• Numba supports:
  Windows, OS X, and Linux
  32- and 64-bit x86 CPUs and NVIDIA GPUs
  Python 2 and 3
  NumPy versions 1.6 through 1.9
• Does not require a C/C++ compiler on the user's system.
• < 70 MB to install.
• Does not replace the standard Python interpreter (all of your existing Python libraries are still available)
67. Numba Modes
• object mode: Compiled code operates on Python objects. The only significant performance improvement is compilation of loops that can be compiled in nopython mode (see below).
• nopython mode: Compiled code operates on "machine native" data. Usually within 25% of the performance of equivalent C or FORTRAN.
68. How to Use Numba
1. Create a realistic benchmark test case. (Do not use your unit tests as a benchmark!)
2. Run a profiler on your benchmark. (cProfile is a good choice.)
3. Identify hotspots that could potentially be compiled by Numba with a little refactoring. (See the rest of this talk and the online documentation.)
4. Apply @numba.jit and @numba.vectorize as needed to critical functions. (Small rewrites may be needed to work around Numba limitations.)
5. Re-run the benchmark to check if there was a performance improvement.
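Step 2 can be sketched with the standard-library cProfile. The `hotspot` function below is a hypothetical stand-in for a real workload:

```python
# Profile a benchmark to find hotspots worth handing to Numba.
import cProfile
import io
import pstats

def hotspot(n):
    # A tight numerical loop: a typical candidate for @numba.jit.
    total = 0.0
    for i in range(n):
        total += (i % 7) * 0.5
    return total

def benchmark():
    return hotspot(100_000)

profiler = cProfile.Profile()
profiler.enable()
result = benchmark()
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue().splitlines()[0])
```

Functions dominating cumulative time in the report are the ones worth refactoring for step 3.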
69. A Whirlwind Tour of Numba Features
• Sometimes you can't create a simple or efficient array expression or ufunc. Use Numba to work with array elements directly.
• Example: Suppose you have a boolean grid and you want to find the maximum number of neighbors a cell has in the grid:
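The slide's code for this example is not reproduced in the transcript. A sketch of what such an element-wise kernel might look like (function name and test grid are my own, not the talk's):

```python
# Count, for each cell, how many of its 8 neighbors are True; return the max.
try:
    from numba import jit
except ImportError:
    def jit(**kwargs):                 # no-op fallback when Numba is absent
        def decorator(func):
            return func
        return decorator
import numpy as np

@jit(nopython=True)
def max_neighbors(grid):
    rows, cols = grid.shape
    best = 0
    for i in range(rows):
        for j in range(cols):
            count = 0
            for di in range(-1, 2):
                for dj in range(-1, 2):
                    if di == 0 and dj == 0:
                        continue       # skip the cell itself
                    ni = i + di
                    nj = j + dj
                    if 0 <= ni < rows and 0 <= nj < cols and grid[ni, nj]:
                        count += 1
            if count > best:
                best = count
    return best

grid = np.array([[True, False, True],
                 [False, True, False],
                 [True, False, True]])
print(max_neighbors(grid))  # 4 (the centre cell touches all four corners)
```

Written as plain NumPy this would need an awkward stencil expression; as an explicit loop it is both readable and fast once compiled.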
71. The Basics
[Code figure not reproduced in this transcript; it demonstrated:]
• Array allocation
• Looping over ndarray x as an iterator
• Using numpy math functions
• Returning a slice of the array
• Numba decorator (nopython=True not required)
2.7x speedup!
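Since the slide's code image is missing, here is a hedged sketch touching the same bullets (function name and data are illustrative, not the original):

```python
# Array allocation, iteration, a NumPy math function, and a returned slice.
try:
    from numba import jit
except ImportError:
    def jit(func):                     # no-op fallback when Numba is absent
        return func
import numpy as np

@jit                                   # bare decorator: nopython=True not required
def trimmed_sqrt(x):
    out = np.empty(len(x))             # array allocation
    i = 0
    for xi in x:                       # looping over ndarray x as an iterator
        out[i] = np.sqrt(xi)           # using a numpy math function
        i += 1
    return out[1:-1]                   # returning a slice of the array

x = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
print(trimmed_sqrt(x))                 # [1. 2. 3.]
```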
76. Case study: j0 from scipy.special
• scipy.special was one of the first libraries I wrote (in 1999)
• It extended the "umath" module by adding new "universal functions" to compute many scientific functions by wrapping C and Fortran libs.
• Bessel functions are solutions to a differential equation:

$x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2)\, y = 0, \qquad y = J_\alpha(x)$

$J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(n\tau - x \sin \tau)\, d\tau$
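A sketch (not the talk's actual vj0) of evaluating this integral numerically with a midpoint rule; the function name and step count are my own choices:

```python
import math
try:
    from numba import jit
except ImportError:
    def jit(**kwargs):                 # no-op fallback when Numba is absent
        def decorator(func):
            return func
        return decorator

@jit(nopython=True)
def bessel_j(n, x, steps=1000):
    # J_n(x) = (1/pi) * integral over [0, pi] of cos(n*tau - x*sin(tau)) d tau
    h = math.pi / steps
    total = 0.0
    for k in range(steps):
        tau = (k + 0.5) * h            # midpoint of each subinterval
        total += math.cos(n * tau - x * math.sin(tau))
    return total * h / math.pi

print(round(bessel_j(0, 0.0, 1000), 6))  # 1.0, since J_0(0) = 1
```

This is the sense in which the pure-Python version is "easy to experiment with": the integrand is a few lines you can edit and recompile on the fly.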
78. Result: equivalent to compiled code
In [6]: %timeit vj0(x)
10000 loops, best of 3: 75 us per loop
In [7]: from scipy.special import j0
In [8]: %timeit j0(x)
10000 loops, best of 3: 75.3 us per loop
But! Now code is in Python and can be experimented with
more easily (and moved to the GPU / accelerator more easily)!
79. Word starting to get out!

Recent numba mailing list reports describe experiments of a SciPy author who got a 2x speed-up by removing their Cython type annotations and surrounding the function with numba.jit (with a few minor changes needed to the code).

As soon as Numba's ahead-of-time compilation moves beyond the experimental stage, one can legitimately use Numba to create a library that you ship to others (who then don't need to have Numba installed, or just need a Numba run-time installed).

SciPy (and NumPy) would look very different if Numba had existed 16 years ago when SciPy was getting started… and you would all be happier.
81. Releasing the GIL

Many fret about the GIL in Python. With the PyData stack you often have multi-threaded code, and in the PyData stack we quite often release the GIL:
• NumPy does it
• SciPy does it (quite often)
• Scikit-learn (now) does it
• Pandas (now) does it when possible
• Cython makes it easy
• Numba makes it easy
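A sketch of what "Numba makes it easy" looks like with nogil=True. Without Numba installed the fallback still runs correctly, it just holds the GIL:

```python
import threading
import numpy as np
try:
    from numba import jit
except ImportError:
    def jit(**kwargs):                 # no-op fallback when Numba is absent
        def decorator(func):
            return func
        return decorator

@jit(nopython=True, nogil=True)        # compiled body runs without the GIL
def count_below(values, threshold):
    n = 0
    for v in values:
        if v < threshold:
            n += 1
    return n

data = np.arange(1_000_000, dtype=np.float64)
results = []
# Once the GIL is released, the two threads can execute count_below in parallel.
threads = [threading.Thread(target=lambda: results.append(count_below(data, 500_000.0)))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                         # [500000, 500000]
```

nogil only applies in nopython mode: the compiled body touches no Python objects, so it is safe to drop the lock.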
84. CUDA Python (in open-source Numba!)
CUDA development using Python syntax for optimal performance! You have to understand CUDA at least a little: writing kernels that launch in parallel on the GPU.
87. Other interesting things
• CUDA Simulator to debug your code in the Python interpreter
• Generalized ufuncs (@guvectorize)
• Call ctypes and cffi functions directly and pass them as arguments
• Preliminary support for types that understand the buffer protocol
• Pickle Numba functions to run on remote execution engines
• "numba annotate" to dump an HTML annotated version of compiled code
• See: http://numba.pydata.org/numba-doc/0.20.0/
88. What Doesn’t Work?
(A non-comprehensive list)
• Sets, lists, dictionaries, user defined classes (tuples do work!)
• List, set and dictionary comprehensions
• Recursion
• Exceptions with non-constant parameters
• Most string operations (buffer support is very preliminary!)
• yield from
• closures inside a JIT function (compiling JIT functions inside a closure works…)
• Modifying globals
• Passing an axis argument to numpy array reduction functions
• Easy debugging (you have to debug in Python mode).
89. The (Near) Future
(Also a non-comprehensive list)
• "JIT Classes"
• Better support for strings/bytes, buffers, and parsing use-cases
• More coverage of the NumPy API (advanced indexing, etc.)
• Documented extension API for adding your own types, low-level function implementations, and targets.
• Better debug workflows
90. Recently Added Numba Features
• A new GPU target: the Heterogeneous System Architecture, supported by AMD APUs
• Support for named tuples in nopython mode
• Limited support for lists in nopython mode
• On-disk caching of compiled functions (opt-in)
• A simulator for debugging GPU functions with the Python debugger on the CPU
• Can choose to release the GIL in nopython functions
• Many speed improvements
92. Conclusion
• Lots of progress in the past year!
• Try out Numba on your numerical and NumPy-related projects:
conda install numba
• Your feedback helps us make Numba better!
Tell us what you would like to see:
https://github.com/numba/numba
• Stay tuned for more exciting stuff this year…