Python as the Zen of Data Science
Travis E. Oliphant, Ph.D.
Peter Wang
Continuum Analytics
2
Data Science
3
禪 Zen
4
道
Zen Approach
5
• Right Practice
• Right Attitude
• Right Understanding
“The Zen way of calligraphy is to write in the most straightforward,
simple way as if you were a beginner not trying to make something
skillful or beautiful, but simple writing with the full attention as if
you were discovering what you were writing for the first time.”
— Zen Mind, Beginner’s Mind
“Right Understanding”
6
The purpose of studying Buddhism is not to study Buddhism
but to study ourselves. You are not your body.
Zen:
Python:
The purpose of writing Python code is not to just
produce software, but to study ourselves. You are
not your technology stack!
7
• Compose language primitives, built-ins, classes whenever
possible
• Much more powerful and accessible than trying to memorize a
huge list of proprietary functions
• Reject artificial distinctions between where data “should live” and
where it gets computed
• Empower each individual to use their own knowledge, instead of
taking design power out of their hands with pre-ordained
architectures and “stacks”.
Pythonic Approach
Why Python?
8
Analyst
• Uses graphical tools
• Can call functions, cut & paste code
• Can change some variables
Gets paid for: Insight
Tools: Excel, VB, Tableau, … Python

Analyst / Data Developer
• Builds simple apps & workflows
• Used to be "just an analyst"
• Likes coding to solve problems
• Doesn't want to be a "full-time programmer"
Gets paid (like a rock star) for: Code that produces insight
Tools: SAS, R, Matlab, … Python

Programmer
• Creates frameworks & compilers
• Uses IDEs
• Degree in CompSci
• Knows multiple languages
Gets paid for: Code
Tools: C, C++, Java, JS, … Python
Pythonic Data Analysis
9
• Make an immediate connection with the data (using Pandas+ / NumPy+ /
scikit-learn / Bokeh and matplotlib)
• PyData stack means agility which allows you to rapidly build
understanding with powerful modeling and easy manipulation.
• Let the data drive the analysis, not the technology stack.
• Empower the data-scientist, quant, geophysicist, biochemist, directly with
simple language constructs they can use that “fits their brain.”
• Scale-out is a later step. Python can work with you there too. There are
great tools no matter where your data is located or how it is stored or how
your cluster is managed. Not tied to a particular distributed story.
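A minimal sketch of that immediate connection with the data, using pandas and matplotlib (the file name and column names here are made up for illustration):

import pandas as pd
import matplotlib.pyplot as plt

# one line from raw CSV to a queryable, labeled table
df = pd.read_csv('measurements.csv')        # hypothetical file
print(df.describe())                        # instant summary statistics

# split-apply-combine and plot in a single expression
df.groupby('site')['temperature'].mean().plot(kind='bar')
plt.show()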
Zen of Python
10
>>> import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Zen of NumPy (and Pandas)
11
• Strided is better than scattered
• Contiguous is better than strided
• Descriptive is better than imperative (use data-types)
• Array-oriented and data-oriented is often better than object-oriented
• Broadcasting is a great idea – use where possible
• Split-apply-combine is a great idea – use where possible
• Vectorized is better than an explicit loop
• Write more ufuncs and generalized ufuncs (numba can help)
• Unless it’s complicated — then use numba
• Think in higher dimensions
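A small sketch of the vectorization, broadcasting, and ufunc points above; the @vectorize signature follows Numba's documented decorator, and the arrays are arbitrary:

import numpy as np
from numba import vectorize

x = np.random.rand(1000000)

# explicit loop: one element at a time
out = np.empty_like(x)
for i in range(x.size):
    out[i] = 2.0 * x[i] + 1.0

# vectorized: one whole-array expression
out = 2.0 * x + 1.0

# broadcasting: a (3, 1) column combines with a (4,) row to give a (3, 4) grid
grid = np.arange(3).reshape(3, 1) + np.arange(4)

# writing a new ufunc with numba
@vectorize(['float64(float64, float64)'])
def rel_err(a, b):
    return abs(a - b) / abs(b)

errs = rel_err(out, 2.0 * x + 1.0)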
“Zen of Data Science”
12
• Get more and better data.
• Better data is determined by better models.
• How you compute matters.
• Put the data in the hands and minds of people with knowledge.
• Fail quickly and often — but not in the same way.
• Where and how the data is stored is secondary to analysis and
understanding.
• Premature horizontal scaling is the root of all evil.
• When you must scale — data locality and parallel algorithms are the key.
• Learn to think in building blocks that can be parallelized.
PyData Stack -- about 3,000,000 users
13
NumPy, SciPy, Matplotlib, pandas, Bokeh
scikit-learn, scikit-image, statsmodels, Cython
PyTables/Numexpr, Dask/Blaze, SymPy, Numba
OpenCV, astropy, BioPython, GDAL, PySAL
... many many more ...
How do we get 100s of dependencies delivered?
14
We started Continuum with 3 primary goals:
1. Scale the NumPy/PyData stack horizontally
2. Make it easy for Python users to produce data-science applications in the browser/notebook
3. Get more adoption of the PyData stack
So, we wrote a package manager, conda,
and a distribution of Python + R, Anaconda.
Game-Changing
Enterprise Ready
Python Distribution
15
• 2 million downloads in last 2 years
• 200k / month and growing
• conda package manager serves up 5 million
packages per month
• Recommended installer for IPython/Jupyter,
Pandas, SciPy, Scikit-learn, etc.
Some Users
16
Anaconda — Portable Environments
17
• Easy to install
• Quick & agile data exploration
• Powerful data analysis
• Simple to collaborate
• Accessible to all
PYTHON & R OPEN SOURCE ANALYTICS
NumPy, SciPy, Pandas, Scikit-learn, Jupyter/IPython,
Numba, Matplotlib, Spyder, Numexpr, Cython, Theano,
Scikit-image, NLTK, NetworkX, IRKernel, dplyr, shiny,
ggplot2, tidyr, caret, nnet ... and 330+ packages
conda
Traditional Analytics (diagram)
Data sources → ETL → Data Mining and/or Modelling → Ad hoc usage and/or BI report → Production deployment

Traditional Analytics (diagram, repeated with ANACONDA spanning the same pipeline)
Data sources → ETL → Data Mining and/or Modelling → Ad hoc usage and/or BI report → Production deployment
Package Managers
20
Linux: yum (rpm), apt-get (dpkg)
OS X: macports, homebrew, fink
Windows: chocolatey, npackd
Cross-platform: conda (sophisticated light-weight environments included!)
http://conda.pydata.org
Conda features
21
• Excellent support for “system-level” environments — like having mini VMs but much
lighter weight than docker (micro containers)
• Minimizes code-copies (uses hard/soft links if possible)
• Simple format: binary tar-ball + metadata
• Metadata allows static analysis of dependencies
• Easy to create multiple “channels” which are repositories for packages
• User installable (no root privileges needed)
• Integrates very well with pip and other language-specific package managers.
• Cross Platform
Basic Conda Usage
22
Install a package:            conda install sympy
List all installed packages:  conda list
Search for packages:          conda search llvm
Create a new environment:     conda create -n py3k python=3
Remove a package:             conda remove nose
Get help:                     conda install --help
Advanced Conda Usage
23
Install a package in an environment:          conda install -n py3k sympy
Update all packages:                          conda update --all
Export list of packages:                      conda list --export packages.txt
Install packages from an export:              conda install --file packages.txt
See package history:                          conda list --revisions
Revert to a revision:                         conda install --revision 23
Remove unused packages and cached tarballs:   conda clean -pt
Environments
24
• Environments are simple: just link the package to a different directory
• Hard-links are very cheap, and very fast — even on Windows.
• Conda environments are completely independent installations of
everything
• No fiddling with PYTHONPATH or sym-linking site-packages
• “Activating” an environment just means changing your PATH so
that its bin/ or Scripts/ comes first.
conda create -n py3k python=3.5
• Unix:    source activate py3k
• Windows: activate py3k
25
Anaconda Platform Analytics Repository
26
• Commercial long-term support
• Private, on-premise package mirror
• Proprietary tools for building custom
distribution, like Anaconda
• Enterprise tools for managing custom
packages and environments
• Available on the cloud at http://anaconda.org
Anaconda Cluster: Anaconda + Hadoop + Spark
For Data Scientists:
• Rapidly, easily create clusters on EC2, DigitalOcean, on-prem cloud/provisioner
• Manage Python, R, Java, JS packages across the cluster
For Operations & IT:
• Robustly manage runtime state across the cluster
• Outside the scope of rpm, chef, puppet, etc.
• Isolate/sandbox packages & libraries for different jobs or groups of users
• Without introducing complexity of Docker / virtualization
• Cross platform - same tooling for laptops, workstations, servers, clusters
27
Cluster Creation
28
$ acluster create mycluster --profile=spark_profile
$ acluster submit mycluster mycode.py
$ acluster destroy mycluster
spark_profile:
  provider: aws_east
  num_nodes: 4
  node_id: ami-3c994355
  node_type: m1.large

aws_east:
  secret_id: <aws_access_key_id>
  secret_key: <aws_secret_access_key>
  keyname: id_rsa.pub
  location: us-east-1
  private_key: ~/.ssh/id_rsa
  cloud_provider: ec2
  security_group: all-open
http://continuumio.github.io/conda-cluster/quickstart.html
Cluster Management
29
$ acluster manage mycluster list
... info -e
... install python=3 pandas flask
... set_env
... push_env <local> <remote>
$ acluster ssh mycluster
$ acluster run.cmd mycluster "cat /etc/hosts"
Package & environment management:
Easy SSH & remote commands:
http://continuumio.github.io/conda-cluster/manage.html
Anaconda Cluster & Spark
30
# example.py
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster("yarn-client")
conf.setAppName("MY APP")
sc = SparkContext(conf=conf)
# analysis
sc.parallelize(range(1000)).map(lambda x: (x, x % 2)).take(10)
$ acluster submit MY_CLUSTER /path/to/example.py
Remember that Blaze has a higher-level interface to Spark and Dask provides
a more Pythonic approach.
ANACONDA OPEN SOURCE
TECHNOLOGY
31
32
• Infrastructure for meta-data, meta-compute, and expression graphs/dataflow
• Data glue for scale-up or scale-out
• Generic remote computation & query system
• (NumPy+Pandas+LINQ+OLAP+PADL).mashup()
Blaze is an extensible high-level interface for data
analytics. It feels like NumPy/Pandas. It drives other
data systems. Blaze expressions enable high-level
reasoning. It’s an ecosystem of tools.
http://blaze.pydata.org
Blaze
Glue 2.0
33
Python’s legacy as a powerful
glue language
• manipulate files
• call fast libraries
Next-gen Glue:
• Link data silos
• Link disjoint memory &
compute
• Unify disparate runtime
models
• Transcend legacy models of
computers
34
35
Data
36
“Math”
Data
37
Math
Big Data
38
Math
Big Data
39
Math
Big Data
40
Math
Big Data
Programs
41
“General Purpose Programming”
42
Domain-Specific
Query Language
Analytics System
43
html,css,js,…
py,r,sql,…
java,c,cpp,cs
44
?
45
Expressions
Metadata
Runtime
46
Expressions: + - / * ^ [], join, groupby, filter, map, sort, take, where, topk
Metadata: datashape, dtype, shape, stride; hdf5, json, csv, xls, protobuf, avro, ...
Runtime: NumPy, Pandas, R, Julia, K, SQL, Spark, Mongo, Cassandra, ...
APIs, syntax, language
47
Diagram layers: Expressions (blaze), metadata (datashape), Data (storage/containers: odo), Runtime (compute: dask, which parallelizes, optimizes, JITs)
Blaze
48
Interface to query data on different storage systems http://blaze.pydata.org/en/latest/
from blaze import Data

iris = Data('iris.csv')                          # CSV
iris = Data('sqlite:///flowers.db::iris')        # SQL
iris = Data('mongodb://localhost/mydb::iris')    # MongoDB
iris = Data('iris.json')                         # JSON
iris = Data('s3://blaze-data/iris.csv')          # S3
…
Current focus is the “dark data” and pydata stack for run-time (dask, dynd, numpy, pandas, x-ray, etc.) + customer needs (e.g. kdb, mongo).
Blaze
49
Select columns         iris[['sepal_length', 'species']]
Operate                log(iris.sepal_length * 10)
Reduce                 iris.sepal_length.mean()
Split-apply-combine    by(iris.species, shortest=iris.petal_length.min(),
                          longest=iris.petal_length.max(),
                          average=iris.petal_length.mean())
Add new columns        transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width,
                          petal_ratio=iris.petal_length / iris.petal_width)
Text matching          iris.like(species='*versicolor')
Relabel columns        iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH')
Filter                 iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]
50
Blaze uses datashape as its type system (like DyND)
>>> iris = Data('iris.json')
>>> iris.dshape
dshape("""var * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
}""")
Datashape
51
A structured data description language
http://datashape.pydata.org/

var * { x : int32, y : string, z : float64 }

A datashape is a sequence of dimensions followed by a dtype, separated by *.
Dimensions may be fixed (3, 4, ...) or variable (var); unit types include int32, float64, string.
The record { x : int32, y : string, z : float64 } is an ordered struct dtype: a collection of
types keyed by labels. var * {record} is the typical tabular datashape.
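A hedged sketch of constructing the same datashape from Python with the datashape library (dshape is its documented entry point; the shape/measure attribute names are assumptions about its API):

from datashape import dshape

ds = dshape('var * { x : int32, y : string, z : float64 }')
print(ds)          # var * {x: int32, y: string, z: float64}
print(ds.shape)    # the dimensions, here a single variable-length dimension
print(ds.measure)  # the record dtype: {x: int32, y: string, z: float64}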
Datashape
52
{
flowersdb: {
iris: var * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
}
},
iriscsv: var * {
sepal_length: ?float64,
sepal_width: ?float64,
petal_length: ?float64,
petal_width: ?float64,
species: ?string
},
irisjson: var * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
},
irismongo: 150 * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
}
}
# Arrays of Structures
100 * {
name: string,
birthday: date,
address: {
street: string,
city: string,
postalcode: string,
country: string
}
}
# Structure of Arrays
{
x: 100 * 100 * float32,
y: 100 * 100 * float32,
u: 100 * 100 * float32,
v: 100 * 100 * float32,
}
# Function prototype
(3 * int32, float64) -> 3 * float64
# Function prototype with broadcasting dimensions
(A... * int32, A... * int32) -> A... * int32
# Arrays
3 * 4 * int32
3 * 4 * int32
10 * var * float64
3 * complex[float64]
Blaze Server — Lights up your Dark Data
53
Builds off of Blaze uniform interface to host data remotely through a JSON web API.

$ blaze-server server.yaml -e
localhost:6363/compute.json

server.yaml:
iriscsv:
  source: iris.csv
irisdb:
  source: sqlite:///flowers.db::iris
irisjson:
  source: iris.json
  dshape: "var * {name: string, amount: float64}"
irismongo:
  source: mongodb://localhost/mydb::iris
Blaze Server
54
Blaze Client
>>> from blaze import Data
>>> t = Data('blaze://localhost:6363')
>>> t.fields
[u'iriscsv', u'irisdb', u'irisjson', u'irismongo']
>>> t.iriscsv
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
>>> t.irisdb
petal_length petal_width sepal_length sepal_width species
0 1.4 0.2 5.1 3.5 Iris-setosa
1 1.4 0.2 4.9 3.0 Iris-setosa
2 1.3 0.2 4.7 3.2 Iris-setosa
Compute recipes work with existing libraries and have multiple
backends
• python list
• numpy arrays
• dynd
• pandas DataFrame
• Spark, Impala
• Mongo
• dask
55
• Ideally, you can layer expressions over any data
• Write once, deploy anywhere
• Practically, expressions will work better on specific data structures, formats, and engines
• Use odo to copy from one format and/or engine to another (see the sketch below)
56
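A minimal sketch of that odo step (odo's single odo(source, target) call; the URIs mirror the Blaze examples earlier):

from odo import odo
import pandas as pd

# copy a CSV into a SQLite table; odo discovers the datashape and picks a conversion path
odo('iris.csv', 'sqlite:///flowers.db::iris')

# the same call moves data into in-memory containers, e.g. a pandas DataFrame
df = odo('iris.csv', pd.DataFrame)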
57
Dask: Out-of-Core PyData
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python
Core Ideas
• Dynamic task scheduling yields sane parallelism
• Simple library to enable parallelism
• Dask.array/dataframe to encapsulate the functionality
• Distributed scheduler
Example: Ocean Temp Data
58
• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html
• Every 1/4 degree, 720x1440 array each day
Bigger data...
59
36 years: 720 x 1440 x 12341 x 4 = 51 GB uncompressed
If you don't have this much RAM...
... better start chunking.
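A hedged sketch of that chunking with dask.array; it assumes the NOAA data has been gathered into an HDF5 file with an 'sst' dataset of shape (days, 720, 1440), and the file name, dataset name, and chunk shape are all illustrative:

import h5py
import dask.array as da

f = h5py.File('sst_combined.h5', 'r')                   # hypothetical combined file
sst = da.from_array(f['sst'], chunks=(30, 720, 1440))   # blocked view; nothing loaded yet

climatology = sst.mean(axis=0)   # builds a task graph of per-block partial means
result = climatology.compute()   # the scheduler streams blocks through memory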
DAG of Computation
60
• Collections build task graphs
• Schedulers execute task graphs
• Graph specification = uniting interface
• A generalization of RDDs
61
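A minimal sketch of the graph specification itself: a plain dict mapping keys to literals or (function, args...) tasks, which any dask scheduler can execute (this is dask's documented raw-graph form):

from operator import add
from dask.threaded import get   # shared-memory scheduler

def inc(x):
    return x + 1

# key -> literal, or key -> (callable, *args); arguments that are keys become graph edges
dsk = {'a': 1,
       'b': (inc, 'a'),        # b = inc(a)
       'c': (add, 'a', 'b')}   # c = a + b

print(get(dsk, 'c'))   # 3: the scheduler walks the DAG and runs tasks in dependency order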
Simple Architecture for Scaling
62
Dask collections
• dask.array
• dask.dataframe
• dask.bag
• dask.imperative*
Python Ecosystem → Dask Graph Specification → Dask Schedulers
dask.array: OOC, parallel, ND array
63
Arithmetic: +, *, ...
Reductions: mean, max, ...
Slicing: x[10:, 100:50:-2]
Fancy indexing: x[:, [3, 1, 2]]
Some linear algebra: tensordot, qr, svd
Parallel algorithms (approximate quantiles, topk, ...)
Slightly overlapping arrays
Integration with HDF5
Dask Array
64
numpy
dask
>>> import numpy as np
>>> np_ones = np.ones((5000, 1000))
>>> np_ones
array([[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
...,
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.]])
>>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
>>> np_y
array([ 693.14718056, 693.14718056, 693.14718056,
693.14718056, 693.14718056])
>>> import dask.array as da
>>> da_ones = da.ones((5000000, 1000000),
chunks=(1000, 1000))
>>> da_ones.compute()
array([[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
...,
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.]])
>>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
>>> np_da_y = np.array(da_y) #fits in memory
array([ 693.14718056, 693.14718056, 693.14718056,
693.14718056, …, 693.14718056])
# If result doesn’t fit in memory
>>> da_y.to_hdf5('myfile.hdf5', 'result')
dask.dataframe: OOC, parallel, partitioned dataframe
65
Elementwise operations: df.x + df.y
Row-wise selections: df[df.x > 0]
Aggregations: df.x.max()
groupby-aggregate: df.groupby(df.x).y.max()
Value counts: df.x.value_counts()
Drop duplicates: df.x.drop_duplicates()
Join on index: dd.merge(df1, df2, left_index=True,
right_index=True)
Dask Dataframe
66
pandas dask
>>> import pandas as pd
>>> df = pd.read_csv('iris.csv')
>>> df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
>>> max_sepal_length_setosa = df[df.species ==
'Iris-setosa'].sepal_length.max()
5.7999999999999998
>>> import dask.dataframe as dd
>>> ddf = dd.read_csv('*.csv')
>>> ddf.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
…
>>> d_max_sepal_length_setosa = ddf[ddf.species ==
'Iris-setosa'].sepal_length.max()
>>> d_max_sepal_length_setosa.compute()
5.7999999999999998
More Complex Graphs
67
cross validation
68
http://continuum.io/blog/xray-dask
69
from dask import dataframe as dd
columns = ["name", "amenity", "Longitude", "Latitude"]
data = dd.read_csv('POIWorld.csv', usecols=columns)
with_name = data[data.name.notnull()]
with_amenity = data[data.amenity.notnull()]
is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
is_dunkin = with_name.name.str.contains('[Dd]unkin')
starbucks = with_name[is_starbucks]
dunkin = with_name[is_dunkin]
locs = dd.compute(starbucks.Longitude,
starbucks.Latitude,
dunkin.Longitude,
dunkin.Latitude)
# extract arrays of values from the series:
lon_s, lat_s, lon_d, lat_d = [loc.values for loc in locs]
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
def draw_USA():
    """initialize a basemap centered on the continental USA"""
    plt.figure(figsize=(14, 10))
    return Basemap(projection='lcc', resolution='l',
                   llcrnrlon=-119, urcrnrlon=-64,
                   llcrnrlat=22, urcrnrlat=49,
                   lat_1=33, lat_2=45, lon_0=-95,
                   area_thresh=10000)
m = draw_USA()
# Draw map background
m.fillcontinents(color='white', lake_color='#eeeeee')
m.drawstates(color='lightgray')
m.drawcoastlines(color='lightgray')
m.drawcountries(color='lightgray')
m.drawmapboundary(fill_color='#eeeeee')
# Plot the values in Starbucks Green and Dunkin Donuts Orange
style = dict(s=5, marker='o', alpha=0.5, zorder=2)
m.scatter(lon_s, lat_s, latlon=True,
label="Starbucks", color='#00592D', **style)
m.scatter(lon_d, lat_d, latlon=True,
label="Dunkin' Donuts", color='#FC772A', **style)
plt.legend(loc='lower left', frameon=False);
Distributed
70
Pythonic Multiple-machine Parallelism that understands Dask graphs
1) Defines Center (dcenter) and Worker (dworker)
2) Simplified setup with dcluster, for example:
   dcluster 192.168.0.{1,2,3,4}
   or
   dcluster --hostfile hostfile.txt
3) Create Executor objects like concurrent.futures (Python 3) or futures (Python 2.7 back-port)
4) Data locality supported with ad-hoc task graphs by returning futures wherever possible
New library but stabilizing quickly — communicate with blaze-dev@continuum.io
Python and Hadoop
(without the JVM)
71
Join the conversation!
Chat:  http://gitter.im/blaze/dask
Email: blaze-dev@continuum.io
HDFS without Java
72
1. HDFS splits large files into many small blocks replicated on many
datanodes
2. For efficient computation we must use data directly on datanodes
3. distributed.hdfs queries the locations of the individual blocks
4. distributed executes functions directly on those blocks on the
datanodes
5. distributed+pandas enables distributed CSV processing on HDFS
in pure Python
6. Coming soon — dask on hdfs
73
$ hdfs dfs -cp yellow_tripdata_2014-01.csv /data/nyctaxi/
>>> from distributed import hdfs
>>> blocks = hdfs.get_locations('/data/nyctaxi/', '192.168.50.100', 9000)
>>> columns = ['vendor_id', 'pickup_datetime', 'dropoff_datetime',
'passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude',
'rate_code', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude',
'payment_type', 'fare_amount', 'surcharge', 'mta_tax', 'tip_amount',
'tolls_amount', 'total_amount']
>>> import pandas as pd
>>> from distributed import Executor
>>> executor = Executor('192.168.1.100:8787')
>>> dfs = [executor.submit(pd.read_csv, block['path'], workers=block['hosts'],
...                        names=columns, skiprows=1)
...        for block in blocks]
These operations produce Future objects that point to remote results on the worker computers. This does not
pull results back to local memory. We can use these futures in later computations with the executor.
74
def sum_series(seq):
    result = seq[0]
    for s in seq[1:]:
        result = result.add(s, fill_value=0)
    return result
>>> counts = executor.map(lambda df: df.passenger_count.value_counts(), dfs)
>>> total = executor.submit(sum_series, counts)
>>> total.result()
0 259
1 9727301
2 1891581
3 566248
4 267540
5 789070
6 540444
7 7
8 5
9 16
208 19
Bokeh
75
http://bokeh.pydata.org
• Interactive visualization
• Novel graphics
• Streaming, dynamic, large data
• For the browser, with or without a server
• No need to write Javascript
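A small sketch of Bokeh's plotting interface: standalone HTML output, interactive tools, and no hand-written JavaScript (the output file name and data are arbitrary):

import numpy as np
from bokeh.plotting import figure, output_file, show

x = np.linspace(0, 4 * np.pi, 200)

output_file('sine.html')    # standalone document; a Bokeh server is optional
p = figure(title='Interactive sine', tools='pan,wheel_zoom,box_zoom,reset')
p.line(x, np.sin(x), line_width=2)
p.circle(x[::10], np.sin(x[::10]), size=6)
show(p)                     # opens the interactive plot in the browser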
Versatile Plots
76
Novel Graphics
77
Previous: Javascript code generation
78
Diagram: server.py (App Model + plot.js.template) generates a JavaScript string
(js_str = """ <d3.js> <highchart.js> <etc.js> """) and emits HTML for the Browser,
which renders it with D3, highcharts, flot, crossfilter, etc.
One-shot; no MVC interaction; no data streaming
bokeh.py & bokeh.js
79
Diagram: the bokeh.py object graph (App Model in server.py) is serialized to JSON,
sent to bokeh-server, and mirrored as the BokehJS object graph in the Browser.
80
81
4GB Interactive Web Viz
rBokeh
82
http://hafen.github.io/rbokeh
83
84
85
86
http://nbviewer.ipython.org/github/bokeh/bokeh-notebooks/blob/master/tutorial/00 - intro.ipynb#Interaction
Additional Demos & Topics
87
• Airline flights
• Pandas table
• Streaming / Animation
• Large data rendering
88
• Dynamic, just-in-time compiler for Python & NumPy
• Uses LLVM
• Outputs x86 and GPU (CUDA, HSA)
• (Premium version is in the Accelerate part of Anaconda Workgroup and Anaconda Enterprise subscriptions)
http://numba.pydata.org
Numba
Python Compilation Space
89
                               Ahead Of Time                        Just In Time
Relies on CPython/libpython    Cython, Shedskin, Nuitka (today),    Numba, HOPE, Theano
                               Pythran
Replaces CPython/libpython     Nuitka (future)                      Pyston, PyPy
Example
90
Numba
91
from numba import jit

@jit('void(f8[:,:],f8[:,:],f8[:,:])')
def filter(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2, j+l-n//2]*filt[k, l]
            output[i, j] = result
~1500x speed-up
Numba Features
• Numba supports:
– Windows, OS X, and Linux
– 32 and 64-bit x86 CPUs and NVIDIA GPUs
– Python 2 and 3
– NumPy versions 1.6 through 1.9
• Does not require a C/C++ compiler on the user’s system.
• < 70 MB to install.
• Does not replace the standard Python interpreter
  (all of your existing Python libraries are still available)
92
Numba Modes
• object mode: Compiled code operates on Python objects. Only
significant performance improvement is compilation of loops that can be
compiled in nopython mode (see below).
• nopython mode: Compiled code operates on “machine native” data.
Usually within 25% of the performance of equivalent C or FORTRAN.
93
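A hedged sketch contrasting the modes: nopython=True makes compilation fail loudly instead of silently falling back to object mode (the function and data are arbitrary):

import numpy as np
from numba import jit

@jit(nopython=True)     # refuse object-mode fallback
def sum_of_squares(x):
    total = 0.0
    for v in x:
        total += v * v
    return total

x = np.random.rand(1000000)
print(sum_of_squares(x))   # first call compiles for float64[:]; later calls reuse the machine code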
The Basics
94
The Basics
95
(code screenshot) Array allocation; looping over ndarray x as an iterator; using numpy
math functions; returning a slice of the array; Numba decorator (nopython=True not
required). 2.7x speedup!
CUDA Python (in open-source Numba!)
96
CUDA Development
using Python syntax for
optimal performance!
You have to understand
CUDA at least a little —
writing kernels that launch
in parallel on the GPU
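A minimal CUDA-Python kernel sketch with numba.cuda (needs an NVIDIA GPU and the CUDA toolkit; the block/grid sizes are illustrative, not tuned):

import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)        # global thread index
    if i < out.size:        # guard the ragged last block
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
saxpy[blocks, threads_per_block](2.0, x, y, out)   # numba handles host<->device copies here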
Example: Black-Scholes
97
Black-Scholes: Results
98
core i7 GeForce GTX 560 Ti
About 9x faster
on this GPU
~ same speed as
CUDA-C
Other interesting things
• CUDA Simulator to debug your code in Python interpreter
• Generalized ufuncs (@guvectorize)
• Call ctypes and cffi functions directly and pass them as arguments
• Preliminary support for types that understand the buffer protocol
• Pickle Numba functions to run on remote execution engines
• “numba annotate” to dump HTML annotated version of compiled code
• See: http://numba.pydata.org/numba-doc/0.20.0/
99
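For the @guvectorize item above, a hedged sketch of a generalized ufunc that works on whole rows; the layout string '(n),()->(n)' follows Numba's documented gufunc signature syntax:

import numpy as np
from numba import guvectorize

# subtract (row mean + offset) from each row; broadcasts over any leading dimensions
@guvectorize(['void(float64[:], float64, float64[:])'], '(n),()->(n)')
def demean(row, offset, out):
    total = 0.0
    for i in range(row.shape[0]):
        total += row[i]
    m = total / row.shape[0] + offset
    for i in range(row.shape[0]):
        out[i] = row[i] - m

data = np.random.rand(4, 5)
print(demean(data, 0.0))    # applied row by row over the leading dimension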
What Doesn’t Work?
(A non-comprehensive list)
• Sets, lists, dictionaries, user defined classes (tuples do work!)
• List, set and dictionary comprehensions
• Recursion
• Exceptions with non-constant parameters
• Most string operations (buffer support is very preliminary!)
• yield from
• closures inside a JIT function (compiling JIT functions inside a closure works…)
• Modifying globals
• Passing an axis argument to numpy array reduction functions
• Easy debugging (you have to debug in Python mode).
100
How Numba Works
101
@jit
def do_math(a, b):
    …

>>> do_math(x, y)

Pipeline: Python Function (bytecode) + Function Arguments → Bytecode Analysis →
Type Inference → Numba IR → Rewrite IR → Lowering → LLVM IR → LLVM JIT →
Machine Code → Cache → Execute!
Recently Added Numba Features
• A new GPU target: the Heterogeneous System Architecture, supported by AMD APUs
• Support for named tuples in nopython mode
• Limited support for lists in nopython mode
• On-disk caching of compiled functions (opt-in) — both LLVM and pre-compiled
• A simulator for debugging GPU functions with the Python debugger on the CPU
• Can choose to release the GIL in nopython functions
• Ahead of time compilation
• vectorize and guvectorize on GPU and parallel targets now in open-source Numba!
• JIT Classes — coming soon!
102
Weitere ähnliche Inhalte

Was ist angesagt?

Large Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextLarge Scale Processing of Unstructured Text
Large Scale Processing of Unstructured Text
DataWorks Summit
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
MLconf
 
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Simplilearn
 

Was ist angesagt? (15)

PyData Barcelona Keynote
PyData Barcelona KeynotePyData Barcelona Keynote
PyData Barcelona Keynote
 
Large Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextLarge Scale Processing of Unstructured Text
Large Scale Processing of Unstructured Text
 
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
 
DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François Garillot
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
 
High Performance Cloud Computing
High Performance Cloud ComputingHigh Performance Cloud Computing
High Performance Cloud Computing
 
Deep learning on Hadoop/Spark -NextML
Deep learning on Hadoop/Spark -NextMLDeep learning on Hadoop/Spark -NextML
Deep learning on Hadoop/Spark -NextML
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
IPython: A Modern Vision of Interactive Computing (PyData SV 2013)
IPython: A Modern Vision of Interactive Computing (PyData SV 2013)IPython: A Modern Vision of Interactive Computing (PyData SV 2013)
IPython: A Modern Vision of Interactive Computing (PyData SV 2013)
 
Apache Toree
Apache ToreeApache Toree
Apache Toree
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVec
 
Hadoop summit 2016
Hadoop summit 2016Hadoop summit 2016
Hadoop summit 2016
 
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 Edition
 

Ähnlich wie Python as the Zen of Data Science

1-Intro-to-Python-Data-Science-Libraries-and-Pytorch.pptx
1-Intro-to-Python-Data-Science-Libraries-and-Pytorch.pptx1-Intro-to-Python-Data-Science-Libraries-and-Pytorch.pptx
1-Intro-to-Python-Data-Science-Libraries-and-Pytorch.pptx
oesmail21
 
Apache pig as a researcher’s stepping stone
Apache pig as a researcher’s stepping stoneApache pig as a researcher’s stepping stone
Apache pig as a researcher’s stepping stone
benosteen
 

Ähnlich wie Python as the Zen of Data Science (20)

Session 2
Session 2Session 2
Session 2
 
AI Deep Learning - CF Machine Learning
AI Deep Learning - CF Machine LearningAI Deep Learning - CF Machine Learning
AI Deep Learning - CF Machine Learning
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
1-Intro-to-Python-Data-Science-Libraries-and-Pytorch.pptx
1-Intro-to-Python-Data-Science-Libraries-and-Pytorch.pptx1-Intro-to-Python-Data-Science-Libraries-and-Pytorch.pptx
1-Intro-to-Python-Data-Science-Libraries-and-Pytorch.pptx
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
Large Data Analyze With PyTables
Large Data Analyze With PyTablesLarge Data Analyze With PyTables
Large Data Analyze With PyTables
 
PyTables
PyTablesPyTables
PyTables
 
Py tables
Py tablesPy tables
Py tables
 
Apache pig as a researcher’s stepping stone
Apache pig as a researcher’s stepping stoneApache pig as a researcher’s stepping stone
Apache pig as a researcher’s stepping stone
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
New Capabilities in the PyData Ecosystem
New Capabilities in the PyData EcosystemNew Capabilities in the PyData Ecosystem
New Capabilities in the PyData Ecosystem
 
PyTables
PyTablesPyTables
PyTables
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in Wakari
 
ANN-Lecture2-Python Startup.pptx
ANN-Lecture2-Python Startup.pptxANN-Lecture2-Python Startup.pptx
ANN-Lecture2-Python Startup.pptx
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
Abhishek Training PPT.pptx
Abhishek Training PPT.pptxAbhishek Training PPT.pptx
Abhishek Training PPT.pptx
 
Putting the Magic in Data Science
Putting the Magic in Data SciencePutting the Magic in Data Science
Putting the Magic in Data Science
 

Mehr von Travis Oliphant

Mehr von Travis Oliphant (14)

Array computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyDataArray computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyData
 
SciPy Latin America 2019
SciPy Latin America 2019SciPy Latin America 2019
SciPy Latin America 2019
 
PyCon Estonia 2019
PyCon Estonia 2019PyCon Estonia 2019
PyCon Estonia 2019
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft Presentation
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUs
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Effectively using Open Source with conda
Effectively using Open Source with condaEffectively using Open Source with conda
Effectively using Open Source with conda
 
Blaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for PythonBlaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for Python
 
Numba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyNumba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPy
 
Numba lightning
Numba lightningNumba lightning
Numba lightning
 
Numba
NumbaNumba
Numba
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Python as the Zen of Data Science

  • 1. Python as the Zen of Data Science Travis E. Oliphant, Ph.D. Peter Wang Continuum Analytics
  • 5. Zen Approach 5 • Right Practice • Right Attitude • Right Understanding “The Zen way of calligraphy is to write in the most straightforward, simple way as if you were a beginner not trying to make something skillful or beautiful, but simple writing with the full attention as if you were discovering what you were writing for the first time.” — Zen Mind, Beginner’s Mind
  • 6. “Right Understanding” 6 The purpose of studying Buddhism is not to study Buddhism but to study ourselves. You are not your body. Zen: Python: The purpose of writing Python code is not to just produce software, but to study ourselves. You are not your technology stack!
  • 7. 7 • Compose language primitives, built-ins, classes whenever possible • Much more powerful and accessible than trying to memorize a huge list of proprietary functions • Reject artificial distinctions between where data “should live” and where it gets computed • Empower each individual to use their own knowledge, instead of taking design power out of their hands with pre-ordained architectures and “stacks”. Pythonic Approach
  • 8. Why Python? 8 Analyst • Uses graphical tools • Can call functions, cut & paste code • Can change some variables Gets paid for: Insight Excel, VB, Tableau, Analyst / Data Developer • Builds simple apps & workflows • Used to be "just an analyst" • Likes coding to solve problems • Doesn't want to be a "full-time programmer" Gets paid (like a rock star) for: Code that produces insight SAS, R, Matlab, Programmer • Creates frameworks & compilers • Uses IDEs • Degree in CompSci • Knows multiple languages Gets paid for: Code C, C++, Java, JS, Python Python Python
  • 9. Pythonic Data Analysis 9 • Make an immediate connection with the data (using Pandas+ / NumPy+ / scikit-learn / Bokeh and matplotlib) • PyData stack means agility which allows you to rapidly build understanding with powerful modeling and easy manipulation. • Let the Data drive the analysis not the technology stack. • Empower the data-scientist, quant, geophysicist, biochemist, directly with simple language constructs they can use that “fits their brain.” • Scale-out is a later step. Python can work with you there too. There are great tools no matter where your data is located or how it is stored or how your cluster is managed. Not tied to a particular distributed story.
  • 10. Zen of Python 10 >>> import this The Zen of Python, by Tim Peters Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts. Special cases aren't special enough to break the rules. Although practicality beats purity. Errors should never pass silently. Unless explicitly silenced. In the face of ambiguity, refuse the temptation to guess. There should be one-- and preferably only one --obvious way to do it. Although that way may not be obvious at first unless you're Dutch. Now is better than never. Although never is often better than *right* now. If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea. Namespaces are one honking great idea -- let's do more of those!
  • 11. Zen of NumPy (and Pandas) 11 • Strided is better than scattered • Contiguous is better than strided • Descriptive is better than imperative (use data-types) • Array-oriented and data-oriented is often better than object-oriented • Broadcasting is a great idea – use where possible • Split-apply-combine is a great idea – use where possible • Vectorized is better than an explicit loop • Write more ufuncs and generalized ufuncs (numba can help) • Unless it’s complicated — then use numba • Think in higher dimensions
  • 12. “Zen of Data Science” 12 • Get More and better data. • Better data is determined by better models. • How you compute matters. • Put the data in the hands and minds of people with knowledge. • Fail quickly and often — but not in the same way. • Where and how the data is stored is secondary to analysis and understanding. • Premature horizontal scaling is the root of all evil. • When you must scale — data locality and parallel algorithms are the key. • Learn to think in building blocks that can be parallelized.
  • 13. PyData Stack -- about 3,000,000 users 13 NumPy scikit-learnscikit-image statsmodels Cython PyTables/Numexpr Dask / Blaze SymPy Numba OpenCV astropy BioPython GDALPySAL ... many many more ... MatplotlibSciPy Bokehpandas
  • 14. How do we get 100s of dependencies delivered? 14 We started Continuum with 3 primary goals:
 1. Scale the NumPy/PyData stack horizontally 2. Make it easy for Python users to produce data-science applications in the browser/ notebook 3. Get more adoption of the PyData stack So, we wrote a package manager, conda. and a distribution of Python + R, Anaconda.
  • 15. Game-Changing Enterprise Ready Python Distribution 15 • 2 million downloads in last 2 years • 200k / month and growing • conda package manager serves up 5 million packages per month • Recommended installer for IPython/Jupyter, Pandas, SciPy, Scikit-learn, etc.
  • 17. Anaconda — Portable Environments 17 • Easy to install • Quick & agile data exploration • Powerful data analysis • Simple to collaborate • Accessible to all PYTHON & R OPEN SOURCE ANALYTICS NumPy SciPy Pandas Scikit-learn Jupyter/ IPython Numba Matplotlib Spyder Numexpr Cython Theano Scikit-image NLTK NetworkX IRKernel dplyr shiny ggplot2 tidyr caret nnet And 330+ packages conda
  • 18. Traditional Analytics Ad hoc usage and/or BI report Production deployment Data Data Data Data Data Mining and/or Modelling ETL
  • 19. Traditional Analytics Ad hoc usage and/or BI report Production deployment Data Data Data Data Data Mining and/or Modelling ETL ANACONDA
  • 20. Package Managers 20 yum (rpm) apt-get (dpkg) Linux OSX macports homebrew fink Windows chocolatey npackd Cross-platform conda Sophisticated light-weight environments included! http://conda.pydata.org
  • 21. Conda features 21 • Excellent support for “system-level” environments — like having mini VMs but much lighter weight than docker (micro containers) • Minimizes code-copies (uses hard/soft links if possible) • Simple format: binary tar-ball + metadata • Metadata allows static analysis of dependencies • Easy to create multiple “channels” which are repositories for packages • User installable (no root privileges needed) • Integrates very well with pip and other language-specific package managers. • Cross Platform
  • 22. Basic Conda Usage 22 Install a package conda install sympy List all installed packages conda list Search for packages conda search llvm Create a new environment conda create -n py3k python=3 Remove a package conda remove nose Get help conda install --help
  • 23. Advanced Conda Usage 23 Install a package in an environment conda install -n py3k sympy Update all packages conda update --all Export list of packages conda list --export packages.txt Install packages from an export conda install --file packages.txt See package history conda list --revisions Revert to a revision conda install --revision 23 Remove unused packages and cached tarballs conda clean -pt
  • 24. Environments 24 • Environments are simple: just link the package to a different directory • Hard-links are very cheap, and very fast — even on Windows. • Conda environments are completely independent installations of everything • No fiddling with PYTHONPATH or sym-linking site-packages • “Activating” an environment just means changing your PATH so that its bin/ or Scripts/ comes first. • Unix:
 • Windows: conda create -n py3k python=3.5 source activate py3k activate py3k
  • 25. 25
  • 26. Anaconda Platform Analytics Repository 26 • Commercial long-term support • Private, on-premise package mirror • Proprietary tools for building custom distribution, like Anaconda • Enterprise tools for managing custom packages and environments • Available on the cloud at 
 http://anaconda.org
  • 27. Anaconda Cluster: Anaconda + Hadoop + Spark For Data Scientists: • Rapidly, easily create clusters on EC2, DigitalOcean, on-prem cloud/provisioner • Manage Python, R, Java, JS packages across the cluster For Operations & IT: • Robustly manage runtime state across the cluster • Outside the scope of rpm, chef, puppet, etc. • Isolate/sandbox packages & libraries for different jobs or groups of users • Without introducing complexity of Docker / virtualization • Cross platform - same tooling for laptops, workstations, servers, clusters 27
  • 28. Cluster Creation 28 $ acluster create mycluster —profile=spark_profile $ acluster submit mycluster mycode.py $ acluster destroy mycluster spark_profile: provider: aws_east num_nodes: 4 node_id: ami-3c994355 node_type: m1.large aws_east: secret_id: <aws_access_key_id> secret_key: <aws_secret_access_key> keyname: id_rsa.pub location: us-east-1 private_key: ~/.ssh/id_rsa cloud_provider: ec2 security_group: all-open http://continuumio.github.io/conda-cluster/quickstart.html
  • 29. Cluster Management 29 $ acluster manage mycluster list ... info -e ... install python=3 pandas flask ... set_env ... push_env <local> <remote> $ acluster ssh mycluster $ acluster run.cmd mycluster "cat /etc/hosts" Package & environment management: Easy SSH & remote commands: http://continuumio.github.io/conda-cluster/manage.html
  • 30. Anaconda Cluster & Spark 30 # example.py conf = SparkConf() conf.setMaster("yarn-client") conf.setAppName("MY APP") sc = SparkContext(conf=conf) # analysis sc.parallelize(range(1000)).map(lambda x: (x, x % 2)).take(10) $ acluster submit MY_CLUSTER /path/to/example.py Remember that Blaze has a higher-level interface to Spark and Dask provides a more Pythonic approach.
  • 32. 32 • Infrastructure for meta-data, meta-compute, and expression graphs/dataflow • Data glue for scale-up or scale-out • Generic remote computation & query system • (NumPy+Pandas+LINQ+OLAP+PADL).mashup() Blaze is an extensible high-level interface for data analytics. It feels like NumPy/Pandas. It drives other data systems. Blaze expressions enable high-level reasoning. It’s an ecosystem of tools. http://blaze.pydata.org Blaze
  • 33. Glue 2.0 33 Python’s legacy as a powerful glue language • manipulate files • call fast libraries Next-gen Glue: • Link data silos • Link disjoint memory & compute • Unify disparate runtime models • Transcend legacy models of computers
  • 34. 34
  • 44. 44 ?
  • 46. 46 + - / * ^ [] join, groupby, filter map, sort, take where, topk datashape,dtype, shape,stride hdf5,json,csv,xls protobuf,avro,... NumPy,Pandas,R, Julia,K,SQL,Spark, Mongo,Cassandra,...
  • 47. APIs, syntax, language 47 Data Runtime Expressions metadata storage/containers compute datashape blaze dask odo parallelize optimize, JIT
  • 48. Blaze 48 Interface to query data on different storage systems http://blaze.pydata.org/en/latest/ from blaze import Data iris = Data('iris.csv') iris = Data('sqlite:///flowers.db::iris') iris = Data('mongodb://localhost/mydb::iris') iris = Data('iris.json') CSV SQL MongoDB JSON iris = Data('s3://blaze-data/iris.csv')S3 … Current focus is the “dark data” and pydata stack for run-time (dask, dynd, numpy, pandas, x-ray, etc.) + customer needs (i.e. kdb, mongo).
  • 49. Blaze 49 iris[['sepal_length', 'species']]Select columns log(iris.sepal_length * 10)Operate Reduce iris.sepal_length.mean() Split-apply -combine by(iris.species, shortest=iris.petal_length.min(), longest=iris.petal_length.max(), average=iris.petal_length.mean()) Add new columns transform(iris, sepal_ratio = iris.sepal_length / iris.sepal_width, petal_ratio = iris.petal_length / iris.petal_width) Text matching iris.like(species='*versicolor') iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH') Relabel columns Filter iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]
  • 50. 50 datashapeblaze Blaze uses datashape as its type system (like DyND) >>> iris = Data('iris.json') >>> iris.dshape dshape("""var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string }""")
  • 51. Datashape 51 A structured data description language http://datashape.pydata.org/ dimension dtype unit types var 3 string int32 4 float64 * * * * var * { x : int32, y : string, z : float64 } datashape tabular datashape record ordered struct dtype { x : int32, y : string, z : float64 } collection of types keyed by labels 4
  • 52. Datashape 52 { flowersdb: { iris: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } }, iriscsv: var * { sepal_length: ?float64, sepal_width: ?float64, petal_length: ?float64, petal_width: ?float64, species: ?string }, irisjson: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string }, irismongo: 150 * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } } # Arrays of Structures 100 * { name: string, birthday: date, address: { street: string, city: string, postalcode: string, country: string } } # Structure of Arrays { x: 100 * 100 * float32, y: 100 * 100 * float32, u: 100 * 100 * float32, v: 100 * 100 * float32, } # Function prototype (3 * int32, float64) -> 3 * float64 # Function prototype with broadcasting dimensions (A... * int32, A... * int32) -> A... * int32 # Arrays 3 * 4 * int32 3 * 4 * int32 10 * var * float64 3 * complex[float64]
  • 53. iriscsv: source: iris.csv irisdb: source: sqlite:///flowers.db::iris irisjson: source: iris.json dshape: "var * {name: string, amount: float64}" irismongo: source: mongodb://localhost/mydb::iris Blaze Server — Lights up your Dark Data 53 Builds off of Blaze uniform interface to host data remotely through a JSON web API. $ blaze-server server.yaml -e localhost:6363/compute.json server.yaml
  • 54. Blaze Server 54 Blaze Client >>> from blaze import Data >>> s = Data('blaze://localhost:6363') >>> t.fields [u'iriscsv', u'irisdb', u'irisjson', u’irismongo'] >>> t.iriscsv sepal_length sepal_width petal_length petal_width species 0 5.1 3.5 1.4 0.2 Iris-setosa 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa >>> t.irisdb petal_length petal_width sepal_length sepal_width species 0 1.4 0.2 5.1 3.5 Iris-setosa 1 1.4 0.2 4.9 3.0 Iris-setosa 2 1.3 0.2 4.7 3.2 Iris-setosa
  • 55. Compute recipes work with existing libraries and have multiple backends • python list • numpy arrays • dynd • pandas DataFrame • Spark, Impala • Mongo • dask 55
  • 56. • Ideally, you can layer expressions over any data
 • Write once, deploy anywhere
 • Practically, expressions will work better on specific data structures, formats, and engines
 • Use odo to copy from one format and/or engine to another 56
  • 57. 57 Dask: Out-of-Core PyData • A parallel computing framework • That leverages the excellent Python ecosystem • Using blocked algorithms and task scheduling • Written in pure Python Core Ideas • Dynamic task scheduling yields sane parallelism • Simple library to enable parallelism • Dask.array/dataframe to encapsulate the functionality • Distributed scheduler
  • 58. Example: Ocean Temp Data 58 • http://www.esrl.noaa.gov/psd/data/gridded/ data.noaa.oisst.v2.highres.html • Every 1/4 degree, 720x1440 array each day
  • 59. Bigger data... 59 36 years: 720 x 1440 x 12341 x 4 = 51 GB uncompressed If you don't have this much RAM... ... better start chunking.
  • 61. • Collections build task graphs • Schedulers execute task graphs • Graph specification = uniting interface • A generalization of RDDs 61
  • 62. Simple Architecture for Scaling 62
Dask collections: dask.array, dask.dataframe, dask.bag, dask.imperative*
Python Ecosystem
Dask Graph Specification
Dask Schedulers
  • 63. dask.array: OOC, parallel, ND array 63 Arithmetic: +, *, ... Reductions: mean, max, ... Slicing: x[10:, 100:50:-2] Fancy indexing: x[:, [3, 1, 2]] Some linear algebra: tensordot, qr, svd Parallel algorithms (approximate quantiles, topk, ...) Slightly overlapping arrays Integration with HDF5
  • 64. Dask Array 64

numpy:
>>> import numpy as np
>>> np_ones = np.ones((5000, 1000))
>>> np_ones
array([[ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.],
       ...,
       [ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.]])
>>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
>>> np_y
array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, 693.14718056])

dask:
>>> import dask.array as da
>>> da_ones = da.ones((5000000, 1000000), chunks=(1000, 1000))
>>> da_ones.compute()
array([[ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.],
       ...,
       [ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.]])
>>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
>>> np_da_y = np.array(da_y)  # fits in memory
array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, …, 693.14718056])

# If result doesn't fit in memory
>>> da_y.to_hdf5('myfile.hdf5', 'result')
  • 65. dask.dataframe: OOC, parallel dataframe 65 Elementwise operations: df.x + df.y Row-wise selections: df[df.x > 0] Aggregations: df.x.max() groupby-aggregate: df.groupby(df.x).y.max() Value counts: df.x.value_counts() Drop duplicates: df.x.drop_duplicates() Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
  • 66. Dask Dataframe 66

pandas:
>>> import pandas as pd
>>> df = pd.read_csv('iris.csv')
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
>>> max_sepal_length_setosa = df[df.species == 'setosa'].sepal_length.max()
5.7999999999999998

dask:
>>> import dask.dataframe as dd
>>> ddf = dd.read_csv('*.csv')
>>> ddf.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
…
>>> d_max_sepal_length_setosa = ddf[ddf.species == 'setosa'].sepal_length.max()
>>> d_max_sepal_length_setosa.compute()
5.7999999999999998
  • 69. 69
from dask import dataframe as dd

columns = ["name", "amenity", "Longitude", "Latitude"]
data = dd.read_csv('POIWorld.csv', usecols=columns)

with_name = data[data.name.notnull()]
with_amenity = data[data.amenity.notnull()]

is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
is_dunkin = with_name.name.str.contains('[Dd]unkin')

starbucks = with_name[is_starbucks]
dunkin = with_name[is_dunkin]

locs = dd.compute(starbucks.Longitude, starbucks.Latitude,
                  dunkin.Longitude, dunkin.Latitude)

# extract arrays of values from the series:
lon_s, lat_s, lon_d, lat_d = [loc.values for loc in locs]

%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

def draw_USA():
    """initialize a basemap centered on the continental USA"""
    plt.figure(figsize=(14, 10))
    return Basemap(projection='lcc', resolution='l',
                   llcrnrlon=-119, urcrnrlon=-64,
                   llcrnrlat=22, urcrnrlat=49,
                   lat_1=33, lat_2=45, lon_0=-95,
                   area_thresh=10000)

m = draw_USA()

# Draw map background
m.fillcontinents(color='white', lake_color='#eeeeee')
m.drawstates(color='lightgray')
m.drawcoastlines(color='lightgray')
m.drawcountries(color='lightgray')
m.drawmapboundary(fill_color='#eeeeee')

# Plot the values in Starbucks Green and Dunkin Donuts Orange
style = dict(s=5, marker='o', alpha=0.5, zorder=2)
m.scatter(lon_s, lat_s, latlon=True, label="Starbucks", color='#00592D', **style)
m.scatter(lon_d, lat_d, latlon=True, label="Dunkin' Donuts", color='#FC772A', **style)
plt.legend(loc='lower left', frameon=False);
  • 70. Distributed 70: Pythonic multiple-machine parallelism that understands Dask graphs
1) Defines a Center (dcenter) and Workers (dworker)
2) Simplified setup with dcluster, for example:
   dcluster 192.168.0.{1,2,3,4}
   or
   dcluster --hostfile hostfile.txt
3) Create Executor objects like concurrent.futures (Python 3) or futures (Python 2.7 back-port); a sketch follows below
4) Data locality supported with ad-hoc task graphs by returning futures wherever possible
New library, but stabilizing quickly; communicate with blaze-dev@continuum.io
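A minimal sketch of the Executor workflow described above, assuming a scheduler is reachable at the address shown (in current distributed releases this class is named Client):

from distributed import Executor

e = Executor('192.168.1.100:8787')     # connect to the running center/scheduler

def inc(x):
    return x + 1

future = e.submit(inc, 10)             # returns a Future immediately
futures = e.map(inc, range(100))       # one Future per input

print(future.result())                 # blocks until the remote task finishes -> 11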
  • 71. Python and Hadoop (without the JVM) 71
Chat: http://gitter.im/blaze/dask
Email: blaze-dev@continuum.io
Join the conversation!
  • 72. HDFS without Java 72 1. HDFS splits large files into many small blocks replicated on many datanodes 2. For efficient computation we must use data directly on datanodes 3. distributed.hdfs queries the locations of the individual blocks 4. distributed executes functions directly on those blocks on the datanodes 5. distributed+pandas enables distributed CSV processing on HDFS in pure Python 6. Coming soon — dask on hdfs
  • 73. 73
$ hdfs dfs -cp yellow_tripdata_2014-01.csv /data/nyctaxi/

>>> from distributed import hdfs
>>> blocks = hdfs.get_locations('/data/nyctaxi/', '192.168.50.100', 9000)

>>> columns = ['vendor_id', 'pickup_datetime', 'dropoff_datetime', 'passenger_count',
...            'trip_distance', 'pickup_longitude', 'pickup_latitude', 'rate_code',
...            'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude',
...            'payment_type', 'fare_amount', 'surcharge', 'mta_tax', 'tip_amount',
...            'tolls_amount', 'total_amount']

>>> from distributed import Executor
>>> executor = Executor('192.168.1.100:8787')
>>> dfs = [executor.submit(pd.read_csv, block['path'], workers=block['hosts'],
...                        names=columns, skiprows=1)
...        for block in blocks]

These operations produce Future objects that point to remote results on the worker computers. This does not pull results back to local memory. We can use these futures in later computations with the executor.
  • 74. 74
def sum_series(seq):
    result = seq[0]
    for s in seq[1:]:
        result = result.add(s, fill_value=0)
    return result

>>> counts = executor.map(lambda df: df.passenger_count.value_counts(), dfs)
>>> total = executor.submit(sum_series, counts)
>>> total.result()
0         259
1     9727301
2     1891581
3      566248
4      267540
5      789070
6      540444
7           7
8           5
9          16
208        19
  • 75. Bokeh 75 http://bokeh.pydata.org • Interactive visualization • Novel graphics • Streaming, dynamic, large data • For the browser, with or without a server • No need to write Javascript
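A minimal bokeh.plotting sketch of the "no JavaScript required" point; it writes a standalone HTML file and opens it in a browser (the file name is arbitrary):

from bokeh.plotting import figure, output_file, show

output_file('example.html')

p = figure(title='simple line example', x_axis_label='x', y_axis_label='y')
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)
p.circle([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], size=8)

show(p)   # all rendering happens in BokehJS in the browser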
  • 78. Previous: Javascript code generation 78
server.py generates a JavaScript string from plot.js.template (pulling in d3.js, highchart.js, etc.js), embeds it in HTML, and ships it to the browser, where the App Model drives D3, highcharts, flot, crossfilter, etc.
One-shot; no MVC interaction; no data streaming
  • 79. bokeh.py & bokeh.js 79
server.py builds a bokeh.py object graph for the App Model; it is serialized as JSON (optionally through bokeh-server) and mirrored as the BokehJS object graph rendered in the browser.
  • 87. Additional Demos & Topics 87 • Airline flights • Pandas table • Streaming / Animation • Large data rendering
  • 88. 88 • Dynamic, just-in-time compiler for Python & NumPy • Uses LLVM • Outputs x86 and GPU (CUDA, HSA) • (Premium version is in Accelerate part of 
 Anaconda Workgroup and Anaconda Enterprise subscriptions) http://numba.pydata.org Numba
  • 89. Python Compilation Space 89

                               | Ahead Of Time                              | Just In Time
Relies on CPython / libpython  | Cython, Shedskin, Nuitka (today), Pythran  | Numba, HOPE, Theano
Replaces CPython / libpython   | Nuitka (future)                            | Pyston, PyPy
  • 91. 91
@jit('void(f8[:,:],f8[:,:],f8[:,:])')
def filter(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2, j+l-n//2] * filt[k, l]
            output[i, j] = result

~1500x speed-up
  • 92. Numba Features • Numba supports: – Windows, OS X, and Linux – 32 and 64-bit x86 CPUs and NVIDIA GPUs – Python 2 and 3 – NumPy versions 1.6 through 1.9 • Does not require a C/C++ compiler on the user’s system. • < 70 MB to install. • Does not replace the standard Python interpreter
 (all of your existing Python libraries are still available) 92
  • 93. Numba Modes • object mode: Compiled code operates on Python objects. Only significant performance improvement is compilation of loops that can be compiled in nopython mode (see below). • nopython mode: Compiled code operates on “machine native” data. Usually within 25% of the performance of equivalent C or FORTRAN. 93
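A small sketch contrasting the two modes; the function is a toy example, not from the slides:

import numpy as np
from numba import jit

@jit                       # may fall back to object mode if types are unsupported
def total(arr):
    s = 0.0
    for x in arr:
        s += x
    return s

@jit(nopython=True)        # raises a compilation error instead of falling back
def total_np(arr):
    s = 0.0
    for x in arr:
        s += x
    return s

data = np.random.rand(1000000)
print(total(data), total_np(data))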
  • 95. The Basics 95
Annotations on the example shown on the slide: array allocation; looping over ndarray x as an iterator; using numpy math functions; returning a slice of the array; Numba decorator (nopython=True not required); 2.7x speedup! A reconstruction sketch follows below.
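The code on this slide is not captured in the transcript; the following is a hedged reconstruction that matches the annotations above, not the original function:

import numpy as np
from numba import jit

@jit                              # Numba decorator (nopython=True not required)
def basics(x):
    out = np.empty(x.size)        # array allocation
    i = 0
    for v in x:                   # looping over ndarray x as an iterator
        out[i] = np.sin(v) ** 2   # using numpy math functions
        i += 1
    return out[::2]               # returning a slice of the array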
  • 96. CUDA Python (in open-source Numba!) 96 CUDA Development using Python syntax for optimal performance! You have to understand CUDA at least a little — writing kernels that launch in parallel on the GPU
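A minimal numba.cuda sketch of what "writing kernels that launch in parallel on the GPU" means; it requires an NVIDIA GPU plus the CUDA toolkit, and the sizes are illustrative:

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)          # absolute index of this thread
    if i < x.size:            # guard threads that fall past the end
        out[i] = x[i] + y[i]

n = 100000
x = np.arange(n, dtype=np.float32)
y = 2 * x
out = np.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
add_kernel[blocks, threads](x, y, out)   # arrays are copied to/from the GPU automatically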
  • 98. Black-Scholes: Results 98 core i7 GeForce GTX 560 Ti About 9x faster on this GPU ~ same speed as CUDA-C
  • 99. Other interesting things • CUDA Simulator to debug your code in Python interpreter • Generalized ufuncs (@guvectorize) • Call ctypes and cffi functions directly and pass them as arguments • Preliminary support for types that understand the buffer protocol • Pickle Numba functions to run on remote execution engines • “numba annotate” to dump HTML annotated version of compiled code • See: http://numba.pydata.org/numba-doc/0.20.0/ 99
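As a sketch of the @guvectorize feature mentioned above, here is a generalized ufunc that adds a scalar to each element of a row and broadcasts over any leading dimensions (a toy example):

import numpy as np
from numba import guvectorize, float64

@guvectorize([(float64[:], float64, float64[:])], '(n),()->(n)')
def add_scalar(x, y, out):
    for i in range(x.shape[0]):
        out[i] = x[i] + y

a = np.arange(6, dtype=np.float64).reshape(2, 3)
print(add_scalar(a, 10.0))    # broadcasts over the leading dimension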
  • 100. What Doesn’t Work? (A non-comprehensive list) • Sets, lists, dictionaries, user defined classes (tuples do work!) • List, set and dictionary comprehensions • Recursion • Exceptions with non-constant parameters • Most string operations (buffer support is very preliminary!) • yield from • closures inside a JIT function (compiling JIT functions inside a closure works…) • Modifying globals • Passing an axis argument to numpy array reduction functions • Easy debugging (you have to debug in Python mode). 100
  • 101. How Numba Works 101
Python Function (bytecode) + Function Arguments
  → Bytecode Analysis → Type Inference → Numba IR → Rewrite IR → Lowering → LLVM IR → LLVM JIT → Machine Code → Cache → Execute!
@jit
def do_math(a, b):
    …
>>> do_math(x, y)
  • 102. Recently Added Numba Features • A new GPU target: the Heterogeneous System Architecture, supported by AMD APUs • Support for named tuples in nopython mode • Limited support for lists in nopython mode • On-disk caching of compiled functions (opt-in) — both LLVM and pre-compiled • A simulator for debugging GPU functions with the Python debugger on the CPU • Can choose to release the GIL in nopython functions • Ahead of time compilation • vectorize and guvectorize on GPU and parallel targets now in open-source Numba! • JIT Classes — coming soon! 102