© 2015 Continuum Analytics- Confidential & Proprietary
BIDS Data Science Seminar
Using Anaconda to light-up dark data.
Travis E. Oliphant, PhD
September 18, 2015
Started as a Scientist / Engineer
2
Images from BYU CERS Lab
Science led to Python
3
Raja Muthupillai
Armando Manduca
Richard Ehman
Jim Greenleaf
1997
\rho_0 (2\pi f)^2 \, U_i(a, f) = \left[ C_{ijkl}(a, f)\, U_{k,l}(a, f) \right]_{,j}

\Xi = \nabla \times U
“Distractions” led to my calling
4
5
Latest Cosmological Theory
6
Dark Data: CSV, hdf5, npz, logs, emails, and other files
in your company outside a traditional store
7
Dark Data: CSV, hdf5, npz, logs, emails, and other files
in your company outside a traditional store
8
Database Approach
(diagram) Data Sources → Data Store → Clients
9
Bring the Database to the Data
(diagram) Data Sources → Blaze (datashape, dask) → Clients
with NumPy, Pandas, SciPy, sklearn, etc. (for analytics)
Anaconda — portable environments
10
conda
Python & R Open Source Analytics
NumPy, SciPy, Pandas, Scikit-learn, Jupyter / IPython,
Numba, Matplotlib, Spyder, Numexpr, Cython, Theano,
Scikit-image, NLTK, NetworkX, IRKernel, dplyr, shiny,
ggplot2, tidyr, caret, nnet and 330+ packages
•Easy to install
•Intuitive to discover
•Quick to analyze
•Simple to collaborate
•Accessible to all
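A minimal conda workflow illustrating these points (the environment name and package list are illustrative, not from the slide):

conda create -n analytics python=2.7 numpy scipy pandas scikit-learn jupyter
source activate analytics
conda install numba matplotlib
conda list --export > environment.txt    # capture the environment to share or reproduce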
© 2015 Continuum Analytics- Confidential & Proprietary
DTYPE INNOVATION (AN ASIDE)
11
Key (potential) benefits of dtype
12
•Turns imperative code into declarative code
•Should provide a solid mechanism for ufunc dispatch
Imperative to Declarative
13
NumPyIO
June 1998
My First Python Extension
Reading Analyze Data-Format
fread, fwrite
Data Storage
dtype
arr[1:10,-5].field1
Function dispatch
14
def func(*args):
    key = tuple(arg.dtype for arg in args)
    return _funcmap[key](*args)

Highly Simplified! — quite a few details to do well…
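A runnable sketch of the dispatch pattern the slide outlines; _funcmap and the per-dtype implementations are hypothetical, only the dtype-keyed lookup is the point:

import numpy as np

def _add_float64(a, b):
    return a + b                      # fast path for float64 arrays

def _add_int32(a, b):
    return a.astype(np.int64) + b     # e.g. widen first to avoid overflow

_funcmap = {
    (np.dtype('float64'), np.dtype('float64')): _add_float64,
    (np.dtype('int32'),   np.dtype('int32')):   _add_int32,
}

def func(*args):
    key = tuple(arg.dtype for arg in args)   # the dtypes of all arguments form the key
    return _funcmap[key](*args)

x = np.arange(3, dtype='int32')
func(x, x)    # dispatches to _add_int32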
WHY BLAZE?
15
Thanks to Peter Wang for slides.
16
17
Data
18
“Math”
Data
19
Math
Big Data
20
Math
Big Data
21
Math
Big Data
22
Math
Big Data
Programs
23
“General Purpose Programming”
24
Analytics System
Domain-Specific
Query Language
25
26
?
27
Expressions
Metadata
Runtime
28
Expressions: + - / * ^ [], join, groupby, filter, map, sort, take, where, topk
Metadata: datashape, dtype, shape, stride
Data: hdf5, json, csv, xls, protobuf, avro, ...
Runtime: NumPy, Pandas, R, Julia, K, SQL, Spark, Mongo, Cassandra, ...
BLAZE ECOSYSTEM
29
Thanks to Christine Doig for slides.
30
blaze: interface to query data
datashape: data description language
DyND: dynamic, multidimensional arrays
dask: parallel computing
odo: data migration
castra: column store & query
bcolz: column store
Contributors: @mrocklin @cpcloud @quasiben @jcrist @cowlicks @FrancescAlted @mwiebe @izaid @eriknw @esc
Blaze Ecosystem
31
(ecosystem diagram) Data / Runtime / Expressions:
Expressions: blaze
Metadata: datashape
Data migration / storage: odo
Parallel, optimized compute: dask, numba, DyND
Data & runtimes: numpy, pandas, sql DB, spark, castra, bcolz
32
Data | Runtime | Expressions
APIs, syntax, language: blaze
metadata: datashape
compute (parallelize, optimize, JIT): dask
storage / containers: odo
BLAZE LIBRARY
33
Thanks to Christine Doig and Phillip Cloud for slides.
34
interface to query data on different storage systems
http://blaze.pydata.org/en/latest/
Blaze

from blaze import Data

iris = Data('iris.csv')                        # CSV
iris = Data('sqlite:///flowers.db::iris')      # SQL
iris = Data('mongodb://localhost/mydb::iris')  # MongoDB
iris = Data('iris.json')                       # JSON
iris = Data('s3://blaze-data/iris.csv')        # S3
…

Current focus is the “dark data” and pydata stack for run-time (dask, dynd, numpy,
pandas, x-ray, etc.) + customer needs (e.g. kdb, mongo).
35
Select columns       iris[['sepal_length', 'species']]

Operate              log(iris.sepal_length * 10)

Reduce               iris.sepal_length.mean()

Split-apply-combine  by(iris.species, shortest=iris.petal_length.min(),
                        longest=iris.petal_length.max(),
                        average=iris.petal_length.mean())

Add new columns      transform(iris,
                         sepal_ratio = iris.sepal_length / iris.sepal_width,
                         petal_ratio = iris.petal_length / iris.petal_width)

Text matching        iris.like(species='*versicolor')

Relabel columns      iris.relabel(petal_length='PETAL-LENGTH',
                                  petal_width='PETAL-WIDTH')

Filter               iris[(iris.species == 'Iris-setosa') &
                          (iris.sepal_length > 5.0)]
Blaze
36
datashape / blaze
Blaze uses datashape as its type system (like DyND)
>>> iris = Data('iris.json')
>>> iris.dshape
dshape("""var * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
}""")
37
a structured data description language
http://datashape.pydata.org/
datashape = dimensions * dtype

dimension unit types: var, 3, 4, ...
dtype unit types: string, int32, float64, ...

var * { x : int32, y : string, z : float64 }   (a tabular datashape)

record (ordered struct dtype): { x : int32, y : string, z : float64 }
a collection of types keyed by labels
Data Shape
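A tiny sketch of parsing a datashape with the datashape package, assuming its top-level dshape() parser:

from datashape import dshape

ds = dshape("var * { x: int32, y: string, z: float64 }")   # dimensions * record dtype
print(ds)
print(dshape("10 * var * float64"))                        # fixed and variable dimensions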
38
{
flowersdb: {
iris: var * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
}
},
iriscsv: var * {
sepal_length: ?float64,
sepal_width: ?float64,
petal_length: ?float64,
petal_width: ?float64,
species: ?string
},
irisjson: var * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
},
irismongo: 150 * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
}
}
datashape
# Arrays
3 * 4 * int32
3 * 4 * int32
10 * var * float64
3 * complex[float64]
# Arrays of Structures
100 * {
name: string,
birthday: date,
address: {
street: string,
city: string,
postalcode: string,
country: string
}
}
# Structure of Arrays
{
x: 100 * 100 * float32,
y: 100 * 100 * float32,
u: 100 * 100 * float32,
v: 100 * 100 * float32,
}
# Function prototype
(3 * int32, float64) -> 3 * float64
# Function prototype with broadcasting dimensions
(A... * int32, A... * int32) -> A... * int32
iriscsv:
source: iris.csv
irisdb:
source: sqlite:///flowers.db::iris
irisjson:
source: iris.json
dshape: "var * {name: string, amount:
float64}"
irismongo:
source: mongodb://localhost/mydb::iris
server.yaml
YAML
39
Builds off of Blaze uniform interface to host data
remotely through a JSON web API.
$ blaze-server server.yaml -e
localhost:6363/compute.json
Blaze Server — Lights up your Dark Data
40
Blaze Client
>>> from blaze import Data
>>> t = Data('blaze://localhost:6363')
>>> t.fields
[u'iriscsv', u'irisdb', u'irisjson', u'irismongo']
>>> t.iriscsv
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
>>> t.irisdb
petal_length petal_width sepal_length sepal_width species
0 1.4 0.2 5.1 3.5 Iris-setosa
1 1.4 0.2 4.9 3.0 Iris-setosa
2 1.3 0.2 4.7 3.2 Iris-setosa
Blaze Server
© 2015 Continuum Analytics- Confidential & Proprietary
Compute recipes work with existing libraries and
have multiple backends.
• python list
• numpy arrays
• dynd
• pandas DataFrame
• Spark, Impala
• Mongo
• dask
41
© 2015 Continuum Analytics- Confidential & Proprietary
42
• Ideally, you can layer expressions over any
data.
• Write once, deploy anywhere.
• Practically, expressions will work better on
specific data structures, formats, and engines.
• You will need to copy from one format and/or
engine to another.
ODO LIBRARY
43
Thanks to Phillip Cloud and Christine Doig for slides.
© 2015 Continuum Analytics- Confidential & Proprietary
Odo
• A library for turning things into other things
• Factored out from the blaze project
• Handles a huge variety of conversions
• odo is cp with types, for data
44
45
data migration, ~ cp with types, for data
http://odo.pydata.org/en/latest/
from odo import odo
odo(source, target)
odo('iris.json', 'mongodb://localhost/mydb::iris')
odo('iris.json', 'sqlite:///flowers.db::iris')
odo('iris.csv', 'iris.json')
odo('iris.csv', 'hdfs://hostname:iris.csv')
odo('hive://hostname/default::iris_csv',
'hive://hostname/default::iris_parquet',
stored_as='PARQUET', external=False)
odo
© 2015 Continuum Analytics- Confidential & Proprietary
46
Through a network of
conversions
How Does It Work?
© 2015 Continuum Analytics- Confidential & Proprietary
47
Each node is a type (DataFrame, list, sqlalchemy.Table, etc...)
Each edge is a conversion function
© 2015 Continuum Analytics- Confidential & Proprietary
It’s extensible!
48
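A minimal sketch of that extensibility, assuming odo's documented convert.register decorator; MyStore is a made-up container type used only for illustration:

import pandas as pd
from odo import odo, convert

class MyStore(object):
    # hypothetical in-house container
    def __init__(self, df):
        self.df = df

# Register a new edge in the conversion graph: DataFrame -> MyStore.
@convert.register(MyStore, pd.DataFrame, cost=1.0)
def dataframe_to_mystore(df, **kwargs):
    return MyStore(df)

# odo should now be able to chain existing edges with the new one:
# 'iris.csv' -> pandas.DataFrame -> MyStore
store = odo('iris.csv', MyStore)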
© 2015 Continuum Analytics- Confidential & Proprietary
DASK
49
Thanks to Christine Doig and Blake Griffith for slides
50
enables parallel computing
http://dask.pydata.org/en/latest/
parallel computing: single core computing → shared memory → distributed cluster
data scale: Gigabyte (fits in memory) → Terabyte (fits on disk) → Petabyte (fits on many disks)
dask
51
enables parallel computing
http://dask.pydata.org/en/latest/
parallel computing: single core computing → shared memory → distributed cluster
                    numpy, pandas → dask → dask.distributed
data scale: Gigabyte (fits in memory) → Terabyte (fits on disk) → Petabyte (fits on many disks)
dask
52
enables parallel computing
http://dask.pydata.org/en/latest/
parallel computing: single core computing → shared memory → distributed cluster
                    numpy, pandas → dask (threaded scheduler or multiprocessing scheduler) → dask.distributed
dask
53
numpy dask
>>> import numpy as np
>>> np_ones = np.ones((5000, 1000))
>>> np_ones
array([[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
...,
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.]])
>>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
>>> np_y
array([ 693.14718056, 693.14718056,
693.14718056, 693.14718056, 693.14718056])
>>> import dask.array as da
>>> da_ones = da.ones((5000000, 1000000),
chunks=(1000, 1000))
>>> da_ones.compute()
array([[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
...,
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.]])
>>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
>>> np_da_y = np.array(da_y) #fits in memory
array([ 693.14718056, 693.14718056,
693.14718056, 693.14718056, …, 693.14718056])
# Result doesn’t fit in memory
>>> da_y.to_hdf5('myfile.hdf5', 'result')
dask array
54
pandas dask
>>> import pandas as pd
>>> df = pd.read_csv('iris.csv')
>>> df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
>>> max_sepal_length_setosa = df[df.species ==
'setosa'].sepal_length.max()
5.7999999999999998
>>> import dask.dataframe as dd
>>> ddf = dd.read_csv('*.csv')
>>> ddf.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
…
>>> d_max_sepal_length_setosa = ddf[ddf.species ==
'setosa'].sepal_length.max()
>>> d_max_sepal_length_setosa.compute()
5.7999999999999998
dask dataframe
55
semi-structured data, like
JSON blobs or log files
>>> import dask.bag as db
>>> import json
# Get tweets as a dask.bag from compressed json files
>>> b = db.from_filenames('*.json.gz').map(json.loads)
# Take two items in dask.bag
>>> b.take(2)
({u'contributors': None,
u'coordinates': None,
u'created_at': u'Fri Oct 10 17:19:35 +0000 2014',
u'entities': {u'hashtags': [],
u'symbols': [],
u'trends': [],
u'urls': [],
u'user_mentions': []},
u'favorite_count': 0,
u'favorited': False,
u'filter_level': u'medium',
u'geo': None …
# Count the frequencies of user locations
>>> freq = b.pluck('user').pluck('location').frequencies()
# Get the result as a dataframe
>>> df = freq.to_dataframe()
>>> df.compute()
0 1
0 20916
1 Natal 2
2 Planet earth. Sheffield. 1
3 Mad, USERA 1
4 Brasilia DF - Brazil 2
5 Rondonia Cacoal 1
6 msftsrep || 4/5. 1
dask bag
56
>>> import dask
>>> from dask.distributed import Client
# client connected to 50 nodes, 2 workers per node.
>>> dc = Client('tcp://localhost:9000')
# or
>>> dc = Client('tcp://ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:9000')
>>> b = db.from_s3('githubarchive-data', '2015-*.json.gz').map(json.loads)
# use default single node scheduler
>>> top_commits.compute()
# use client with distributed cluster
>>> top_commits.compute(get=dc.get)
[(u'mirror-updates', 1463019),
(u'KenanSulayman', 235300),
(u'greatfirebot', 167558),
(u'rydnr', 133323),
(u'markkcc', 127625)]
dask distributed
57
dask / blaze
e.g. we can drive dask arrays with blaze.
>>> x = da.from_array(...) # Make a dask array
>>> from blaze import Data, log, compute
>>> d = Data(x) # Wrap with Blaze
>>> y = log(d + 1)[:5].sum(axis=1) # Do work as usual
>>> result = compute(y) # Fall back to dask
dask can be a backend/engine for blaze
•Collections build task graphs
•Schedulers execute task graphs
•Graph specification = uniting interface
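A minimal sketch of that graph specification: a plain dict of tasks that any dask scheduler can execute (the values and functions here are arbitrary):

from operator import add, mul
from dask.threaded import get      # one of dask's schedulers

dsk = {
    'x': 1,
    'y': 2,
    'z': (add, 'x', 'y'),          # a task is a tuple: (function, arguments...)
    'w': (mul, 'z', 10),
}

get(dsk, 'w')    # the scheduler walks the graph and returns 30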
58
Questions?
http://dask.pydata.org
59
NUMBA
60
Thanks to Stan Seibert for slides
Space of Python Compilation
61
Relies on CPython / libpython:
  Ahead Of Time: Cython, Shedskin, Nuitka (today), Pythran, Numba
  Just In Time: Numba, HOPE, Theano, Pyjion
Replaces CPython / libpython:
  Ahead Of Time: Nuitka (future)
  Just In Time: Pyston, PyPy
Compiler overview
62
Parsing Frontend (C, C++, Fortran, ObjC) → Intermediate Representation (IR) → Code Generation Backend (x86, ARM, PTX)
Numba
63
Parsing Frontend (Python, via Numba) → Intermediate Representation (IR) → LLVM Code Generation Backend (x86, ARM, PTX)
Example
64
Numba
How Numba works
65
@jit
def do_math(a, b):
    …

>>> do_math(x, y)

Python Function + Function Arguments → Bytecode Analysis → Numba IR → Type Inference →
Rewrite IR → Lowering → LLVM IR → LLVM JIT → Machine Code → Cache → Execute!
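A minimal, self-contained @jit sketch of the flow above (illustrative code, not the original slide's example): the first call triggers bytecode analysis, type inference, and compilation; later calls with the same argument types reuse the cached machine code.

import numpy as np
from numba import jit

@jit
def do_math(a, b):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

x = np.arange(1000000, dtype=np.float64)
y = np.arange(1000000, dtype=np.float64)

do_math(x, y)   # first call: compile for (float64[:], float64[:]), then run
do_math(x, y)   # second call: reuse the cached machine code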
Numba Features
66
• Numba supports:
Windows, OS X, and Linux
32 and 64-bit x86 CPUs and NVIDIA GPUs
Python 2 and 3
NumPy versions 1.6 through 1.9
• Does not require a C/C++ compiler on the user’s system.
• < 70 MB to install.
• Does not replace the standard Python interpreter

(all of your existing Python libraries are still available)
Numba Modes
67
• object mode: Compiled code operates on Python
objects. Only significant performance improvement is
compilation of loops that can be compiled in nopython
mode (see below).
• nopython mode: Compiled code operates on “machine
native” data. Usually within 25% of the performance of
equivalent C or FORTRAN.
How to Use Numba
68
1. Create a realistic benchmark test case.

(Do not use your unit tests as a benchmark!)
2. Run a profiler on your benchmark.

(cProfile is a good choice; see the sketch after this list)
3. Identify hotspots that could potentially be compiled by Numba with a
little refactoring.

(see rest of this talk and online documentation)
4. Apply @numba.jit and @numba.vectorize as needed to critical
functions. 

(Small rewrites may be needed to work around Numba limitations.)
5. Re-run benchmark to check if there was a performance improvement.
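For step 2, a small cProfile sketch using only the standard library (run_benchmark and the file name are placeholders for your own benchmark entry point):

import cProfile, pstats

cProfile.run('run_benchmark()', 'bench.prof')   # profile the benchmark and save stats
pstats.Stats('bench.prof').sort_stats('cumulative').print_stats(20)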
A Whirlwind Tour of Numba Features
69
• Sometimes you can’t create a simple or efficient array
expression or ufunc. Use Numba to work with array
elements directly.
• Example: Suppose you have a boolean grid and you
want to find the maximum number neighbors a cell has
in the grid:
The Basics
70
The Basics
71
Array Allocation
Looping over ndarray x as an iterator
Using numpy math functions
Returning a slice of the array
2.7x speedup!
Numba decorator

(nopython=True not required)
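The code on these slides is an image; below is a sketch of the kind of function they show, written to match the captions above (array allocation, explicit loops, numpy math, a slice of the array), not the original code:

import numpy as np
from numba import jit

@jit                                              # Numba decorator; nopython=True not required here
def max_neighbors(grid):
    n, m = grid.shape
    counts = np.zeros((n, m), dtype=np.int32)     # array allocation
    for i in range(1, n - 1):                     # loop over interior cells
        for j in range(1, m - 1):
            total = 0
            for di in range(-1, 2):
                for dj in range(-1, 2):
                    if di == 0 and dj == 0:
                        continue
                    if grid[i + di, j + dj]:
                        total += 1
            counts[i, j] = total
    return np.max(counts[1:-1, 1:-1])             # numpy math on a slice of the array

grid = np.random.rand(500, 500) > 0.5             # a boolean grid
max_neighbors(grid)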
Calling Other Functions
72
Calling Other Functions
73
This function is not
inlined
This function is inlined
9.8x speedup compared to doing
this with numpy functions
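The slide's code is an image; here is a sketch of the pattern it illustrates (one jitted function calling another, which Numba/LLVM can inline), with made-up functions:

import numpy as np
from numba import jit

@jit(nopython=True)
def clamp(x, lo, hi):
    # small helper: calls to it from other jitted code stay in machine code
    if x < lo:
        return lo
    elif x > hi:
        return hi
    return x

@jit(nopython=True)
def clamp_all(values, lo, hi):
    out = np.empty_like(values)
    for i in range(values.shape[0]):
        out[i] = clamp(values[i], lo, hi)   # candidate for inlining
    return out

clamp_all(np.random.randn(1000000), -1.0, 1.0)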
Making Ufuncs
74
Making Ufuncs
75
Monte Carlo simulating 500,000 tournaments in 50 ms
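The ufunc example on these slides is an image; a minimal @vectorize sketch of the same idea (the function body is illustrative, not the slide's Monte Carlo tournament code):

import numpy as np
from numba import vectorize

@vectorize(['float64(float64, float64)'])
def rel_diff(a, b):
    # scalar code, compiled into a real NumPy ufunc
    return abs(a - b) / (abs(a) + abs(b))

x = np.random.rand(1000000)
y = np.random.rand(1000000)
rel_diff(x, y)     # elementwise over whole arrays
rel_diff(x, 0.5)   # broadcasting comes for free, like any ufunc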
Case-study -- j0 from scipy.special
76
• scipy.special was one of the first libraries I wrote (in 1999)
• extended “umath” module by adding new “universal functions” to
compute many scientific functions by wrapping C and Fortran libs.
• Bessel functions are solutions to a differential equation:
x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2) y = 0

y = J_\alpha(x)

J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(n\tau - x \sin\tau) \, d\tau
scipy.special.j0 wraps cephes algorithm
77
Don't need this anymore!
Result --- equivalent to compiled code
78
In [6]: %timeit vj0(x)
10000 loops, best of 3: 75 us per loop
In [7]: from scipy.special import j0
In [8]: %timeit j0(x)
10000 loops, best of 3: 75.3 us per loop
But! Now code is in Python and can be experimented with
more easily (and moved to the GPU / accelerator more easily)!
Word starting to get out!
79
Recent numba mailing list reports experiments of a SciPy author who got 2x
speed-up by removing their Cython type annotations and surrounding the
function with numba.jit (with a few minor changes needed to the code).

As soon as Numba's ahead-of-time compilation moves beyond the experimental
stage one can legitimately use Numba to create a library that you ship to
others (who then don't need to have Numba installed — or just need a Numba
run-time installed).

SciPy (and NumPy) would look very different if Numba had existed 16 years
ago when SciPy was getting started… — and you would all be happier.
Generators
80
Releasing the GIL
81
Many fret about the GIL in Python.
With the PyData stack you often have multi-threaded code.
In the PyData stack we quite often release the GIL:
NumPy does it
SciPy does it (quite often)
Scikit-learn (now) does it
Pandas (now) does it when possible
Cython makes it easy
Numba makes it easy
Releasing the GIL
82
Only nopython mode
functions can release
the GIL
Releasing the GIL
83
2.8x speedup with 4 cores
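The slide's code is an image; a sketch of releasing the GIL with nogil=True so ordinary Python threads can run a jitted function in parallel (sizes and the reduction are illustrative):

import threading
import numpy as np
from numba import jit

@jit(nopython=True, nogil=True)    # only nopython-mode functions can drop the GIL
def row_sum(x):
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i]
    return total

data = np.random.rand(4, 5000000)
threads = [threading.Thread(target=row_sum, args=(row,)) for row in data]
for t in threads:
    t.start()
for t in threads:
    t.join()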
CUDA Python (in open-source Numba!)
84
CUDA Development
using Python syntax for
optimal performance!
You have to understand
CUDA at least a little —
writing kernels that
launch in parallel on the
GPU
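A minimal CUDA Python kernel sketch using numba.cuda (not the Black-Scholes code from the next slides); it shows the pieces this slide mentions: writing a kernel and launching it across many parallel GPU threads:

import numpy as np
from numba import cuda

@cuda.jit
def scale(out, x, factor):
    i = cuda.grid(1)            # absolute index of this thread
    if i < x.shape[0]:          # guard: the grid may be larger than the array
        out[i] = x[i] * factor

x = np.arange(1000000, dtype=np.float32)
out = np.zeros_like(x)

threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](out, x, 2.0)   # launch configuration in brackets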
Example: Black-Scholes
85
Black-Scholes: Results
86
core i7 vs GeForce GTX 560 Ti: about 9x faster on this GPU, ~ same speed as CUDA-C
Other interesting things
87
• CUDA Simulator to debug your code in Python interpreter
• Generalized ufuncs (@guvectorize)
• Call ctypes and cffi functions directly and pass them as
arguments
• Preliminary support for types that understand the buffer protocol
• Pickle Numba functions to run on remote execution engines
• “numba annotate” to dump HTML annotated version of compiled
code
• See: http://numba.pydata.org/numba-doc/0.20.0/
What Doesn’t Work?
88
(A non-comprehensive list)
• Sets, lists, dictionaries, user defined classes (tuples do work!)
• List, set and dictionary comprehensions
• Recursion
• Exceptions with non-constant parameters
• Most string operations (buffer support is very preliminary!)
• yield from
• closures inside a JIT function (compiling JIT functions inside a closure works…)
• Modifying globals
• Passing an axis argument to numpy array reduction functions
• Easy debugging (you have to debug in Python mode).
The (Near) Future
89
(Also a non-comprehensive list)
• “JIT Classes”
• Better support for strings/bytes, buffers, and parsing use-
cases
• More coverage of the Numpy API (advanced indexing, etc)
• Documented extension API for adding your own types, low
level function implementations, and targets.
• Better debug workflows
Recently Added Numba Features
90
• A new GPU target: the Heterogeneous System Architecture, supported
by AMD APUs
• Support for named tuples in nopython mode
• Limited support for lists in nopython mode
• On-disk caching of compiled functions (opt-in)
• A simulator for debugging GPU functions with the Python debugger
on the CPU
• Can choose to release the GIL in nopython functions
• Many speed improvements
© 2015 Continuum Analytics- Confidential & Proprietary
New Features
• Support for ARMv7 (Raspberry Pi 2)
• Python 3.5 support
• NumPy 1.10 support
• Faster loading of pre-compiled functions from the disk
cache
• ufunc compilation for multithreaded CPU and GPU targets
(features only in NumbaPro previously).
91
Conclusion
92
• Lots of progress in the past year!
• Try out Numba on your numerical and NumPy-related
projects:
conda install numba
• Your feedback helps us make Numba better!

Tell us what you would like to see:



https://github.com/numba/numba
• Stay tuned for more exciting stuff this year…
© 2015 Continuum Analytics- Confidential & Proprietary
Thanks
September 18, 2015
•DARPA XDATA program (Chris White and Wade Shen) which helped fund
Numba, Blaze, Dask and Odo.
•Investors of Continuum.
•Clients and Customers of Continuum who help support these projects.
•NumFOCUS volunteers
•PyData volunteers

Weitere ähnliche Inhalte

Was ist angesagt?

PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 KeynotePeter Wang
 
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...Databricks
 
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...Databricks
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Querying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonQuerying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonDaniel Rodriguez
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about HadoopDonald Miner
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...jaxLondonConference
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for HadoopJim Dowling
 
data.table and H2O at LondonR with Matt Dowle
data.table and H2O at LondonR with Matt Dowledata.table and H2O at LondonR with Matt Dowle
data.table and H2O at LondonR with Matt DowleSri Ambati
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]knowbigdata
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataJetlore
 

Was ist angesagt? (17)

PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
 
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...
 
PyData Introduction
PyData IntroductionPyData Introduction
PyData Introduction
 
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
Querying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonQuerying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with python
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for Hadoop
 
data.table and H2O at LondonR with Matt Dowle
data.table and H2O at LondonR with Matt Dowledata.table and H2O at LondonR with Matt Dowle
data.table and H2O at LondonR with Matt Dowle
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 

Andere mochten auch

Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData SolutionsTravis Oliphant
 
Creative Interactive Browser Visualizations with Bokeh by Bryan Van de ven
Creative Interactive Browser Visualizations with Bokeh by Bryan Van de venCreative Interactive Browser Visualizations with Bokeh by Bryan Van de ven
Creative Interactive Browser Visualizations with Bokeh by Bryan Van de venPyData
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
 
Interactive Visualization With Bokeh (SF Python Meetup)
Interactive Visualization With Bokeh (SF Python Meetup)Interactive Visualization With Bokeh (SF Python Meetup)
Interactive Visualization With Bokeh (SF Python Meetup)Peter Wang
 
Contract and recruitment methods
Contract and recruitment methodsContract and recruitment methods
Contract and recruitment methodscallumharrison
 
Venue Contract Negotiation
Venue Contract NegotiationVenue Contract Negotiation
Venue Contract NegotiationScott Stedronsky
 
- Contract Recruitment Manager -
- Contract Recruitment Manager -- Contract Recruitment Manager -
- Contract Recruitment Manager -Matthew Gostelow
 
Jenny Hill's Tips For Bid Writing
Jenny Hill's Tips For Bid WritingJenny Hill's Tips For Bid Writing
Jenny Hill's Tips For Bid WritingJenny Hill
 
Sap Contract Recruitment Consultant - London
Sap Contract Recruitment Consultant - LondonSap Contract Recruitment Consultant - London
Sap Contract Recruitment Consultant - Londondaniellewilkinson
 
Ritchie Bros Dubai February Unreserved auction - NO MINIMUM BIDS
Ritchie Bros Dubai February Unreserved auction - NO MINIMUM BIDS Ritchie Bros Dubai February Unreserved auction - NO MINIMUM BIDS
Ritchie Bros Dubai February Unreserved auction - NO MINIMUM BIDS Phillip A. Weston
 
BIDS CPD Workshop Nov 2011
BIDS CPD Workshop Nov 2011BIDS CPD Workshop Nov 2011
BIDS CPD Workshop Nov 2011Barry McCulloch
 
Physician Contract from Recruitment to Retirement
Physician Contract from Recruitment to RetirementPhysician Contract from Recruitment to Retirement
Physician Contract from Recruitment to RetirementHealthcare_Pro
 
Silent Auction Secrets: 5 simple changes to generate bigger bids
Silent Auction Secrets: 5 simple changes to generate bigger bidsSilent Auction Secrets: 5 simple changes to generate bigger bids
Silent Auction Secrets: 5 simple changes to generate bigger bids4Good.org
 

Andere mochten auch (13)

Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
 
Creative Interactive Browser Visualizations with Bokeh by Bryan Van de ven
Creative Interactive Browser Visualizations with Bokeh by Bryan Van de venCreative Interactive Browser Visualizations with Bokeh by Bryan Van de ven
Creative Interactive Browser Visualizations with Bokeh by Bryan Van de ven
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Interactive Visualization With Bokeh (SF Python Meetup)
Interactive Visualization With Bokeh (SF Python Meetup)Interactive Visualization With Bokeh (SF Python Meetup)
Interactive Visualization With Bokeh (SF Python Meetup)
 
Contract and recruitment methods
Contract and recruitment methodsContract and recruitment methods
Contract and recruitment methods
 
Venue Contract Negotiation
Venue Contract NegotiationVenue Contract Negotiation
Venue Contract Negotiation
 
- Contract Recruitment Manager -
- Contract Recruitment Manager -- Contract Recruitment Manager -
- Contract Recruitment Manager -
 
Jenny Hill's Tips For Bid Writing
Jenny Hill's Tips For Bid WritingJenny Hill's Tips For Bid Writing
Jenny Hill's Tips For Bid Writing
 
Sap Contract Recruitment Consultant - London
Sap Contract Recruitment Consultant - LondonSap Contract Recruitment Consultant - London
Sap Contract Recruitment Consultant - London
 
Ritchie Bros Dubai February Unreserved auction - NO MINIMUM BIDS
Ritchie Bros Dubai February Unreserved auction - NO MINIMUM BIDS Ritchie Bros Dubai February Unreserved auction - NO MINIMUM BIDS
Ritchie Bros Dubai February Unreserved auction - NO MINIMUM BIDS
 
BIDS CPD Workshop Nov 2011
BIDS CPD Workshop Nov 2011BIDS CPD Workshop Nov 2011
BIDS CPD Workshop Nov 2011
 
Physician Contract from Recruitment to Retirement
Physician Contract from Recruitment to RetirementPhysician Contract from Recruitment to Retirement
Physician Contract from Recruitment to Retirement
 
Silent Auction Secrets: 5 simple changes to generate bigger bids
Silent Auction Secrets: 5 simple changes to generate bigger bidsSilent Auction Secrets: 5 simple changes to generate bigger bids
Silent Auction Secrets: 5 simple changes to generate bigger bids
 

Ähnlich wie Bids talk 9.18

Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
Managing your black friday logs - Code Europe
Managing your black friday logs - Code EuropeManaging your black friday logs - Code Europe
Managing your black friday logs - Code EuropeDavid Pilato
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015dhiguero
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsMapR Technologies
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC OsloDavid Pilato
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiData Con LA
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Data Grids with Oracle Coherence
Data Grids with Oracle CoherenceData Grids with Oracle Coherence
Data Grids with Oracle CoherenceBen Stopford
 
CrateDB 101: Sensor data
CrateDB 101: Sensor dataCrateDB 101: Sensor data
CrateDB 101: Sensor dataClaus Matzinger
 
Get Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor DataGet Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor DataCrate.io
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017StampedeCon
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchSylvain Wallez
 
Tactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherTactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherDatabricks
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 

Ähnlich wie Bids talk 9.18 (20)

Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Managing your black friday logs - Code Europe
Managing your black friday logs - Code EuropeManaging your black friday logs - Code Europe
Managing your black friday logs - Code Europe
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC Oslo
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
 
Dask: Scaling Python
Dask: Scaling PythonDask: Scaling Python
Dask: Scaling Python
 
Data Grids with Oracle Coherence
Data Grids with Oracle CoherenceData Grids with Oracle Coherence
Data Grids with Oracle Coherence
 
CrateDB 101: Sensor data
CrateDB 101: Sensor dataCrateDB 101: Sensor data
CrateDB 101: Sensor data
 
Get Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor DataGet Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor Data
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling Elasticsearch
 
Tactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherTactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark Together
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 

Mehr von Travis Oliphant

Array computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyDataArray computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyDataTravis Oliphant
 
SciPy Latin America 2019
SciPy Latin America 2019SciPy Latin America 2019
SciPy Latin America 2019Travis Oliphant
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019Travis Oliphant
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationTravis Oliphant
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsTravis Oliphant
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with AnacondaTravis Oliphant
 
Blaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for PythonBlaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for PythonTravis Oliphant
 
Numba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyNumba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyTravis Oliphant
 

Mehr von Travis Oliphant (11)

Array computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyDataArray computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyData
 
SciPy Latin America 2019
SciPy Latin America 2019SciPy Latin America 2019
SciPy Latin America 2019
 
PyCon Estonia 2019
PyCon Estonia 2019PyCon Estonia 2019
PyCon Estonia 2019
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft Presentation
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUs
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
 
Blaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for PythonBlaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for Python
 
Numba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyNumba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPy
 
Numba lightning
Numba lightningNumba lightning
Numba lightning
 
Numba
NumbaNumba
Numba
 

Kürzlich hochgeladen

Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profileakrivarotava
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencessuser9e7c64
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 

Kürzlich hochgeladen (20)

Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 

Bids talk 9.18

  • 35. Blaze expressions on the iris data:
    Select columns:       iris[['sepal_length', 'species']]
    Operate:              log(iris.sepal_length * 10)
    Reduce:               iris.sepal_length.mean()
    Split-apply-combine:  by(iris.species,
                             shortest=iris.petal_length.min(),
                             longest=iris.petal_length.max(),
                             average=iris.petal_length.mean())
    Add new columns:      transform(iris,
                             sepal_ratio=iris.sepal_length / iris.sepal_width,
                             petal_ratio=iris.petal_length / iris.petal_width)
    Text matching:        iris.like(species='*versicolor')
    Relabel columns:      iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH')
    Filter:               iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]
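    [Editor's note] These expressions are lazy; a minimal sketch of materializing one with compute()
    (assuming an iris.csv with the same column names as above; the result type depends on the backend,
    typically a pandas object for a CSV source):

        from blaze import Data, by, compute

        iris = Data('iris.csv')
        expr = by(iris.species, average=iris.petal_length.mean())
        result = compute(expr)   # evaluate the lazy expression against the bound data
        print(result)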
  • 36. datashape + blaze: Blaze uses datashape as its type system (like DyND)
    >>> iris = Data('iris.json')
    >>> iris.dshape
    dshape("""var * {
        petal_length: float64,
        petal_width: float64,
        sepal_length: float64,
        sepal_width: float64,
        species: string
        }""")
  • 37. Data Shape: a structured data description language (http://datashape.pydata.org/)
    A datashape combines dimensions with a dtype, joined by *:
        unit types:  int32, float64, string, ...
        dimensions:  fixed sizes (3, 4, ...) or variable length (var)
    Tabular datashape:
        var * { x : int32, y : string, z : float64 }
    Record dtype (an ordered struct, a collection of types keyed by labels):
        { x : int32, y : string, z : float64 }
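    [Editor's note] A small sketch of working with datashapes from Python using the datashape package;
    the attribute names here (measure, shape) are from its public API as best recalled, so treat them as
    approximate:

        from datashape import dshape

        ds = dshape('var * { x : int32, y : string, z : float64 }')
        ds.measure   # the record dtype: { x : int32, y : string, z : float64 }
        ds.shape     # the dimensions: (var,)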
  • 38. datashape
    One datashape can describe several datasets at once (here, the four iris datasets served in the next slide):
    {
        flowersdb: {
            iris: var * { petal_length: float64, petal_width: float64,
                          sepal_length: float64, sepal_width: float64, species: string }
        },
        iriscsv:   var * { sepal_length: ?float64, sepal_width: ?float64,
                           petal_length: ?float64, petal_width: ?float64, species: ?string },
        irisjson:  var * { petal_length: float64, petal_width: float64,
                           sepal_length: float64, sepal_width: float64, species: string },
        irismongo: 150 * { petal_length: float64, petal_width: float64,
                           sepal_length: float64, sepal_width: float64, species: string }
    }

    # Arrays
    3 * 4 * int32
    10 * var * float64
    3 * complex[float64]

    # Arrays of Structures
    100 * {
        name: string,
        birthday: date,
        address: {
            street: string,
            city: string,
            postalcode: string,
            country: string
        }
    }

    # Structure of Arrays
    {
        x: 100 * 100 * float32,
        y: 100 * 100 * float32,
        u: 100 * 100 * float32,
        v: 100 * 100 * float32,
    }

    # Function prototype
    (3 * int32, float64) -> 3 * float64

    # Function prototype with broadcasting dimensions
    (A... * int32, A... * int32) -> A... * int32
  • 39. Blaze Server — Lights up your Dark Data
    Builds on Blaze's uniform interface to host data remotely through a JSON web API.

    server.yaml:
        iriscsv:
            source: iris.csv
        irisdb:
            source: sqlite:///flowers.db::iris
        irisjson:
            source: iris.json
            dshape: "var * {name: string, amount: float64}"
        irismongo:
            source: mongodb://localhost/mydb::iris

    $ blaze-server server.yaml -e
    localhost:6363/compute.json
  • 40. Blaze Client (talking to the Blaze Server)
    >>> from blaze import Data
    >>> t = Data('blaze://localhost:6363')
    >>> t.fields
    [u'iriscsv', u'irisdb', u'irisjson', u'irismongo']
    >>> t.iriscsv
        sepal_length  sepal_width  petal_length  petal_width      species
    0            5.1          3.5           1.4          0.2  Iris-setosa
    1            4.9          3.0           1.4          0.2  Iris-setosa
    2            4.7          3.2           1.3          0.2  Iris-setosa
    >>> t.irisdb
        petal_length  petal_width  sepal_length  sepal_width      species
    0            1.4          0.2           5.1          3.5  Iris-setosa
    1            1.4          0.2           4.9          3.0  Iris-setosa
    2            1.3          0.2           4.7          3.2  Iris-setosa
  • 41. Compute recipes work with existing libraries and have multiple backends:
    • python list
    • numpy arrays
    • dynd
    • pandas DataFrame
    • Spark, Impala
    • Mongo
    • dask
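    [Editor's note] A minimal sketch of what "multiple backends" means in practice: the same abstract
    expression evaluated against two different data containers. The names and data here are illustrative,
    not from the talk:

        import pandas as pd
        from blaze import symbol, compute

        t = symbol('t', 'var * {name: string, amount: int64}')   # abstract table
        expr = t[t.amount > 100].amount.sum()

        py_data = [('Alice', 150), ('Bob', 50), ('Carol', 200)]
        df_data = pd.DataFrame(py_data, columns=['name', 'amount'])

        compute(expr, py_data)   # evaluated with plain Python
        compute(expr, df_data)   # the same expression, evaluated with pandas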
  • 42. • Ideally, you can layer expressions over any data: write once, deploy anywhere.
        • Practically, expressions will work better on specific data structures, formats, and engines,
          so you will need to copy from one format and/or engine to another.
  • 43. ODO LIBRARY (thanks to Phillip Cloud and Christine Doig for slides)
  • 44. Odo
    • A library for turning things into other things
    • Factored out from the Blaze project
    • Handles a huge variety of conversions
    • odo is cp with types, for data
  • 45. odo: data migration, ~ cp with types, for data (http://odo.pydata.org/en/latest/)
    from odo import odo
    odo(source, target)

    odo('iris.json', 'mongodb://localhost/mydb::iris')
    odo('iris.json', 'sqlite:///flowers.db::iris')
    odo('iris.csv', 'iris.json')
    odo('iris.csv', 'hdfs://hostname:iris.csv')
    odo('hive://hostname/default::iris_csv',
        'hive://hostname/default::iris_parquet',
        stored_as='PARQUET', external=False)
  • 46. How Does It Work? Through a network of conversions.
  • 47. Each node is a type (DataFrame, list, sqlalchemy.Table, etc.) and each edge is a conversion function.
  • 48. It's extensible!
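    [Editor's note] A sketch of the extension mechanism: odo documents a decorator-based registration of
    new edges in the conversion graph. The container type below (RingBuffer) is purely hypothetical;
    only the convert.register pattern is taken from odo:

        from odo import convert

        class RingBuffer(object):            # hypothetical container type
            def __init__(self, items):
                self.items = list(items)

        @convert.register(list, RingBuffer, cost=1.0)
        def ringbuffer_to_list(rb, **kwargs):
            # once registered, odo can route any conversion that passes through ``list``
            return list(rb.items)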
  • 49. DASK (thanks to Christine Doig and Blake Griffith for slides)
  • 50. dask enables parallel computing (http://dask.pydata.org/en/latest/)
    Scale of compute: single core computing → parallel computing (shared memory) → distributed cluster
    Scale of data:    Gigabyte (fits in memory) → Terabyte (fits on disk) → Petabyte (fits on many disks)
  • 51. The tools that cover each regime:
    Gigabyte, fits in memory, single core:                numpy, pandas
    Terabyte, fits on disk, shared-memory parallel:       dask
    Petabyte, fits on many disks, distributed cluster:    dask.distributed
  • 52. Schedulers for each regime:
    single core computing:      numpy, pandas
    shared-memory parallel:     dask (threaded scheduler, multiprocessing scheduler)
    distributed cluster:        dask.distributed
  • 53. dask array (numpy-like interface)
    numpy:
        >>> import numpy as np
        >>> np_ones = np.ones((5000, 1000))
        >>> np_ones
        array([[ 1.,  1.,  1., ...,  1.,  1.,  1.],
               ...,
               [ 1.,  1.,  1., ...,  1.,  1.,  1.]])
        >>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
        >>> np_y
        array([ 693.14718056,  693.14718056,  693.14718056,  693.14718056,  693.14718056])
    dask:
        >>> import dask.array as da
        >>> da_ones = da.ones((5000000, 1000000), chunks=(1000, 1000))
        >>> da_ones.compute()
        array([[ 1.,  1.,  1., ...,  1.,  1.,  1.],
               ...,
               [ 1.,  1.,  1., ...,  1.,  1.,  1.]])
        >>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
        >>> np_da_y = np.array(da_y)          # result fits in memory
        array([ 693.14718056,  693.14718056,  693.14718056,  693.14718056, ...,  693.14718056])
        # Result doesn't fit in memory? Stream it to disk instead:
        >>> da_y.to_hdf5('myfile.hdf5', 'result')
  • 54. dask dataframe (pandas-like interface)
    pandas:
        >>> import pandas as pd
        >>> df = pd.read_csv('iris.csv')
        >>> df.head()
           sepal_length  sepal_width  petal_length  petal_width      species
        0           5.1          3.5           1.4          0.2  Iris-setosa
        1           4.9          3.0           1.4          0.2  Iris-setosa
        2           4.7          3.2           1.3          0.2  Iris-setosa
        3           4.6          3.1           1.5          0.2  Iris-setosa
        4           5.0          3.6           1.4          0.2  Iris-setosa
        >>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
        5.7999999999999998
    dask:
        >>> import dask.dataframe as dd
        >>> ddf = dd.read_csv('*.csv')
        >>> ddf.head()            # same five rows as above ...
        >>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
        >>> d_max_sepal_length_setosa.compute()
        5.7999999999999998
  • 55. dask bag: semi-structured data, like JSON blobs or log files
    >>> import dask.bag as db
    >>> import json
    # Get tweets as a dask.bag from compressed json files
    >>> b = db.from_filenames('*.json.gz').map(json.loads)
    # Take two items from the dask.bag
    >>> b.take(2)
    ({u'contributors': None,
      u'coordinates': None,
      u'created_at': u'Fri Oct 10 17:19:35 +0000 2014',
      u'entities': {u'hashtags': [], u'symbols': [], u'trends': [], u'urls': [], u'user_mentions': []},
      u'favorite_count': 0,
      u'favorited': False,
      u'filter_level': u'medium',
      u'geo': None
      ...
    # Count the frequencies of user locations
    >>> freq = b.pluck('user').pluck('location').frequencies()
    # Get the result as a dataframe
    >>> df = freq.to_dataframe()
    >>> df.compute()
                              0      1
    0                               20916
    1                       Natal       2
    2    Planet earth. Sheffield.       1
    3                  Mad, USERA       1
    4        Brasilia DF - Brazil       2
    5             Rondonia Cacoal       1
    6            msftsrep || 4/5.       1
  • 56. dask distributed
    >>> import dask
    >>> from dask.distributed import Client
    # client connected to 50 nodes, 2 workers per node
    >>> dc = Client('tcp://localhost:9000')
    # or
    >>> dc = Client('tcp://ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:9000')
    >>> b = db.from_s3('githubarchive-data', '2015-*.json.gz').map(json.loads)
    # ... build a top_commits expression from the bag (elided on the slide), then:
    # use the default single-node scheduler
    >>> top_commits.compute()
    # use the client with a distributed cluster
    >>> top_commits.compute(get=dc.get)
    [(u'mirror-updates', 1463019),
     (u'KenanSulayman', 235300),
     (u'greatfirebot', 167558),
     (u'rydnr', 133323),
     (u'markkcc', 127625)]
  • 57. dask + blaze: dask can be a backend/engine for blaze
    e.g. we can drive dask arrays with blaze:
    >>> x = da.from_array(...)                  # make a dask array
    >>> from blaze import Data, log, compute
    >>> d = Data(x)                             # wrap with Blaze
    >>> y = log(d + 1)[:5].sum(axis=1)          # do work as usual
    >>> result = compute(y)                     # fall back to dask
  • 58. • Collections build task graphs
        • Schedulers execute task graphs
        • Graph specification = uniting interface
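    [Editor's note] A minimal sketch of that uniting interface: a task graph is just a dict, and a
    scheduler is just a function that executes it (the values and keys below are illustrative):

        from operator import add, mul
        import dask

        dsk = {'x': 1,
               'y': (add, 'x', 2),      # y = x + 2
               'z': (mul, 'y', 10)}     # z = y * 10

        dask.get(dsk, 'z')              # synchronous scheduler -> 30
        # the threaded, multiprocessing, and distributed schedulers consume the same graphs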
  • 60. NUMBA (thanks to Stan Seibert for slides)
  • 61. Space of Python Compilation
                                      Ahead Of Time                          Just In Time
    Relies on CPython / libpython     Cython, Shedskin, Nuitka (today),      Numba, HOPE, Theano, Pyjion
                                      Pythran
    Replaces CPython / libpython      Nuitka (future)                        Pyston, PyPy
  • 65. How Numba works
    @jit
    def do_math(a, b):
        ...
    >>> do_math(x, y)

    Python Function + Function Arguments
      → Bytecode Analysis → Type Inference → Numba IR
      → Rewrite IR → Lowering → LLVM IR → LLVM JIT → Machine Code
      → Cache → Execute!
  • 66. Numba Features
    • Numba supports:
        Windows, OS X, and Linux
        32- and 64-bit x86 CPUs and NVIDIA GPUs
        Python 2 and 3
        NumPy versions 1.6 through 1.9
    • Does not require a C/C++ compiler on the user's system.
    • < 70 MB to install.
    • Does not replace the standard Python interpreter
      (all of your existing Python libraries are still available)
  • 67. Numba Modes
    • object mode: compiled code operates on Python objects. The only significant performance
      improvement is compilation of loops that can be compiled in nopython mode (see below).
    • nopython mode: compiled code operates on "machine native" data. Usually within 25% of the
      performance of equivalent C or FORTRAN.
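    [Editor's note] A minimal sketch of requesting nopython mode explicitly; with nopython=True, Numba
    raises an error instead of silently falling back to object mode (function and data are illustrative):

        import numpy as np
        from numba import jit

        @jit(nopython=True)
        def clipped_sum(arr, hi):
            total = 0.0
            for x in arr:                 # typed loop compiled to machine code
                total += x if x < hi else hi
            return total

        clipped_sum(np.random.rand(1000000), 0.5)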
  • 68. How to Use Numba
    1. Create a realistic benchmark test case.
       (Do not use your unit tests as a benchmark!)
    2. Run a profiler on your benchmark.
       (cProfile is a good choice.)
    3. Identify hotspots that could potentially be compiled by Numba with a little refactoring.
       (See the rest of this talk and the online documentation.)
    4. Apply @numba.jit and @numba.vectorize as needed to critical functions.
       (Small rewrites may be needed to work around Numba limitations.)
    5. Re-run the benchmark to check whether there was a performance improvement.
  • 69. A Whirlwind Tour of Numba Features
    • Sometimes you can't create a simple or efficient array expression or ufunc. Use Numba to work
      with array elements directly.
    • Example: suppose you have a boolean grid and you want to find the maximum number of neighbors
      a cell has in the grid (a sketch follows below).
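    [Editor's note] A sketch of the kind of element-wise loop the slide describes, assuming a 2-D boolean
    grid and 8-connected neighbors; this is not the exact code from the talk:

        import numpy as np
        from numba import jit

        @jit(nopython=True)
        def max_neighbors(grid):
            n, m = grid.shape
            best = 0
            for i in range(1, n - 1):
                for j in range(1, m - 1):
                    count = 0
                    for di in range(-1, 2):
                        for dj in range(-1, 2):
                            if (di != 0 or dj != 0) and grid[i + di, j + dj]:
                                count += 1
                    if count > best:
                        best = count
            return best

        grid = np.random.rand(200, 200) > 0.5
        max_neighbors(grid)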
  • 71. The Basics
    [annotated code example] Array allocation; looping over ndarray x as an iterator; using numpy math
    functions; returning a slice of the array; Numba decorator (nopython=True not required).
    Result: 2.7x speedup. A sketch follows below.
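    [Editor's note] A sketch reconstructing the kind of function those annotations describe; not
    necessarily the code shown on the slide:

        import numpy as np
        import numba

        @numba.jit                       # Numba decorator (nopython=True not required here)
        def nan_compact(x):
            out = np.empty_like(x)       # array allocation
            n = 0
            for element in x:            # looping over ndarray x as an iterator
                if not np.isnan(element):    # using numpy math functions
                    out[n] = element
                    n += 1
            return out[:n]               # returning a slice of the array

        data = np.random.rand(10000)
        data[data > 0.8] = np.nan
        nan_compact(data)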
  • 73. Calling Other Functions
    [annotated code example] One helper function is not inlined, the other is inlined;
    9.8x speedup compared to doing this with numpy functions.
  • 75. Making Ufuncs
    [code example] Monte Carlo simulation of 500,000 tournaments in 50 ms. A sketch of the
    @vectorize pattern follows below.
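    [Editor's note] A minimal sketch of building a ufunc with @vectorize; the function and signature
    here are illustrative, not the Monte Carlo example from the slide:

        import numpy as np
        from numba import vectorize

        @vectorize(['float64(float64, float64)'])
        def rel_diff(x, y):
            return 2.0 * (x - y) / (x + y)

        a = np.random.rand(1000000)
        b = np.random.rand(1000000)
        rel_diff(a, b)      # broadcasts like any NumPy ufunc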
  • 76. Case study: j0 from scipy.special
    • scipy.special was one of the first libraries I wrote (in 1999)
    • extended the "umath" module by adding new "universal functions" to compute many scientific
      functions by wrapping C and Fortran libs.
    • Bessel functions are solutions to a differential equation:
      \[ x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2)\,y = 0, \qquad y = J_\alpha(x) \]
      \[ J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos\!\left(n\tau - x\sin\tau\right) d\tau \]
  • 77. scipy.special.j0 wraps the cephes algorithm
    [code image of the C implementation] Don't need this anymore!
  • 78. Result: equivalent to compiled code
    In [6]: %timeit vj0(x)
    10000 loops, best of 3: 75 us per loop
    In [7]: from scipy.special import j0
    In [8]: %timeit j0(x)
    10000 loops, best of 3: 75.3 us per loop
    But! Now the code is in Python and can be experimented with more easily
    (and moved to the GPU / accelerator more easily)!
  • 79. Word starting to get out!
    The numba mailing list recently reported experiments by a SciPy author who got a 2x speed-up by
    removing their Cython type annotations and surrounding the function with numba.jit (with a few
    minor changes needed to the code).
    As soon as Numba's ahead-of-time compilation moves beyond the experimental stage, one can
    legitimately use Numba to create a library that you ship to others (who then don't need to have
    Numba installed, or just need a Numba run-time installed).
    SciPy (and NumPy) would look very different if Numba had existed 16 years ago when SciPy was
    getting started, and you would all be happier.
  • 81. Releasing the GIL
    • Many fret about the GIL in Python.
    • With the PyData stack you often have multi-threaded code anyway; in the PyData stack we quite
      often release the GIL:
        NumPy does it
        SciPy does it (quite often)
        Scikit-learn (now) does it
        Pandas (now) does it when possible
        Cython makes it easy
        Numba makes it easy
  • 82. Releasing the GIL
    Only nopython mode functions can release the GIL.
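    [Editor's note] A minimal sketch of the nogil pattern: a nopython-mode function compiled with
    nogil=True can run concurrently in ordinary Python threads (the chunking scheme is illustrative):

        import numpy as np
        from numba import jit
        from concurrent.futures import ThreadPoolExecutor

        @jit(nopython=True, nogil=True)
        def chunk_sum(x):
            total = 0.0
            for v in x:
                total += v
            return total

        data = np.random.rand(4000000)
        chunks = np.split(data, 4)
        with ThreadPoolExecutor(max_workers=4) as ex:
            partials = list(ex.map(chunk_sum, chunks))   # threads run the compiled code in parallel
        sum(partials)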
  • 83. Releasing the GIL
    [benchmark plot] 2.8x speedup with 4 cores.
  • 84. CUDA Python (in open-source Numba!)
    CUDA development using Python syntax for optimal performance!
    You have to understand CUDA at least a little: you are writing kernels that launch in parallel
    on the GPU.
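    [Editor's note] A minimal sketch of a CUDA Python kernel with numba.cuda (requires a CUDA-capable
    GPU and toolkit; the kernel itself is illustrative):

        import numpy as np
        from numba import cuda

        @cuda.jit
        def scale(out, x, factor):
            i = cuda.grid(1)                 # global thread index
            if i < x.shape[0]:
                out[i] = x[i] * factor

        x = np.arange(1024, dtype=np.float64)
        out = np.zeros_like(x)
        threads = 128
        blocks = (x.shape[0] + threads - 1) // threads
        scale[blocks, threads](out, x, 2.0)  # launch configuration: [blocks, threads]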
  • 86. Black-Scholes: Results
    Core i7 CPU vs GeForce GTX 560 Ti: about 9x faster on this GPU, ~ same speed as CUDA-C.
  • 87. Other interesting things
    • CUDA Simulator to debug your code in the Python interpreter
    • Generalized ufuncs (@guvectorize); see the sketch after this list
    • Call ctypes and cffi functions directly and pass them as arguments
    • Preliminary support for types that understand the buffer protocol
    • Pickle Numba functions to run on remote execution engines
    • "numba annotate" to dump an HTML annotated version of compiled code
    • See: http://numba.pydata.org/numba-doc/0.20.0/
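    [Editor's note] A minimal sketch of a generalized ufunc with @guvectorize, following the documented
    signature/layout pattern; the operation itself is illustrative:

        import numpy as np
        from numba import guvectorize, float64

        @guvectorize([(float64[:], float64, float64[:])], '(n),()->(n)')
        def scale_rows(row, factor, out):
            # core operation on one row; the gufunc machinery loops over leading dimensions
            for i in range(row.shape[0]):
                out[i] = row[i] * factor

        a = np.random.rand(4, 3)
        scale_rows(a, 10.0)     # applies the (n),()->(n) core op to each row of a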
  • 88. What Doesn’t Work? 88 (A non-comprehensive list) • Sets, lists, dictionaries, user defined classes (tuples do work!) • List, set and dictionary comprehensions • Recursion • Exceptions with non-constant parameters • Most string operations (buffer support is very preliminary!) • yield from • closures inside a JIT function (compiling JIT functions inside a closure works…) • Modifying globals • Passing an axis argument to numpy array reduction functions • Easy debugging (you have to debug in Python mode).
  • 89. The (Near) Future (also a non-comprehensive list)
    • "JIT Classes"
    • Better support for strings/bytes, buffers, and parsing use-cases
    • More coverage of the NumPy API (advanced indexing, etc.)
    • Documented extension API for adding your own types, low-level function implementations, and targets
    • Better debug workflows
  • 90. Recently Added Numba Features
    • A new GPU target: the Heterogeneous System Architecture, supported by AMD APUs
    • Support for named tuples in nopython mode
    • Limited support for lists in nopython mode
    • On-disk caching of compiled functions (opt-in)
    • A simulator for debugging GPU functions with the Python debugger on the CPU
    • Can choose to release the GIL in nopython functions
    • Many speed improvements
  • 91. New Features
    • Support for ARMv7 (Raspberry Pi 2)
    • Python 3.5 support
    • NumPy 1.10 support
    • Faster loading of pre-compiled functions from the disk cache
    • ufunc compilation for multithreaded CPU and GPU targets (features only in NumbaPro previously)
  • 92. Conclusion
    • Lots of progress in the past year!
    • Try out Numba on your numerical and NumPy-related projects:
        conda install numba
    • Your feedback helps us make Numba better! Tell us what you would like to see:
        https://github.com/numba/numba
    • Stay tuned for more exciting stuff this year…
  • 93. Thanks (September 18, 2015)
    • DARPA XDATA program (Chris White and Wade Shen), which helped fund Numba, Blaze, Dask and Odo
    • Investors of Continuum
    • Clients and customers of Continuum who help support these projects
    • NumFOCUS volunteers
    • PyData volunteers