Massively Parallel Processing with Procedural Python by Ronert Obst PyData Berlin 2014

BUILT FOR THE SPEED OF BUSINESS

@ronert_obst
2© Copyright 2014 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2014 Pivotal. All rights reserved.
Massively Parallel Processing
with Procedural Python
How do we use the PyData stack in data science
engagements at Pivotal?
Ian Huston, Ronert Obst
Data Scientists, Pivotal

@ronert_obst
3© Copyright 2014 Pivotal. All rights reserved.
Some Links for this talk
Ÿ  Simple code examples:
https://github.com/ronert/plpython_examples
Ÿ  IPython notebook rendered with nbviewer:
http://bit.ly/1u5t1Mj

@ronert_obst
About Pivotal
Agile PaaS Big Data
Pivotal™
HD
with GemFire XD®
Pivotal™
Greenplum
Database
Pivotal™
GemFire®

@ronert_obst
What do our customers look like?
Ÿ  Large enterprises with lots of data collected
–  Work with PBs of data, structured & unstructured
Ÿ  Not able to get what they want out of their data
–  Legacy systems
–  Slow response times
–  Small Samples
Ÿ  Transform into data driven enterprises

@ronert_obst
Open Source is Pivotal

@ronert_obst
Pivotal’s Open Source Contributions
Lots more interesting small projects:
•  PyMADlib – Python Wrapper for MADlib
https://github.com/gopivotal/pymadlib
•  PivotalR – R wrapper for MADlib
http://github.com/madlib-internal/PivotalR
•  Part-of-speech tagger for Twitter via SQL
http://vatsan.github.io/gp-ark-tweet-nlp/
•  Pandas via psql
(interactive PostgreSQL terminal)
https://github.com/vatsan/pandas_via_psql

@ronert_obst
Typical Engagement Tech Setup
Ÿ  Platform:
–  Greenplum Analytics Database (GPDB)
–  Pivotal HD Hadoop Distribution + HAWQ (SQL on Hadoop)
Ÿ  Open Source Options (http://gopivotal.com):
–  Greenplum Community Edition
–  Pivotal HD Community Edition (HAWQ not included)
–  MADlib in-database machine learning library (http://madlib.net)
Ÿ  Where Python fits in:
–  IPython Notebook for exploratory analysis
–  PL/Python running in-database, with nltk, scikit-learn etc
–  Pandas, Matplotlib etc.
Pivotal™
HD
with GemFire XD®
Pivotal™
Greenplum
Database

@ronert_obst
PL/Python

@ronert_obst
Intro to PL/Python
Ÿ  Procedural languages need to be installed on each database used
Ÿ  Name in SQL is plpythonu, ‘u’ means untrusted so need to be super user to install
Ÿ  Syntax is like normal Python function with function definition line replaced by SQL wrapper
CREATE
FUNCTION
pymax
(a
integer,
b
integer)

RETURNS
integer

AS
$$

if
a
>
b:

return
a

return
b

$$
LANGUAGE
plpythonu;

SQL wrapper
SQL wrapper
Normal Python

@ronert_obst
Returning Results
Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
Ÿ  Composite types can be returned by creating a composite type in the database:

CREATE
TYPE
named_value
AS
(

name

text,

value

integer);

Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table:

CREATE
FUNCTION
make_pair
(name
text,
value
integer)

RETURNS
named_value

AS
$$

return
[
name,
value
]

#
or
alternatively,
as
tuple:
return
(
name,
value
)

#
or
as
dict:
return
{
"name":
name,
"value":
value
}

#
or
as
an
object
with
attributes
.name
and
.value

$$
LANGUAGE
plpythonu;

Ÿ  For functions which return multiple rows, prefix “setof” before the return type

@ronert_obst
Returning more results
You can return multiple results by wrapping them in a sequence (tuple, list or set),
an iterator or a generator:
CREATE
FUNCTION
make_pair
(name
text)

RETURNS
SETOF
named_value

AS
$$

return
([
name,
1
],
[
name,
2
],
[
name,
3])

$$
LANGUAGE
plpythonu;

Sequence
Generator
CREATE
FUNCTION
make_pair
(name
text)

RETURNS
SETOF
named_value

AS
$$

for
i
in
range(3):

yield
(name,
i)

$$
LANGUAGE
plpythonu;

@ronert_obst
Accessing Packages
Ÿ  On Greenplum DB: To be available packages must be installed on the
individual segment nodes.
–  Can use “parallel ssh” tool gpssh to conda/pip install
–  Currently Greenplum DB ships with Python 2.6 (!)
Ÿ  Then just import as usual inside function:

CREATE
FUNCTION
make_pair
(name
text)

RETURNS
named_value

AS
$$

import
numpy
as
np

return
((name,i)
for
i
in
np.arange(3))

$$
LANGUAGE
plpythonu;

@ronert_obst
MPP Architectural Overview
Think of it as multiple
PostgreSQL servers
Workers
Master

@ronert_obst
Benefits of PL/Python
Ÿ  Easy to bring your code to the data
Ÿ  When SQL falls short, leverage your Python experience
Ÿ  Apply Python across petabytes of data with minimal
overhead or additional requirements
Ÿ  Results are already in the database system, ready for further
analysis or storage

@ronert_obst
MADlib

@ronert_obst
Going Beyond Data Parallelism
Ÿ  PL/Python only allows us to run ‘n’ models in parallel
Ÿ  No global model in PL/Python. Solution:
•  Open Source!
https://github.com/madlib/madlib
•  Works on Greenplum DB,
PostgreSQL and HAWQ
•  Active development by Pivotal
-  Latest Release: v1.6 (July 2014)
•  Downloads and Docs:
http://madlib.net/

@ronert_obst
MADlib In-Database
Functions
Predictive Modeling Library
Linear Systems
•  Sparse and Dense Solvers
Matrix Factorization
•  Single Value Decomposition (SVD)
•  Low-Rank
Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards
•  Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber white,
clustered, marginal effects)
Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market
Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation
Descriptive Statistics
Sketch-based Estimators
•  CountMin (Cormode-
Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent
Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions

@ronert_obst
Architecture
RDBMS Query Processing
(Greenplum, PostgreSQL, HAWQ…)
Low-level Abstraction Layer
(matrix operations, C++)
RDBMS
Built-In
Functions
User Interface
High-level Abstraction Layer
(iteration controller, ...)
Functions for Inner Loops
(for streaming algorithms)
“Driver” Functions
(outer loops of iterative algorithms, optimizer invocations)
Python
Python with templated
SQL
SQL, generated from
specification
C++

@ronert_obst
Examples

@ronert_obst
In Database Image Processing
Ÿ  Many use cases
–  Defect detection in
manufacturing
–  Tumor detection in medical
images
Ÿ  Challenge:
Size of main
memory
Beck et al. Sci Transl Med 2011.
Name Row Col R G B
img.jpg 331 188 250 249 255
img.jpg 332 188 248 250 255
img.jpg 331 189 249 249 255

@ronert_obst
Image Smoothing
create
or
replace
function
smooth(maxr
integer,
maxc
integer,
im_array
integer[])

returns
integer[]

as
$$

import
numpy
as
np

import
scipy
as
sc

import
scipy.ndimage
as
ndi

smooth
=
np.reshape(im_array,
(maxr+1,
maxc+1))

smooth
=
ndi.uniform_filter(smooth,
size=3)

return
np.reshape(smooth.astype(int),
(maxr+1)*(maxc+1))

$$
language
plpythonu;

select
im_id,
smooth(max(row),
max(col),

array_agg(blue_intensity
order
by
row,col))

from
(select
im_id,row,col,blue_intensity

from
images_table
where
im_id=1878

order
by
im_id,
row,
col)
t

group
by
im_id
distributed
by
(im_id);

@ronert_obst
Pivotal GNIP Decahose Pipeline
Parallel Parsing
of JSON
(PL/Python)
Topic Analysis
through MADlib LDA
Unsupervised
Sentiment Analysis
(PL/Python)
D3.js
Twitter Decahose
(~55 million tweets/day)
Source: http
Sink: hdfs
HDFS
External
Tables
PXF
Nightly Cron Jobs

@ronert_obst
Sentiment Analysis – PL/Python Functions
Sentiment Scored
Tweets
Use learned phrasal
polarities to score
sentiment of new tweets
1: Parts-of-speech Tagger : Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
Part-of-speech
tagger1
Break-up Tweets into
tokens and tag their
parts-of-speech
Phrase Extraction
Semi-Supervised Sentiment Classification
Phrasal Polarity
Scoring
PL/Python

@ronert_obst
Transport for London
Traffic Disruption feed
Pivotal Greenplum
Database Deduplication Feature Creation
Modelling & Machine Learning
d3.js
&
NVD3

Interactive SVG figures
Transport Disruption Prediction Pipeline

@ronert_obst
Get in touch
Feel free to contact me about PL/Python, or more generally
about Data Science.
@ronert_obst
robst@gopivotal.com
www.ronert-obst.com

Massively Parallel Processing with Procedural Python by Ronert Obst PyData Berlin 2014

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Massively Parallel Processing with Procedural Python by Ronert Obst PyData Berlin 2014

Ähnlich wie Massively Parallel Processing with Procedural Python by Ronert Obst PyData Berlin 2014 (20)

Mehr von PyData

Mehr von PyData (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Massively Parallel Processing with Procedural Python by Ronert Obst PyData Berlin 2014