Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

BUILT FOR THE SPEED OF BUSINESS

@ianhuston
2© Copyright 2014 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2014 Pivotal. All rights reserved.
Massively Parallel Processing
with Procedural Languages
PL/R and PL/Python
Ian Huston, @ianhuston
Data Scientist, Pivotal

@ianhuston
3© Copyright 2014 Pivotal. All rights reserved.
Some Links for this talk
Ÿ  Simple code examples:
https://github.com/ihuston/plpython_examples
Ÿ  IPython notebook rendered with nbviewer:
http://tinyurl.com/ih-plpython
Ÿ  More info (especially for PL/R):
http://gopivotal.github.io/gp-r/

@ianhuston
The Pivotal Message
Agile: data driven apps and
rapid time to value
Data Lake: store everything,
analyze anything
Enterprise PaaS: revolutionary
software development and
speed; build the right thing

@ianhuston
Open Source is Pivotal

@ianhuston
Introducing Pivotal Data Labs
Our Charter:
Pivotal Data Labs is Pivotal’s differentiated and highly opinionated data-centric
service delivery organization.
Our Goals:
Expedite customer time-to-value and ROI, by driving business-aligned innovation
and solutions assurance within Pivotal’s Data Fabric technologies.
Drive customer adoption and autonomy across the full spectrum of Pivotal Data
technologies through best-in-class data science and data engineering services,
with a deep emphasis on knowledge transfer.

@ianhuston
Procedural Languages

@ianhuston
MPP Architectural Overview
Think of it as multiple
PostGreSQL servers
Workers
Master

@ianhuston
User-Defined Functions (UDFs)
Ÿ  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.
Ÿ  Simple UDFs are SQL queries with calling arguments and return types.
Definition:
CREATE
FUNCTION
times2(INT)

RETURNS
INT

AS
$$

SELECT
2
*
$1

$$
LANGUAGE
sql;

Execution:
SELECT
times2(1);

times2

-‐-‐-‐-‐-‐-‐-‐-‐

2

(1
row)

@ianhuston
Ÿ  The interpreter/VM of the language ‘X’ is
installed on each node of the Greenplum
Database Cluster
•  Data Parallelism:
-  PL/X piggybacks on
Greenplum’s MPP architecture
•  Allows users to write Greenplum/
PostgreSQL functions in the R/Python/
Java, Perl, pgsql or C languages Standby
Master
…
Master
Host
SQL
Interconnect
Segment Host
Segment
Segment
Segment Host
Segment
Segment
Segment Host
Segment
Segment
Segment Host
Segment
Segment
PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}

@ianhuston
Data Parallelism
Ÿ  Little or no effort is required to break up the problem into a
number of parallel tasks, and there exists no dependency (or
communication) between those parallel tasks.
Ÿ  Examples:
–  Measure the height of each student in a classroom (explicitly
parallelizable by student)
–  MapReduce
–  Individual machine learning models per customer/region/etc.

@ianhuston
Intro to PL/Python
Ÿ  Procedural languages need to be installed on each database used.
Ÿ  Syntax is like normal Python function inside a SQL wrapper
Ÿ  Need to specify return type
CREATE
FUNCTION
pymax
(a
integer,
b
integer)

RETURNS
integer

AS
$$

if
a
>
b:

return
a

return
b

$$
LANGUAGE
plpythonu;

SQL wrapper
SQL wrapper
Normal Python

@ianhuston
Returning Results
Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
Ÿ  Composite types can be returned by creating a composite type in the database:

CREATE
TYPE
named_value
AS
(

name

text,

value

integer

);

Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table:
CREATE
FUNCTION
make_pair
(name
text,
value
integer)

RETURNS
named_value

AS
$$

return
[
name,
value
]

#
or
alternatively,
as
tuple:
return
(
name,
value
)

#
or
as
dict:
return
{
"name":
name,
"value":
value
}

#
or
as
an
object
with
attributes
.name
and
.value

$$
LANGUAGE
plpythonu;

Ÿ  For functions which return multiple rows, prefix “setof” before the return type

@ianhuston
Parallelisation
Separate data into units using GROUP BY
Aggregate input for function using array_agg
SELECT
name
, plp.linregr(array_agg(x), array_agg(y))
FROM plp.test_data
GROUP BY name ORDER BY name;

@ianhuston
Benefits of Procedural Languages
Ÿ  Easy to bring your code to the data.
Ÿ  When SQL falls short leverage your Python/R/Java/C
experience quickly.
Ÿ  Apply Python/R packages across terabytes of data with
minimal overhead or additional requirements.
Ÿ  Results are already in the database system, ready for further
analysis or storage.

@ianhuston
Text Analytics for Churn Prediction
Customer
A major telecom company
Business Problem
Reducing churn through more accurate
models
Challenges
Ÿ  Existing models only used structured
features
Ÿ  Call center memos had poor structure and
had lots of typos
Solution
Ÿ  Built sentiment analysis models to predict
churn and topic models to understand topics
of conversation in call center memos
Ÿ  Achieved 16% improvement in ROC curve
for Churn Prediction

@ianhuston
Predicting Customs Clearance Intercepts
Customer
A major courier delivery services company
Business Problem
7% of shipments into the US are intercepted
by Customs based on package description.
Challenge
Ÿ  Data Volume: Previously not able to build
models with rich features for billions of
packages.
Ÿ  Data Variety: Customer’s prior technology
did not allow them to extract features from
unstructured shipment documents.
Solution
•  Processed shipment description documents
using PL/Python. Tokenized, stemmed, removed
stop words, etc.
•  Built features to summarize the shippers’ prior
shipment and interception history
•  Built a model which predicts the propensity of a
package being intercepted (AUC =0.86)
Intercepted Not Intercepted

@ianhuston
Network User Behavior Anomaly Detection
Customer
A major global financial service enterprise
Business Problem
Detect anomalous user behaviour in the
global enterprise computer network
Challenges
10 Billions events in 6 months; 15K+
network devices; No existing SIEM solutions
can model user behavioral resource access
baseline and enable anomaly detection in an
adaptive and scalable architecture.
Solution
An innovative Graph Mining based algorithmic
framework with advanced Machine Learning.
Network topology and temporal behaviors are both
modeled. (Patent pending). Implemented in MPP
and PL/R, enabling parallel model training and
behavior risk scoring. Successfully identified DLP
violating anomalous users.

@ianhuston
MADlib

@ianhuston
MADlib: The Origin
•  First mention of MAD analytics was at VLDB 2009
MAD Skills: New Analysis Practices for Big Data
J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton
(with help from: Noelle Sio, David Hubbard, James Marca)
http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf
•  MADlib project initiated in late 2010:
Greenplum Analytics team and Prof. Joe Hellerstein
UrbanDictionary
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills
•  Open Source!
https://github.com/madlib/madlib
•  Works on Greenplum DB,
PostgreSQL and also HAWQ &
Impala
•  Active development by Pivotal
-  Latest Release: v1.5 (Apr 2014)
•  Downloads and Docs:
http://madlib.net/

@ianhuston
MADlib In-Database
Functions
Predictive Modeling Library
Linear Systems
•  Sparse and Dense Solvers
Matrix Factorization
•  Single Value Decomposition (SVD)
•  Low-Rank
Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards
•  Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber white,
clustered, marginal effects)
Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market
Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation
Descriptive Statistics
Sketch-based Estimators
•  CountMin (Cormode-
Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent
Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions

@ianhuston
Get in touch
Feel free to contact me about PL/Python & PL/R, or more
generally about Data Science and opportunities available.
@ianhuston
ihuston @ gopivotal.com
http://www.ianhuston.net

Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

Similar to Data Science Amsterdam - Massively Parallel Processing with Procedural Languages (20)

More from Ian Huston

More from Ian Huston (10)

Recently uploaded

Recently uploaded (20)

Data Science Amsterdam - Massively Parallel Processing with Procedural Languages