SlideShare a Scribd company logo
1 of 23
Download to read offline
BUILT FOR THE SPEED OF BUSINESS
@ianhuston
2© Copyright 2014 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2014 Pivotal. All rights reserved.
Massively Parallel Processing
with Procedural Languages
PL/R and PL/Python
Ian Huston, @ianhuston
Data Scientist, Pivotal
@ianhuston
3© Copyright 2014 Pivotal. All rights reserved.
Some Links for this talk
Ÿ  Simple code examples:
https://github.com/ihuston/plpython_examples
Ÿ  IPython notebook rendered with nbviewer:
http://tinyurl.com/ih-plpython
Ÿ  More info (especially for PL/R):
http://gopivotal.github.io/gp-r/
@ianhuston
4© Copyright 2014 Pivotal. All rights reserved.
The Pivotal Message
Agile: data driven apps and
rapid time to value
Data Lake: store everything,
analyze anything
Enterprise PaaS: revolutionary
software development and
speed; build the right thing
@ianhuston
5© Copyright 2014 Pivotal. All rights reserved.
Open Source is Pivotal
@ianhuston
6© Copyright 2014 Pivotal. All rights reserved.
Introducing Pivotal Data Labs
Our Charter:
Pivotal Data Labs is Pivotal’s differentiated and highly opinionated data-centric
service delivery organization.
Our Goals:
Expedite customer time-to-value and ROI, by driving business-aligned innovation
and solutions assurance within Pivotal’s Data Fabric technologies.
Drive customer adoption and autonomy across the full spectrum of Pivotal Data
technologies through best-in-class data science and data engineering services,
with a deep emphasis on knowledge transfer.
@ianhuston
7© Copyright 2014 Pivotal. All rights reserved. 7© Copyright 2013 Pivotal. All rights reserved. 7© Copyright 2013 Pivotal. All rights reserved. 7© Copyright 2013 Pivotal. All rights reserved. 7© Copyright 2014 Pivotal. All rights reserved.
Procedural Languages
@ianhuston
8© Copyright 2014 Pivotal. All rights reserved.
MPP Architectural Overview
Think of it as multiple
PostGreSQL servers
Workers
Master
@ianhuston
9© Copyright 2014 Pivotal. All rights reserved.
User-Defined Functions (UDFs)
Ÿ  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.
Ÿ  Simple UDFs are SQL queries with calling arguments and return types.
Definition:
CREATE	
  FUNCTION	
  times2(INT)	
  
RETURNS	
  INT	
  
AS	
  $$	
  
	
  	
  	
  	
  SELECT	
  2	
  *	
  $1	
  
$$	
  LANGUAGE	
  sql;	
  
Execution:
SELECT	
  times2(1);	
  
	
  times2	
  	
  
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐	
  
	
  	
  	
  	
  	
  	
  2	
  
(1	
  row)	
  
@ianhuston
10© Copyright 2014 Pivotal. All rights reserved.
Ÿ  The interpreter/VM of the language ‘X’ is
installed on each node of the Greenplum
Database Cluster
•  Data Parallelism:
-  PL/X piggybacks on
Greenplum’s MPP architecture
•  Allows users to write Greenplum/
PostgreSQL functions in the R/Python/
Java, Perl, pgsql or C languages Standby
Master
…
Master
Host
SQL
Interconnect
Segment Host
Segment
Segment
Segment Host
Segment
Segment
Segment Host
Segment
Segment
Segment Host
Segment
Segment
PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
@ianhuston
11© Copyright 2014 Pivotal. All rights reserved.
Data Parallelism
Ÿ  Little or no effort is required to break up the problem into a
number of parallel tasks, and there exists no dependency (or
communication) between those parallel tasks.
Ÿ  Examples:
–  Measure the height of each student in a classroom (explicitly
parallelizable by student)
–  MapReduce
–  Individual machine learning models per customer/region/etc.
@ianhuston
12© Copyright 2014 Pivotal. All rights reserved.
Intro to PL/Python
Ÿ  Procedural languages need to be installed on each database used.
Ÿ  Syntax is like normal Python function inside a SQL wrapper
Ÿ  Need to specify return type
CREATE	
  FUNCTION	
  pymax	
  (a	
  integer,	
  b	
  integer)	
  
	
  	
  RETURNS	
  integer	
  
AS	
  $$	
  
	
  	
  if	
  a	
  >	
  b:	
  
	
  	
  	
  	
  return	
  a	
  
	
  	
  return	
  b	
  
$$	
  LANGUAGE	
  plpythonu;	
  
SQL wrapper
SQL wrapper
Normal Python
@ianhuston
13© Copyright 2014 Pivotal. All rights reserved.
Returning Results
Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
Ÿ  Composite types can be returned by creating a composite type in the database:	
  
CREATE	
  TYPE	
  named_value	
  AS	
  (	
  
	
  	
  name	
  	
  text,	
  
	
  	
  value	
  	
  integer	
  
);	
  
Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table:
CREATE	
  FUNCTION	
  make_pair	
  (name	
  text,	
  value	
  integer)	
  
	
  	
  RETURNS	
  named_value	
  
AS	
  $$	
  
	
  	
  return	
  [	
  name,	
  value	
  ]	
  
	
  	
  #	
  or	
  alternatively,	
  as	
  tuple:	
  return	
  (	
  name,	
  value	
  )	
  
	
  	
  #	
  or	
  as	
  dict:	
  return	
  {	
  "name":	
  name,	
  "value":	
  value	
  }	
  
	
  	
  #	
  or	
  as	
  an	
  object	
  with	
  attributes	
  .name	
  and	
  .value	
  
$$	
  LANGUAGE	
  plpythonu;	
  
Ÿ  For functions which return multiple rows, prefix “setof” before the return type
@ianhuston
14© Copyright 2014 Pivotal. All rights reserved.
Parallelisation
Separate data into units using GROUP BY
Aggregate input for function using array_agg
SELECT
name
, plp.linregr(array_agg(x), array_agg(y))
FROM plp.test_data
GROUP BY name ORDER BY name;
@ianhuston
15© Copyright 2014 Pivotal. All rights reserved.
Benefits of Procedural Languages
Ÿ  Easy to bring your code to the data.
Ÿ  When SQL falls short leverage your Python/R/Java/C
experience quickly.
Ÿ  Apply Python/R packages across terabytes of data with
minimal overhead or additional requirements.
Ÿ  Results are already in the database system, ready for further
analysis or storage.
@ianhuston
16© Copyright 2014 Pivotal. All rights reserved.
Text Analytics for Churn Prediction
Customer
A major telecom company
Business Problem
Reducing churn through more accurate
models
Challenges
Ÿ  Existing models only used structured
features
Ÿ  Call center memos had poor structure and
had lots of typos
Solution
Ÿ  Built sentiment analysis models to predict
churn and topic models to understand topics
of conversation in call center memos
Ÿ  Achieved 16% improvement in ROC curve
for Churn Prediction
@ianhuston
17© Copyright 2014 Pivotal. All rights reserved.
Predicting Customs Clearance Intercepts
Customer
A major courier delivery services company
Business Problem
7% of shipments into the US are intercepted
by Customs based on package description.
Challenge
Ÿ  Data Volume: Previously not able to build
models with rich features for billions of
packages.
Ÿ  Data Variety: Customer’s prior technology
did not allow them to extract features from
unstructured shipment documents.
Solution
•  Processed shipment description documents
using PL/Python. Tokenized, stemmed, removed
stop words, etc.
•  Built features to summarize the shippers’ prior
shipment and interception history
•  Built a model which predicts the propensity of a
package being intercepted (AUC =0.86)
Intercepted Not Intercepted
@ianhuston
18© Copyright 2014 Pivotal. All rights reserved.
Network User Behavior Anomaly Detection
Customer
A major global financial service enterprise
Business Problem
Detect anomalous user behaviour in the
global enterprise computer network
Challenges
10 Billions events in 6 months; 15K+
network devices; No existing SIEM solutions
can model user behavioral resource access
baseline and enable anomaly detection in an
adaptive and scalable architecture.
Solution
An innovative Graph Mining based algorithmic
framework with advanced Machine Learning.
Network topology and temporal behaviors are both
modeled. (Patent pending). Implemented in MPP
and PL/R, enabling parallel model training and
behavior risk scoring. Successfully identified DLP
violating anomalous users.
@ianhuston
19© Copyright 2014 Pivotal. All rights reserved. 19© Copyright 2013 Pivotal. All rights reserved. 19© Copyright 2013 Pivotal. All rights reserved. 19© Copyright 2013 Pivotal. All rights reserved. 19© Copyright 2014 Pivotal. All rights reserved.
MADlib
@ianhuston
20© Copyright 2014 Pivotal. All rights reserved.
MADlib: The Origin
•  First mention of MAD analytics was at VLDB 2009
MAD Skills: New Analysis Practices for Big Data
J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton
(with help from: Noelle Sio, David Hubbard, James Marca)
http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf
•  MADlib project initiated in late 2010:
Greenplum Analytics team and Prof. Joe Hellerstein
UrbanDictionary
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills
•  Open Source!
https://github.com/madlib/madlib
•  Works on Greenplum DB,
PostgreSQL and also HAWQ &
Impala
•  Active development by Pivotal
-  Latest Release: v1.5 (Apr 2014)
•  Downloads and Docs:
http://madlib.net/
@ianhuston
21© Copyright 2014 Pivotal. All rights reserved.
MADlib In-Database
Functions
Predictive Modeling Library
Linear Systems
•  Sparse and Dense Solvers
Matrix Factorization
•  Single Value Decomposition (SVD)
•  Low-Rank
Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards
•  Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber white,
clustered, marginal effects)
Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market
Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation
Descriptive Statistics
Sketch-based Estimators
•  CountMin (Cormode-
Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent
Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions
@ianhuston
22© Copyright 2014 Pivotal. All rights reserved.
Get in touch
Feel free to contact me about PL/Python & PL/R, or more
generally about Data Science and opportunities available.
@ianhuston
ihuston @ gopivotal.com
http://www.ianhuston.net
BUILT FOR THE SPEED OF BUSINESS

More Related Content

What's hot

Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data ScientistsDataWorks Summit
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pigdaijy
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Simplilearn
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
A short tutorial on r
A short tutorial on rA short tutorial on r
A short tutorial on rAshraf Uddin
 
Python and HDF5: Overview
Python and HDF5: OverviewPython and HDF5: Overview
Python and HDF5: Overviewandrewcollette
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsEdwin de Jonge
 
Python and R for quantitative finance
Python and R for quantitative financePython and R for quantitative finance
Python and R for quantitative financeLuca Sbardella
 
Twinkle: A SPARQL Query Tool
Twinkle: A SPARQL Query ToolTwinkle: A SPARQL Query Tool
Twinkle: A SPARQL Query ToolLeigh Dodds
 
MapReduce and Its Discontents
MapReduce and Its DiscontentsMapReduce and Its Discontents
MapReduce and Its DiscontentsDean Wampler
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zingzingopen
 
A Map of the PyData Stack
A Map of the PyData StackA Map of the PyData Stack
A Map of the PyData StackPeadar Coyle
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using RVictoria López
 
Linked Data and Services
Linked Data and ServicesLinked Data and Services
Linked Data and ServicesBarry Norton
 

What's hot (20)

Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
 
R tutorial
R tutorialR tutorial
R tutorial
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pig
 
inteSearch: An Intelligent Linked Data Information Access Framework
inteSearch: An Intelligent Linked Data Information Access FrameworkinteSearch: An Intelligent Linked Data Information Access Framework
inteSearch: An Intelligent Linked Data Information Access Framework
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
A short tutorial on r
A short tutorial on rA short tutorial on r
A short tutorial on r
 
Python and HDF5: Overview
Python and HDF5: OverviewPython and HDF5: Overview
Python and HDF5: Overview
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasets
 
Python and R for quantitative finance
Python and R for quantitative financePython and R for quantitative finance
Python and R for quantitative finance
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 
Twinkle: A SPARQL Query Tool
Twinkle: A SPARQL Query ToolTwinkle: A SPARQL Query Tool
Twinkle: A SPARQL Query Tool
 
MapReduce and Its Discontents
MapReduce and Its DiscontentsMapReduce and Its Discontents
MapReduce and Its Discontents
 
SPARQL Cheat Sheet
SPARQL Cheat SheetSPARQL Cheat Sheet
SPARQL Cheat Sheet
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
 
A Map of the PyData Stack
A Map of the PyData StackA Map of the PyData Stack
A Map of the PyData Stack
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using R
 
Linked Data and Services
Linked Data and ServicesLinked Data and Services
Linked Data and Services
 

Similar to Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonMassively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonPyData
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Ian Huston
 
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...PyData
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...Srivatsan Ramanujam
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Srivatsan Ramanujam
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014Edwin de Jonge
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsEsther Vasiete
 
Hadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality ChallengeHadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality ChallengeInside Analysis
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big AnalyticsAjay Ohri
 
Db tech show - hivemall
Db tech show - hivemallDb tech show - hivemall
Db tech show - hivemallMakoto Yui
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17Makoto Yui
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on HadoopSenturus
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...huguk
 

Similar to Data Science Amsterdam - Massively Parallel Processing with Procedural Languages (20)

Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonMassively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian Huston
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)
 
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
 
Hadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality ChallengeHadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality Challenge
 
Toolboxes for data scientists
Toolboxes for data scientistsToolboxes for data scientists
Toolboxes for data scientists
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
 
Db tech show - hivemall
Db tech show - hivemallDb tech show - hivemall
Db tech show - hivemall
 
biopython, doctest and makefiles
biopython, doctest and makefilesbiopython, doctest and makefiles
biopython, doctest and makefiles
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
R_L1-Aug-2022.pptx
R_L1-Aug-2022.pptxR_L1-Aug-2022.pptx
R_L1-Aug-2022.pptx
 
BigData_Krishna Kumar Sharma
BigData_Krishna Kumar SharmaBigData_Krishna Kumar Sharma
BigData_Krishna Kumar Sharma
 
Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on Hadoop
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
 

More from Ian Huston

CFSummit: Data Science on Cloud Foundry
CFSummit: Data Science on Cloud FoundryCFSummit: Data Science on Cloud Foundry
CFSummit: Data Science on Cloud FoundryIan Huston
 
Cloud Foundry for Data Science
Cloud Foundry for Data ScienceCloud Foundry for Data Science
Cloud Foundry for Data ScienceIan Huston
 
Python on Cloud Foundry
Python on Cloud FoundryPython on Cloud Foundry
Python on Cloud FoundryIan Huston
 
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...Ian Huston
 
Calculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
Calculating Non-adiabatic Pressure Perturbations during Multi-field InflationCalculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
Calculating Non-adiabatic Pressure Perturbations during Multi-field InflationIan Huston
 
Second Order Perturbations - National Astronomy Meeting 2011
Second Order Perturbations - National Astronomy Meeting 2011Second Order Perturbations - National Astronomy Meeting 2011
Second Order Perturbations - National Astronomy Meeting 2011Ian Huston
 
Second Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-rollSecond Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-rollIan Huston
 
Inflation as a solution to the problems of the Big Bang
Inflation as a solution to the problems of the Big BangInflation as a solution to the problems of the Big Bang
Inflation as a solution to the problems of the Big BangIan Huston
 
Cosmological Perturbations and Numerical Simulations
Cosmological Perturbations and Numerical SimulationsCosmological Perturbations and Numerical Simulations
Cosmological Perturbations and Numerical SimulationsIan Huston
 
Cosmo09 presentation
Cosmo09 presentationCosmo09 presentation
Cosmo09 presentationIan Huston
 

More from Ian Huston (10)

CFSummit: Data Science on Cloud Foundry
CFSummit: Data Science on Cloud FoundryCFSummit: Data Science on Cloud Foundry
CFSummit: Data Science on Cloud Foundry
 
Cloud Foundry for Data Science
Cloud Foundry for Data ScienceCloud Foundry for Data Science
Cloud Foundry for Data Science
 
Python on Cloud Foundry
Python on Cloud FoundryPython on Cloud Foundry
Python on Cloud Foundry
 
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
 
Calculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
Calculating Non-adiabatic Pressure Perturbations during Multi-field InflationCalculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
Calculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
 
Second Order Perturbations - National Astronomy Meeting 2011
Second Order Perturbations - National Astronomy Meeting 2011Second Order Perturbations - National Astronomy Meeting 2011
Second Order Perturbations - National Astronomy Meeting 2011
 
Second Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-rollSecond Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-roll
 
Inflation as a solution to the problems of the Big Bang
Inflation as a solution to the problems of the Big BangInflation as a solution to the problems of the Big Bang
Inflation as a solution to the problems of the Big Bang
 
Cosmological Perturbations and Numerical Simulations
Cosmological Perturbations and Numerical SimulationsCosmological Perturbations and Numerical Simulations
Cosmological Perturbations and Numerical Simulations
 
Cosmo09 presentation
Cosmo09 presentationCosmo09 presentation
Cosmo09 presentation
 

Recently uploaded

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 

Recently uploaded (20)

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

  • 1. BUILT FOR THE SPEED OF BUSINESS
  • 2. @ianhuston 2© Copyright 2014 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2014 Pivotal. All rights reserved. Massively Parallel Processing with Procedural Languages PL/R and PL/Python Ian Huston, @ianhuston Data Scientist, Pivotal
  • 3. @ianhuston 3© Copyright 2014 Pivotal. All rights reserved. Some Links for this talk Ÿ  Simple code examples: https://github.com/ihuston/plpython_examples Ÿ  IPython notebook rendered with nbviewer: http://tinyurl.com/ih-plpython Ÿ  More info (especially for PL/R): http://gopivotal.github.io/gp-r/
  • 4. @ianhuston 4© Copyright 2014 Pivotal. All rights reserved. The Pivotal Message Agile: data driven apps and rapid time to value Data Lake: store everything, analyze anything Enterprise PaaS: revolutionary software development and speed; build the right thing
  • 5. @ianhuston 5© Copyright 2014 Pivotal. All rights reserved. Open Source is Pivotal
  • 6. @ianhuston 6© Copyright 2014 Pivotal. All rights reserved. Introducing Pivotal Data Labs Our Charter: Pivotal Data Labs is Pivotal’s differentiated and highly opinionated data-centric service delivery organization. Our Goals: Expedite customer time-to-value and ROI, by driving business-aligned innovation and solutions assurance within Pivotal’s Data Fabric technologies. Drive customer adoption and autonomy across the full spectrum of Pivotal Data technologies through best-in-class data science and data engineering services, with a deep emphasis on knowledge transfer.
  • 7. @ianhuston 7© Copyright 2014 Pivotal. All rights reserved. 7© Copyright 2013 Pivotal. All rights reserved. 7© Copyright 2013 Pivotal. All rights reserved. 7© Copyright 2013 Pivotal. All rights reserved. 7© Copyright 2014 Pivotal. All rights reserved. Procedural Languages
  • 8. @ianhuston 8© Copyright 2014 Pivotal. All rights reserved. MPP Architectural Overview Think of it as multiple PostGreSQL servers Workers Master
  • 9. @ianhuston 9© Copyright 2014 Pivotal. All rights reserved. User-Defined Functions (UDFs) Ÿ  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions. Ÿ  Simple UDFs are SQL queries with calling arguments and return types. Definition: CREATE  FUNCTION  times2(INT)   RETURNS  INT   AS  $$          SELECT  2  *  $1   $$  LANGUAGE  sql;   Execution: SELECT  times2(1);    times2     -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐              2   (1  row)  
  • 10. @ianhuston 10© Copyright 2014 Pivotal. All rights reserved. Ÿ  The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster •  Data Parallelism: -  PL/X piggybacks on Greenplum’s MPP architecture •  Allows users to write Greenplum/ PostgreSQL functions in the R/Python/ Java, Perl, pgsql or C languages Standby Master … Master Host SQL Interconnect Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
  • 11. @ianhuston 11© Copyright 2014 Pivotal. All rights reserved. Data Parallelism Ÿ  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks. Ÿ  Examples: –  Measure the height of each student in a classroom (explicitly parallelizable by student) –  MapReduce –  Individual machine learning models per customer/region/etc.
  • 12. @ianhuston 12© Copyright 2014 Pivotal. All rights reserved. Intro to PL/Python Ÿ  Procedural languages need to be installed on each database used. Ÿ  Syntax is like normal Python function inside a SQL wrapper Ÿ  Need to specify return type CREATE  FUNCTION  pymax  (a  integer,  b  integer)      RETURNS  integer   AS  $$      if  a  >  b:          return  a      return  b   $$  LANGUAGE  plpythonu;   SQL wrapper SQL wrapper Normal Python
  • 13. @ianhuston 13© Copyright 2014 Pivotal. All rights reserved. Returning Results Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.) Ÿ  Composite types can be returned by creating a composite type in the database:   CREATE  TYPE  named_value  AS  (      name    text,      value    integer   );   Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table: CREATE  FUNCTION  make_pair  (name  text,  value  integer)      RETURNS  named_value   AS  $$      return  [  name,  value  ]      #  or  alternatively,  as  tuple:  return  (  name,  value  )      #  or  as  dict:  return  {  "name":  name,  "value":  value  }      #  or  as  an  object  with  attributes  .name  and  .value   $$  LANGUAGE  plpythonu;   Ÿ  For functions which return multiple rows, prefix “setof” before the return type
  • 14. @ianhuston 14© Copyright 2014 Pivotal. All rights reserved. Parallelisation Separate data into units using GROUP BY Aggregate input for function using array_agg SELECT name , plp.linregr(array_agg(x), array_agg(y)) FROM plp.test_data GROUP BY name ORDER BY name;
  • 15. @ianhuston 15© Copyright 2014 Pivotal. All rights reserved. Benefits of Procedural Languages Ÿ  Easy to bring your code to the data. Ÿ  When SQL falls short leverage your Python/R/Java/C experience quickly. Ÿ  Apply Python/R packages across terabytes of data with minimal overhead or additional requirements. Ÿ  Results are already in the database system, ready for further analysis or storage.
  • 16. @ianhuston 16© Copyright 2014 Pivotal. All rights reserved. Text Analytics for Churn Prediction Customer A major telecom company Business Problem Reducing churn through more accurate models Challenges Ÿ  Existing models only used structured features Ÿ  Call center memos had poor structure and had lots of typos Solution Ÿ  Built sentiment analysis models to predict churn and topic models to understand topics of conversation in call center memos Ÿ  Achieved 16% improvement in ROC curve for Churn Prediction
  • 17. @ianhuston 17© Copyright 2014 Pivotal. All rights reserved. Predicting Customs Clearance Intercepts Customer A major courier delivery services company Business Problem 7% of shipments into the US are intercepted by Customs based on package description. Challenge Ÿ  Data Volume: Previously not able to build models with rich features for billions of packages. Ÿ  Data Variety: Customer’s prior technology did not allow them to extract features from unstructured shipment documents. Solution •  Processed shipment description documents using PL/Python. Tokenized, stemmed, removed stop words, etc. •  Built features to summarize the shippers’ prior shipment and interception history •  Built a model which predicts the propensity of a package being intercepted (AUC =0.86) Intercepted Not Intercepted
  • 18. @ianhuston 18© Copyright 2014 Pivotal. All rights reserved. Network User Behavior Anomaly Detection Customer A major global financial service enterprise Business Problem Detect anomalous user behaviour in the global enterprise computer network Challenges 10 Billions events in 6 months; 15K+ network devices; No existing SIEM solutions can model user behavioral resource access baseline and enable anomaly detection in an adaptive and scalable architecture. Solution An innovative Graph Mining based algorithmic framework with advanced Machine Learning. Network topology and temporal behaviors are both modeled. (Patent pending). Implemented in MPP and PL/R, enabling parallel model training and behavior risk scoring. Successfully identified DLP violating anomalous users.
  • 19. @ianhuston 19© Copyright 2014 Pivotal. All rights reserved. 19© Copyright 2013 Pivotal. All rights reserved. 19© Copyright 2013 Pivotal. All rights reserved. 19© Copyright 2013 Pivotal. All rights reserved. 19© Copyright 2014 Pivotal. All rights reserved. MADlib
  • 20. @ianhuston 20© Copyright 2014 Pivotal. All rights reserved. MADlib: The Origin •  First mention of MAD analytics was at VLDB 2009 MAD Skills: New Analysis Practices for Big Data J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton (with help from: Noelle Sio, David Hubbard, James Marca) http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf •  MADlib project initiated in late 2010: Greenplum Analytics team and Prof. Joe Hellerstein UrbanDictionary mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills •  Open Source! https://github.com/madlib/madlib •  Works on Greenplum DB, PostgreSQL and also HAWQ & Impala •  Active development by Pivotal -  Latest Release: v1.5 (Apr 2014) •  Downloads and Docs: http://madlib.net/
  • 21. @ianhuston 21© Copyright 2014 Pivotal. All rights reserved. MADlib In-Database Functions Predictive Modeling Library Linear Systems •  Sparse and Dense Solvers Matrix Factorization •  Single Value Decomposition (SVD) •  Low-Rank Generalized Linear Models •  Linear Regression •  Logistic Regression •  Multinomial Logistic Regression •  Cox Proportional Hazards •  Regression •  Elastic Net Regularization •  Sandwich Estimators (Huber white, clustered, marginal effects) Machine Learning Algorithms •  Principal Component Analysis (PCA) •  Association Rules (Affinity Analysis, Market Basket) •  Topic Modeling (Parallel LDA) •  Decision Trees •  Ensemble Learners (Random Forests) •  Support Vector Machines •  Conditional Random Field (CRF) •  Clustering (K-means) •  Cross Validation Descriptive Statistics Sketch-based Estimators •  CountMin (Cormode- Muthukrishnan) •  FM (Flajolet-Martin) •  MFV (Most Frequent Values) Correlation Summary Support Modules Array Operations Sparse Vectors Random Sampling Probability Functions
  • 22. @ianhuston 22© Copyright 2014 Pivotal. All rights reserved. Get in touch Feel free to contact me about PL/Python & PL/R, or more generally about Data Science and opportunities available. @ianhuston ihuston @ gopivotal.com http://www.ianhuston.net
  • 23. BUILT FOR THE SPEED OF BUSINESS