Submit Search
Upload
Python Powered Data Science at Pivotal (PyData 2013)
•
16 likes
•
8,509 views
Srivatsan Ramanujam
Follow
How we do data science at Pivotal with some of the cool Python tools.
Read less
Read more
Technology
Report
Share
Report
Share
1 of 30
Download now
Download to read offline
Recommended
All thingspython@pivotal
All thingspython@pivotal
Srivatsan Ramanujam
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
Srivatsan Ramanujam
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
Srivatsan Ramanujam
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Srivatsan Ramanujam
Analyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity Futures
Srivatsan Ramanujam
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action
EMC
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)
Ian Huston
Pivotal OSS meetup - MADlib and PivotalR
Pivotal OSS meetup - MADlib and PivotalR
go-pivotal
Recommended
All thingspython@pivotal
All thingspython@pivotal
Srivatsan Ramanujam
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
Srivatsan Ramanujam
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
Srivatsan Ramanujam
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Srivatsan Ramanujam
Analyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity Futures
Srivatsan Ramanujam
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action
EMC
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)
Ian Huston
Pivotal OSS meetup - MADlib and PivotalR
Pivotal OSS meetup - MADlib and PivotalR
go-pivotal
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
PyData
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
PivotalOpenSourceHub
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig Latin
Pietro Michiardi
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
Jim Dowling
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
Satish Mohan
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
MLconf
Summary machine learning and model deployment
Summary machine learning and model deployment
Novita Sari
Introduction to Mahout
Introduction to Mahout
Ted Dunning
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
Jim Dowling
Machine Learning and Hadoop
Machine Learning and Hadoop
Josh Patterson
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks
Avery Ching
Machine Learning with Hadoop
Machine Learning with Hadoop
Sangchul Song
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
PyData
Pig Experience
Pig Experience
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
Lukas Vlcek
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
Jim Dowling
Map Reduce introduction
Map Reduce introduction
Muralidharan Deenathayalan
The MADlib Analytics Library
The MADlib Analytics Library
EMC
A sql implementation on the map reduce framework
A sql implementation on the map reduce framework
eldariof
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
Victor Sanchez Anguix
Climate Data Lake: Empowering Citizen Scientists in Acadia National Park
Climate Data Lake: Empowering Citizen Scientists in Acadia National Park
Srivatsan Ramanujam
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
Vivian S. Zhang
More Related Content
What's hot
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
PyData
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
PivotalOpenSourceHub
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig Latin
Pietro Michiardi
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
Jim Dowling
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
Satish Mohan
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
MLconf
Summary machine learning and model deployment
Summary machine learning and model deployment
Novita Sari
Introduction to Mahout
Introduction to Mahout
Ted Dunning
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
Jim Dowling
Machine Learning and Hadoop
Machine Learning and Hadoop
Josh Patterson
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks
Avery Ching
Machine Learning with Hadoop
Machine Learning with Hadoop
Sangchul Song
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
PyData
Pig Experience
Pig Experience
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
Lukas Vlcek
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
Jim Dowling
Map Reduce introduction
Map Reduce introduction
Muralidharan Deenathayalan
The MADlib Analytics Library
The MADlib Analytics Library
EMC
A sql implementation on the map reduce framework
A sql implementation on the map reduce framework
eldariof
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
Victor Sanchez Anguix
What's hot
(20)
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig Latin
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Summary machine learning and model deployment
Summary machine learning and model deployment
Introduction to Mahout
Introduction to Mahout
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
Machine Learning and Hadoop
Machine Learning and Hadoop
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks
Machine Learning with Hadoop
Machine Learning with Hadoop
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
Pig Experience
Pig Experience
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
Map Reduce introduction
Map Reduce introduction
The MADlib Analytics Library
The MADlib Analytics Library
A sql implementation on the map reduce framework
A sql implementation on the map reduce framework
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
Viewers also liked
Climate Data Lake: Empowering Citizen Scientists in Acadia National Park
Climate Data Lake: Empowering Citizen Scientists in Acadia National Park
Srivatsan Ramanujam
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
Vivian S. Zhang
Germin8 - Social Media Analytics
Germin8 - Social Media Analytics
Germin8
Real-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
viirya
R server and spark
R server and spark
BAINIDA
microsoft r server for distributed computing
microsoft r server for distributed computing
BAINIDA
Greenplum Database Overview
Greenplum Database Overview
EMC
Viewers also liked
(7)
Climate Data Lake: Empowering Citizen Scientists in Acadia National Park
Climate Data Lake: Empowering Citizen Scientists in Acadia National Park
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
Germin8 - Social Media Analytics
Germin8 - Social Media Analytics
Real-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
R server and spark
R server and spark
microsoft r server for distributed computing
microsoft r server for distributed computing
Greenplum Database Overview
Greenplum Database Overview
Similar to Python Powered Data Science at Pivotal (PyData 2013)
Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian Huston
PyData
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
Ian Huston
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
Vitthal Gogate
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Viswanath Gangavaram
Introduction To Python
Introduction To Python
Vanessa Rene
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Dony Riyanto
Sql saturday pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
Wes Floyd
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
Wes McKinney
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
Sanket Shikhar
Unit V.pdf
Unit V.pdf
KennyPratheepKumar
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Masayuki Matsushita
Researh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-python
Waternomics
Researh toolbox - Data analysis with python
Researh toolbox - Data analysis with python
Umair ul Hassan
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystem
Rob Vesse
Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724
sdeeg
Scaling Data Science on Big Data
Scaling Data Science on Big Data
DataWorks Summit
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021
Mobcoder
06 pig-01-intro
06 pig-01-intro
Aasim Naveed
Toolboxes for data scientists
Toolboxes for data scientists
Sudipto Krishna Dutta
Similar to Python Powered Data Science at Pivotal (PyData 2013)
(20)
Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian Huston
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Introduction To Python
Introduction To Python
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Sql saturday pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
Unit V.pdf
Unit V.pdf
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Researh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-python
Researh toolbox - Data analysis with python
Researh toolbox - Data analysis with python
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystem
Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724
Scaling Data Science on Big Data
Scaling Data Science on Big Data
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021
06 pig-01-intro
06 pig-01-intro
Toolboxes for data scientists
Toolboxes for data scientists
Recently uploaded
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
Andrey Devyatkin
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Principled Technologies
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
apidays
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
Khem
presentation ICT roal in 21st century education
presentation ICT roal in 21st century education
jfdjdjcjdnsjd
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
Safe Software
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
The Digital Insurer
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
Martijn de Jong
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
The Digital Insurer
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Roshan Dwivedi
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Juan lago vázquez
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
The Digital Insurer
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
Anna Loughnan Colquhoun
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
ThousandEyes
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
Remote DBA Services
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
lior mazor
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
debabhi2
Recently uploaded
(20)
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
presentation ICT roal in 21st century education
presentation ICT roal in 21st century education
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
Python Powered Data Science at Pivotal (PyData 2013)
1.
BUILT FOR THE
SPEED OF BUSINESS
2.
Python Powered Data Science
at Pivotal How do we use the PyData stack in real engagements? Srivatsan Ramanujam, @being_bayesian Ian Huston, @ianhuston Data Scientists, Pivotal © Copyright 2013 Pivotal. All rights reserved. 2
3.
Agenda Ÿ Introduction to
Pivotal Ÿ Pivotal Data Science Toolkit Ÿ PL/Python Ÿ In-database Machine Learning with MADlib Ÿ Live Demos – Topic Analysis and Sentiment Analysis Engine – Traffic Disruption Prediction © Copyright 2013 Pivotal. All rights reserved. 3
4.
Pivotal Platform Stack Data-Driven Application Development Pivotal
Data Science Labs Cloud Application Platform Data & Analytics Platform Virtualization Cloud Storage © Copyright 2013 Pivotal. All rights reserved. 4
5.
What do our
customers look like? Ÿ Large enterprises with lots of data collected – Work with 10s of TBs to PBs of data, structured & unstructured Ÿ Not able to get what they want out of their data – Old Legacy systems with high cost and no flexibility – Response times are too slow for interactive data analysis – Can only deal with small samples of data locally Ÿ They want to transform into data driven enterprises © Copyright 2013 Pivotal. All rights reserved. 5
6.
MPP Architectural Overview Think
of it as multiple PostGreSQL servers Master Segments/Workers Rows are distributed across segments by a particular field (or randomly) © Copyright 2013 Pivotal. All rights reserved. 6
7.
Typical Engagement Tech
Setup Ÿ Platform: – Greenplum Analytics Database (GPDB) – Pivotal HD Hadoop Distribution + HAWQ (SQL DB on Hadoop) Ÿ Open Source Options (http://gopivotal.com): – Greenplum Community Edition – Pivotal HD Community Edition (HAWQ not included) – MADlib in-database machine learning library (http://madlib.net) Ÿ Where Python fits in: – PL/Python running in-database, with nltk, scikit-learn etc – IPython for exploratory analysis – Pandas, Matplotlib etc. © Copyright 2013 Pivotal. All rights reserved. 7
8.
PIVOTAL DATA SCIENCE TOOLKIT 1 Find
Data Platforms • Greenplum DB • Pivotal HD • Hadoop (other) • SAS HPA • AWS 2 3 Run Code Interfaces • pgAdminIII • psql • psycopg2 • Terminal • Cygwin • Putty • Winscp Write Code Editing Tools • Vi/Vim • Emacs • Smultron • TextWrangler • Eclipse • Notepad++ • IPython • Sublime Languages • SQL • Bash scripting • C • C++ • C# • Java • Python • R © Copyright 2013 Pivotal. All rights reserved. 4 Write Code for Big Data In-Database • SQL • PL/Python • PL/Java • PL/R • PL/pgSQL 5 Hadoop • HAWQ • Pig • Hive • Java 6 Visualization • python-matplotlib • python-networkx • D3.js • Tableau Implement Algorithms Libraries • MADlib Java • Mahout R • (Too many to list!) Text • OpenNLP • NLTK • GPText C++ • opencv Show Results Python • NumPy • SciPy • scikit-learn • Pandas Programs • Alpine Miner • Rstudio • MATLAB • SAS • Stata • GraphViz • Gephi • R (ggplot2, lattice, shiny) • Excel 7 Collaborate Sharing Tools • Chorus • Confluence • Socialcast • Github • Google Drive & Hangouts A large and varied tool box! 8
9.
Data Parallelism Ÿ Little
or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks. Ÿ Examples: – Measure the height of each student in a classroom (explicitly parallelizable by student) – MapReduce – map() function in Python © Copyright 2013 Pivotal. All rights reserved. 9
10.
User-Defined Functions (UDFs) Ÿ
PostgreSQL/Greenplum provide lots of flexibility in defining your own functions. Ÿ Simple UDFs are SQL queries with calling arguments and return types. Definition: Execution: CREATE FUNCTION times2(INT) RETURNS INT AS $$ SELECT 2 * $1 $$ LANGUAGE sql; SELECT times2(1); times2 -‐-‐-‐-‐-‐-‐-‐-‐ 2 (1 row) © Copyright 2013 Pivotal. All rights reserved. 10
11.
PL/X : X
in {pgsql, R, Python, Java, Perl, C etc.} • Allows users to write Greenplum/ PostgreSQL functions in the R/Python/ Java, Perl, pgsql or C languages SQL Master Host Ÿ The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster • Data Parallelism: - PL/X piggybacks on Greenplum’s MPP architecture © Copyright 2013 Pivotal. All rights reserved. Standby Master Interconnect Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment … 11
12.
Intro to PL/Python Ÿ
Procedural languages need to be installed on each database used. Ÿ Name in SQL is plpythonu, ‘u’ means untrusted so need to be superuser to install. Ÿ Syntax is like normal Python function with function definition line replaced by SQL wrapper. Alternatively like a SQL User Defined Function with Python inside. SQL wrapper Normal Python SQL wrapper © Copyright 2013 Pivotal. All rights reserved. CREATE FUNCTION pymax (a integer, b integer) RETURNS integer AS $$ if a > b: return a return b $$ LANGUAGE plpythonu; 12
13.
Returning Results Ÿ Postgres
primitive types (int, bigint, text, float8, double precision, date, NULL etc.) Ÿ Composite types can be returned by creating a composite type in the database: CREATE TYPE named_value AS ( name text, value integer ); Ÿ Then you can return a list, tuple or dict (not sets) which reference the same structure as the table: CREATE FUNCTION make_pair (name text, value integer) RETURNS named_value AS $$ return [ name, value ] # or alternatively, as tuple: return ( name, value ) # or as dict: return { "name": name, "value": value } # or as an object with attributes .name and .value $$ LANGUAGE plpythonu; Ÿ For functions which return multiple rows, prefix “setof” before the return type © Copyright 2013 Pivotal. All rights reserved. 13
14.
Returning more results You
can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator: Sequence Generator © Copyright 2013 Pivotal. All rights reserved. CREATE FUNCTION make_pair (name text) RETURNS SETOF named_value AS $$ return ([ name, 1 ], [ name, 2 ], [ name, 3]) $$ LANGUAGE plpythonu; CREATE FUNCTION make_pair (name text) RETURNS SETOF named_value AS $$ for i in range(3): yield (name, i) $$ LANGUAGE plpythonu; 14
15.
Accessing Packages Ÿ On
Greenplum DB: To be available packages must be installed on the individual segment nodes. – Can use “parallel ssh” tool gpssh to conda/pip install – Currently Greenplum DB ships with Python 2.6 (!) Ÿ Then just import as usual inside function: CREATE FUNCTION make_pair (name text) RETURNS named_value AS $$ import numpy as np return ((name,i) for i in np.arange(3)) $$ LANGUAGE plpythonu; © Copyright 2013 Pivotal. All rights reserved. 15
16.
Benefits of PL/Python Ÿ
Easy to bring your code to the data. Ÿ When SQL falls short leverage your Python (or R/Java/C) experience quickly. Ÿ Apply Python across terabytes of data with minimal overhead or additional requirements. Ÿ Results are already in the database system, ready for further analysis or storage. © Copyright 2013 Pivotal. All rights reserved. 16
17.
Going Beyond Data
Parallelism Ÿ Data Parallel computation via PL/Python libraries only allow us to run ‘n’ models in parallel. Ÿ This works great when we are building one model for each value of the group by column, but we need parallelized algorithms to be able to build a single model on all the available data Ÿ For this, we use MADlib – an open source library of parallel in-database machine learning algorithms. © Copyright 2013 Pivotal. All rights reserved. 17
18.
MADlib: The Origin UrbanDictionary mad
(adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills • First mention of MAD analytics was at VLDB 2009 MAD Skills: New Analysis Practices for Big Data J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton (with help from: Noelle Sio, David Hubbard, James Marca) http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf • MADlib project initiated in late 2010: Greenplum Analytics team and Prof. Joe Hellerstein © Copyright 2013 Pivotal. All rights reserved. • Open Source! https://github.com/madlib/madlib • • Works on Greenplum DB and PostgreSQL Active development by Pivotal - • Latest Release : v1.3 (Oct 2013) Downloads and Docs: http://madlib.net/ 18
19.
MADlib Executes Algorithms
In-Place MADlib User MADlib Advantages Master Processor Ø SQL SQL SQL M M M No Data Movement Ø M Use MPP architecture’s full compute power Ø Use MPP architecture’s entire memory to process data sets Segment Processors © Copyright 2013 Pivotal. All rights reserved. 19
20.
MADlib In-Database Functions Descriptive Statistics Predictive
Modeling Library Generalized Linear Models • Linear Regression • Logistic Regression • Multinomial Logistic Regression • Cox Proportional Hazards • Regression • Elastic Net Regularization • Sandwich Estimators (Huber white, clustered, marginal effects) Matrix Factorization • Single Value Decomposition (SVD) • Low-Rank © Copyright 2013 Pivotal. All rights reserved. Machine Learning Algorithms • Principal Component Analysis (PCA) • Association Rules (Affinity Analysis, Market Basket) • Topic Modeling (Parallel LDA) • Decision Trees • Ensemble Learners (Random Forests) • Support Vector Machines • Conditional Random Field (CRF) • Clustering (K-means) • Cross Validation Linear Systems • Sparse and Dense Solvers Sketch-based Estimators • CountMin (CormodeMuthukrishnan) • FM (Flajolet-Martin) • MFV (Most Frequent Values) Correlation Summary Support Modules Array Operations Sparse Vectors Random Sampling Probability Functions 20
21.
Architecture User Interface “Driver” Functions (outer
loops of iterative algorithms, optimizer invocations) High-level Abstraction Layer (iteration controller, ...) RDBMS Built-in Functions SQL, generated from specification Python with templated SQL Python Functions for Inner Loops (for streaming algorithms) Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …) C++ RDBMS Query Processing (Greenplum, PostgreSQL, …) © Copyright 2013 Pivotal. All rights reserved. 21
22.
How does it
work ? : A Linear Regression Example Ÿ Finding linear dependencies between variables – y ≈ c0 + c1 · x1 + c2 · x2 ? # select y, x1, x2 Vector of dependent variables y © Copyright 2013 Pivotal. All rights reserved. y | x1 | x2 -------+------+----10.14 | 0 | 0.3 11.93 | 0.69 | 0.6 13.57 | 1.1 | 0.9 14.17 | 1.39 | 1.2 15.25 | 1.61 | 1.5 16.15 | 1.79 | 1.8 from unm limit 6; Design Matrix X 22
23.
Reminder: Linear-Regression Model • •
If residuals i.i.d. Gaussians with standard deviation σ: – max likelihood ⇔ min sum of squared residuals f (y | x) ∝ exp − 1 · (y − xT c)2 2σ 2 • First-order conditions for the following quadratic objective (in c) yield the minimizer © Copyright 2013 Pivotal. All rights reserved. 23
24.
Linear Regression: Streaming
Algorithm • How to compute with a single table scan? -1 XT XT X XTX © Copyright 2013 Pivotal. All rights reserved. y XTy 24
25.
Linear Regression: Parallel
Computation XT y Segment 1 T X1 y1 Segment 2 T T X T y = X 1 X2 © Copyright 2013 Pivotal. All rights reserved. Master T X2 y2 y1 y2 = XTy T Xi y i 25
26.
Demos Ÿ We built
demos to showcase our technology pipeline, using Python technology. Ÿ Two use cases: – Topic and Sentiment Analysis of Tweets – London Road Traffic Disruption prediction © Copyright 2013 Pivotal. All rights reserved. 26
27.
Topic and Sentiment
Analysis Pipeline Tweet Stream D3.js Stored on HDFS Topic Analysis through MADlib pLDA (gpfdist) Loaded as external tables into GPDB © Copyright 2013 Pivotal. All rights reserved. Parallel Parsing of JSON and extraction of fields using PL/ Python Sentiment Analysis through custom PL/Python functions 27
28.
Transport Disruption Prediction
Pipeline Transport for London Traffic Disruption feed Pivotal Greenplum Database d3.js & NVD3 Interactive SVG figures © Copyright 2013 Pivotal. All rights reserved. Deduplication Feature Creation Modelling & Machine Learning 28
29.
Pivotal’s Open Source
Contributions http://gopivotal.com/pivotal-products/open-source-software • PyMADlib – Python Wrapper for MADlib - https://github.com/gopivotal/pymadlib • PivotalR – R wrapper for MADlib - https://github.com/madlib-internal/PivotalR • Part-of-speech tagger for Twitter via SQL - http://vatsan.github.io/gp-ark-tweet-nlp/ Questions? @being_bayesian @ianhuston © Copyright 2013 Pivotal. All rights reserved. 29
30.
BUILT FOR THE
SPEED OF BUSINESS
Download now