SlideShare ist ein Scribd-Unternehmen logo
1 von 35
Downloaden Sie, um offline zu lesen
BUILT FOR THE SPEED OF BUSINESS
@ianhuston
2© Copyright 2014 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2014 Pivotal. All rights reserved.
Massively Parallel Processing
with Procedural Python
How do we use the PyData stack in data science
engagements at Pivotal?
Ian Huston, @ianhuston
Data Scientist, Pivotal
@ianhuston
3© Copyright 2014 Pivotal. All rights reserved.
Some Links for this talk
Ÿ  Simple code examples:
https://github.com/ihuston/plpython_examples
Ÿ  IPython notebook rendered with nbviewer:
http://tinyurl.com/ih-plpython
Ÿ  More info (written for PL/R but applies to PL/Python):
http://gopivotal.github.io/gp-r/
Ÿ  Traffic Disruption demo (if we have time)
http://ds-demo-transport.cfapps.io
@ianhuston
4© Copyright 2014 Pivotal. All rights reserved.
About Pivotal
Cloud Storage
Virtualization
Data &
Analytics
Platform
Cloud
Application
Platform
Data-Driven
Application
Development
Pivotal Data
Science Labs
@ianhuston
5© Copyright 2014 Pivotal. All rights reserved.
What do our customers look like?
Ÿ  Large enterprises with lots of data collected
–  Work with 10s of TBs to PBs of data, structured & unstructured
Ÿ  Not able to get what they want out of their data
–  Old Legacy systems with high cost and no flexibility
–  Response times are too slow for interactive data analysis
–  Can only deal with small samples of data locally
Ÿ  They want to transform into data driven enterprises
@ianhuston
6© Copyright 2014 Pivotal. All rights reserved.
Open Source is Pivotal
@ianhuston
7© Copyright 2014 Pivotal. All rights reserved.
Pivotal’s Open Source Contributions
Lots more interesting small projects:
•  PyMADlib – Python Wrapper for MADlib
https://github.com/gopivotal/pymadlib
•  PivotalR – R wrapper for MADlib
http://github.com/madlib-internal/PivotalR
•  Part-of-speech tagger for Twitter via SQL
http://vatsan.github.io/gp-ark-tweet-nlp/
•  Pandas via psql
(interactive PostgreSQL terminal)
https://github.com/vatsan/pandas_via_psql
@ianhuston
8© Copyright 2014 Pivotal. All rights reserved.
Typical Engagement Tech Setup
Ÿ  Platform:
–  Greenplum Analytics Database (GPDB)
–  Pivotal HD Hadoop Distribution + HAWQ (SQL DB on Hadoop)
Ÿ  Open Source Options (http://gopivotal.com):
–  Greenplum Community Edition
–  Pivotal HD Community Edition (HAWQ not included)
–  MADlib in-database machine learning library (http://madlib.net)
Ÿ  Where Python fits in:
–  PL/Python running in-database, with nltk, scikit-learn etc
–  IPython for exploratory analysis
–  Pandas, Matplotlib etc.
@ianhuston
9© Copyright 2014 Pivotal. All rights reserved.
1 Find Data
Platforms
•  Greenplum DB
•  Pivotal HD
•  Hadoop (other)
•  SAS HPA
•  AWS
2 Write Code
Editing Tools
•  Vi/Vim
•  Emacs
•  Smultron
•  TextWrangler
•  Eclipse
•  Notepad++
•  IPython
•  Sublime
Languages
•  SQL
•  Bash scripting
•  C
•  C++
•  C#
•  Java
•  Python
•  R
3 Run Code
Interfaces
•  pgAdminIII
•  psql
•  psycopg2
•  Terminal
•  Cygwin
•  Putty
•  Winscp
4 Write Code for Big Data
In-Database
•  SQL
•  PL/Python
•  PL/Java
•  PL/R
•  PL/pgSQL
Hadoop
•  HAWQ
•  Pig
•  Hive
•  Java
5 Implement Algorithms
Libraries
•  MADlib
Java
•  Mahout
R
•  (Too many to list!)
Text
•  OpenNLP
•  NLTK
•  GPText
C++
•  opencv
Python
•  NumPy
•  SciPy
•  scikit-learn
•  Pandas
Programs
•  Alpine Miner
•  Rstudio
•  MATLAB
•  SAS
•  Stata
6 Show Results
Visualization
•  python-matplotlib
•  python-networkx
•  D3.js
•  Tableau
•  GraphViz
•  Gephi
•  R (ggplot2, lattice,
shiny)
•  Excel
7 Collaborate
Sharing Tools
•  Chorus
•  Confluence
•  Socialcast
•  Github
•  Google Drive &
Hangouts
PIVOTAL DATA SCIENCE
TOOLKIT
A large and
varied tool box!
@ianhuston
10© Copyright 2014 Pivotal. All rights reserved. 10© Copyright 2013 Pivotal. All rights reserved. 10© Copyright 2013 Pivotal. All rights reserved. 10© Copyright 2013 Pivotal. All rights reserved. 10© Copyright 2014 Pivotal. All rights reserved.
PL/Python
@ianhuston
11© Copyright 2014 Pivotal. All rights reserved.
MPP Architectural Overview
Think of it as multiple
PostGreSQL servers
Workers
Master
@ianhuston
12© Copyright 2014 Pivotal. All rights reserved.
Data Parallelism
Ÿ  Little or no effort is required to break up the problem into a
number of parallel tasks, and there exists no dependency (or
communication) between those parallel tasks.
Ÿ  Examples:
–  Measure the height of each student in a classroom (explicitly
parallelizable by student)
–  MapReduce
–  map() function in Python
@ianhuston
13© Copyright 2014 Pivotal. All rights reserved.
User-Defined Functions (UDFs)
Ÿ  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.
Ÿ  Simple UDFs are SQL queries with calling arguments and return types.
Definition:
CREATE	
  FUNCTION	
  times2(INT)	
  
RETURNS	
  INT	
  
AS	
  $$	
  
	
  	
  	
  	
  SELECT	
  2	
  *	
  $1	
  
$$	
  LANGUAGE	
  sql;	
  
Execution:
SELECT	
  times2(1);	
  
	
  times2	
  	
  
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐	
  
	
  	
  	
  	
  	
  	
  2	
  
(1	
  row)	
  
@ianhuston
14© Copyright 2014 Pivotal. All rights reserved.
Ÿ  The interpreter/VM of the language ‘X’ is
installed on each node of the Greenplum
Database Cluster
•  Data Parallelism:
-  PL/X piggybacks on
Greenplum’s MPP architecture
•  Allows users to write Greenplum/
PostgreSQL functions in the R/Python/
Java, Perl, pgsql or C languages Standby
Master
…
Master
Host
SQL
Interconnect
Segment Host
Segment
Segment
Segment Host
Segment
Segment
Segment Host
Segment
Segment
Segment Host
Segment
Segment
PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
@ianhuston
15© Copyright 2014 Pivotal. All rights reserved.
Intro to PL/Python
Ÿ  Procedural languages need to be installed on each database used.
Ÿ  Name in SQL is plpythonu, ‘u’ means untrusted so need to be superuser to install.
Ÿ  Syntax is like normal Python function with function definition line replaced by SQL wrapper.
Alternatively like a SQL User Defined Function with Python inside.
CREATE	
  FUNCTION	
  pymax	
  (a	
  integer,	
  b	
  integer)	
  
	
  	
  RETURNS	
  integer	
  
AS	
  $$	
  
	
  	
  if	
  a	
  >	
  b:	
  
	
  	
  	
  	
  return	
  a	
  
	
  	
  return	
  b	
  
$$	
  LANGUAGE	
  plpythonu;	
  
SQL wrapper
SQL wrapper
Normal Python
@ianhuston
16© Copyright 2014 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyright 2014 Pivotal. All rights reserved.
Examples
@ianhuston
17© Copyright 2014 Pivotal. All rights reserved.
Returning Results
Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
Ÿ  Composite types can be returned by creating a composite type in the database:	
  
CREATE	
  TYPE	
  named_value	
  AS	
  (	
  
	
  	
  name	
  	
  text,	
  
	
  	
  value	
  	
  integer	
  
);	
  
Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table:
CREATE	
  FUNCTION	
  make_pair	
  (name	
  text,	
  value	
  integer)	
  
	
  	
  RETURNS	
  named_value	
  
AS	
  $$	
  
	
  	
  return	
  [	
  name,	
  value	
  ]	
  
	
  	
  #	
  or	
  alternatively,	
  as	
  tuple:	
  return	
  (	
  name,	
  value	
  )	
  
	
  	
  #	
  or	
  as	
  dict:	
  return	
  {	
  "name":	
  name,	
  "value":	
  value	
  }	
  
	
  	
  #	
  or	
  as	
  an	
  object	
  with	
  attributes	
  .name	
  and	
  .value	
  
$$	
  LANGUAGE	
  plpythonu;	
  
Ÿ  For functions which return multiple rows, prefix “setof” before the return type
@ianhuston
18© Copyright 2014 Pivotal. All rights reserved.
Returning more results
You can return multiple results by wrapping them in a sequence (tuple, list or set),
an iterator or a generator:
CREATE	
  FUNCTION	
  make_pair	
  (name	
  text)	
  
	
  	
  RETURNS	
  SETOF	
  named_value	
  
AS	
  $$	
  
	
  	
  return	
  ([	
  name,	
  1	
  ],	
  [	
  name,	
  2	
  ],	
  [	
  name,	
  3])	
  	
  
$$	
  LANGUAGE	
  plpythonu;	
  
Sequence
Generator
CREATE	
  FUNCTION	
  make_pair	
  (name	
  text)	
  
	
  	
  RETURNS	
  SETOF	
  named_value	
  	
  AS	
  $$	
  
	
  	
  for	
  i	
  in	
  range(3):	
  
	
  	
  	
  	
  	
  	
  yield	
  (name,	
  i)	
  	
  
$$	
  LANGUAGE	
  plpythonu;	
  
@ianhuston
19© Copyright 2014 Pivotal. All rights reserved.
Accessing Packages
Ÿ  On Greenplum DB: To be available packages must be installed on the
individual segment nodes.
–  Can use “parallel ssh” tool gpssh to conda/pip install
–  Currently Greenplum DB ships with Python 2.6 (!)
Ÿ  Then just import as usual inside function:
	
  	
   CREATE	
  FUNCTION	
  make_pair	
  (name	
  text)	
  
	
  	
  RETURNS	
  named_value	
  
AS	
  $$	
  
	
  	
  import	
  numpy	
  as	
  np	
  
	
  	
  return	
  ((name,i)	
  for	
  i	
  in	
  np.arange(3))	
  
$$	
  LANGUAGE	
  plpythonu;	
  
@ianhuston
20© Copyright 2014 Pivotal. All rights reserved.
Benefits of PL/Python
Ÿ  Easy to bring your code to the data.
Ÿ  When SQL falls short leverage your Python (or R/Java/C)
experience quickly.
Ÿ  Apply Python across terabytes of data with minimal
overhead or additional requirements.
Ÿ  Results are already in the database system, ready for further
analysis or storage.
@ianhuston
21© Copyright 2014 Pivotal. All rights reserved. 21© Copyright 2013 Pivotal. All rights reserved. 21© Copyright 2013 Pivotal. All rights reserved. 21© Copyright 2013 Pivotal. All rights reserved. 21© Copyright 2014 Pivotal. All rights reserved.
MADlib
@ianhuston
22© Copyright 2014 Pivotal. All rights reserved.
Going Beyond Data Parallelism
Ÿ  Data Parallel computation via PL/Python libraries only allow
us to run ‘n’ models in parallel.
Ÿ  This works great when we are building one model for each
value of the group by column, but we need parallelized
algorithms to be able to build a single model on all the
available data
Ÿ  For this, we use MADlib – an open source library of parallel
in-database machine learning algorithms.
@ianhuston
23© Copyright 2014 Pivotal. All rights reserved.
MADlib: The Origin
•  First mention of MAD analytics was at VLDB 2009
MAD Skills: New Analysis Practices for Big Data
J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton
(with help from: Noelle Sio, David Hubbard, James Marca)
http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf
•  MADlib project initiated in late 2010:
Greenplum Analytics team and Prof. Joe Hellerstein
UrbanDictionary
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills
•  Open Source!
https://github.com/madlib/madlib
•  Works on Greenplum DB,
PostgreSQL and also HAWQ &
Impala
•  Active development by Pivotal
-  Latest Release: v1.4 (Nov 2013)
•  Downloads and Docs:
http://madlib.net/
@ianhuston
24© Copyright 2014 Pivotal. All rights reserved.
MADlib Executes Algorithms In-Place
MADlib Advantages
Ø  No Data Movement
Ø  Use MPP architecture’s
full compute power
Ø  Use MPP architecture’s
entire memory to
process data sets
SQL
M
SQL
M
SQL
M
Segment
Processors
Master
Processor
M
MADlib User
@ianhuston
25© Copyright 2014 Pivotal. All rights reserved.
MADlib In-Database
Functions
Predictive Modeling Library
Linear Systems
•  Sparse and Dense Solvers
Matrix Factorization
•  Single Value Decomposition (SVD)
•  Low-Rank
Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards
•  Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber white,
clustered, marginal effects)
Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market
Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation
Descriptive Statistics
Sketch-based Estimators
•  CountMin (Cormode-
Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent
Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions
@ianhuston
26© Copyright 2014 Pivotal. All rights reserved.
Architecture
RDBMS Query Processing
(Greenplum, PostgreSQL, …)
Low-level Abstraction Layer
(matrix operations, C++ to RDBMS
type bridge, …)
RDBMS
Built-in
Functions
User Interface
High-level Abstraction Layer
(iteration controller, ...)
Functions for Inner Loops
(for streaming algorithms)
“Driver” Functions
(outer loops of iterative algorithms, optimizer invocations)
Python
Python with
templated SQL
SQL, generated from
specification
C++
@ianhuston
27© Copyright 2014 Pivotal. All rights reserved.
How does it work ? : A Linear Regression Example
Ÿ  Finding linear dependencies between variables
–  y ≈ c0 + c1 · x1 + c2 · x2 ?
# select y, x1, x2 from unm limit 6;
y | x1 | x2
-------+------+-----
10.14 | 0 | 0.3
11.93 | 0.69 | 0.6
13.57 | 1.1 | 0.9
14.17 | 1.39 | 1.2
15.25 | 1.61 | 1.5
16.15 | 1.79 | 1.8
Design Matrix X
Vector of dependent
variables y
@ianhuston
28© Copyright 2014 Pivotal. All rights reserved.
Reminder: Linear-Regression Model
• 
•  If residuals i.i.d. Gaussians with standard deviation σ:
–  max likelihood ⇔ min sum of squared residuals
•  First-order conditions for the following quadratic objective (in c)
yield the minimizer
f(y | x) ∝ exp −
1
2σ2
· (y − xT
c)2
@ianhuston
29© Copyright 2014 Pivotal. All rights reserved.
Linear Regression: Streaming Algorithm
•  How to compute with a single table scan?
XT
X
XT
y
-1
XTX XTy
@ianhuston
30© Copyright 2014 Pivotal. All rights reserved.
Linear Regression: Parallel Computation
XT
y
XT
1 y1 XT
2 y2
Segment 1 Segment 2
XT
y
Master
XT
y = XT
1 XT
2
y1
y2
= XT
i yi
@ianhuston
31© Copyright 2014 Pivotal. All rights reserved.
Demos
Ÿ  We built demos to showcase our technology pipeline, using
Python technology.
Ÿ  Two use cases:
–  Topic and Sentiment Analysis of Tweets
–  London Road Traffic Disruption prediction
@ianhuston
32© Copyright 2014 Pivotal. All rights reserved.
Topic and Sentiment Analysis Pipeline
Stored on
HDFS
Tweet
Stream
(gpfdist)
Loaded as
external tables
into GPDB
Parallel Parsing of
JSON and extraction
of fields using PL/
Python
Topic Analysis through
MADlib pLDA
Sentiment Analysis
through custom
PL/Python functions
D3.js
@ianhuston
33© Copyright 2014 Pivotal. All rights reserved.
Transport for London
Traffic Disruption feed
Pivotal Greenplum
Database Deduplication Feature Creation
Modelling & Machine Learning
d3.js	
  &	
  NVD3	
  
Interactive SVG figures
Transport Disruption Prediction Pipeline
@ianhuston
34© Copyright 2014 Pivotal. All rights reserved.
Get in touch
Feel free to contact me about PL/Python, or more generally
about Data Science and opportunities available.
@ianhuston
ihuston @ gopivotal.com
http://www.ianhuston.net
BUILT FOR THE SPEED OF BUSINESS

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache PigJason Shao
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Yu Liu
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windowsMuhammad Shahid
 
Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Yahoo Developer Network
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Trainingstratapps
 
CHEP 2019: Recent developments in histogram libraries
CHEP 2019: Recent developments in histogram librariesCHEP 2019: Recent developments in histogram libraries
CHEP 2019: Recent developments in histogram librariesHenry Schreiner
 
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateApache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateYahoo Developer Network
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 
IRIS-HEP Retreat: Boost-Histogram Roadmap
IRIS-HEP Retreat: Boost-Histogram RoadmapIRIS-HEP Retreat: Boost-Histogram Roadmap
IRIS-HEP Retreat: Boost-Histogram RoadmapHenry Schreiner
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And HdfsCloudera, Inc.
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemRob Vesse
 
Map Reduce
Map ReduceMap Reduce
Map Reduceschapht
 
Sempala - Interactive SPARQL Query Processing on Hadoop
Sempala - Interactive SPARQL Query Processing on HadoopSempala - Interactive SPARQL Query Processing on Hadoop
Sempala - Interactive SPARQL Query Processing on HadoopAlexander Schätzle
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map ReduceApache Apex
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Alex Levenson
 

Was ist angesagt? (19)

Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 
Apache pig
Apache pigApache pig
Apache pig
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Training
 
CHEP 2019: Recent developments in histogram libraries
CHEP 2019: Recent developments in histogram librariesCHEP 2019: Recent developments in histogram libraries
CHEP 2019: Recent developments in histogram libraries
 
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateApache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
IRIS-HEP Retreat: Boost-Histogram Roadmap
IRIS-HEP Retreat: Boost-Histogram RoadmapIRIS-HEP Retreat: Boost-Histogram Roadmap
IRIS-HEP Retreat: Boost-Histogram Roadmap
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And Hdfs
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystem
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Sempala - Interactive SPARQL Query Processing on Hadoop
Sempala - Interactive SPARQL Query Processing on HadoopSempala - Interactive SPARQL Query Processing on Hadoop
Sempala - Interactive SPARQL Query Processing on Hadoop
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
 

Ähnlich wie Built for Speed of Business Analytics

Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Srivatsan Ramanujam
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Srivatsan Ramanujam
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014Edwin de Jonge
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...Srivatsan Ramanujam
 
Hadoop & Zing
Hadoop & ZingHadoop & Zing
Hadoop & ZingLong Dao
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zingzingopen
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonVitthal Gogate
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIYoni Davidson
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engineWalter Liu
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixJeff Magnusson
 
Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Hakka Labs
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームMasayuki Matsushita
 
Programming Under Linux In Python
Programming Under Linux In PythonProgramming Under Linux In Python
Programming Under Linux In PythonMarwan Osman
 
Extensions on PostgreSQL
Extensions on PostgreSQLExtensions on PostgreSQL
Extensions on PostgreSQLAlpaca
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On SparkJen Aman
 

Ähnlich wie Built for Speed of Business Analytics (20)

Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
 
Hadoop & Zing
Hadoop & ZingHadoop & Zing
Hadoop & Zing
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-API
 
biopython, doctest and makefiles
biopython, doctest and makefilesbiopython, doctest and makefiles
biopython, doctest and makefiles
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engine
 
Lipstick On Pig
Lipstick On Pig Lipstick On Pig
Lipstick On Pig
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
 
Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
 
Programming Under Linux In Python
Programming Under Linux In PythonProgramming Under Linux In Python
Programming Under Linux In Python
 
Extensions on PostgreSQL
Extensions on PostgreSQLExtensions on PostgreSQL
Extensions on PostgreSQL
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 

Mehr von PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshPyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiPyData
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...PyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...PyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...PyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydPyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverPyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldPyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 

Mehr von PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Kürzlich hochgeladen

Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Kürzlich hochgeladen (20)

Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Built for Speed of Business Analytics

  • 1. BUILT FOR THE SPEED OF BUSINESS
  • 2. @ianhuston 2© Copyright 2014 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2014 Pivotal. All rights reserved. Massively Parallel Processing with Procedural Python How do we use the PyData stack in data science engagements at Pivotal? Ian Huston, @ianhuston Data Scientist, Pivotal
  • 3. @ianhuston 3© Copyright 2014 Pivotal. All rights reserved. Some Links for this talk Ÿ  Simple code examples: https://github.com/ihuston/plpython_examples Ÿ  IPython notebook rendered with nbviewer: http://tinyurl.com/ih-plpython Ÿ  More info (written for PL/R but applies to PL/Python): http://gopivotal.github.io/gp-r/ Ÿ  Traffic Disruption demo (if we have time) http://ds-demo-transport.cfapps.io
  • 4. @ianhuston 4© Copyright 2014 Pivotal. All rights reserved. About Pivotal Cloud Storage Virtualization Data & Analytics Platform Cloud Application Platform Data-Driven Application Development Pivotal Data Science Labs
  • 5. @ianhuston 5© Copyright 2014 Pivotal. All rights reserved. What do our customers look like? Ÿ  Large enterprises with lots of data collected –  Work with 10s of TBs to PBs of data, structured & unstructured Ÿ  Not able to get what they want out of their data –  Old Legacy systems with high cost and no flexibility –  Response times are too slow for interactive data analysis –  Can only deal with small samples of data locally Ÿ  They want to transform into data driven enterprises
  • 6. @ianhuston 6© Copyright 2014 Pivotal. All rights reserved. Open Source is Pivotal
  • 7. @ianhuston 7© Copyright 2014 Pivotal. All rights reserved. Pivotal’s Open Source Contributions Lots more interesting small projects: •  PyMADlib – Python Wrapper for MADlib https://github.com/gopivotal/pymadlib •  PivotalR – R wrapper for MADlib http://github.com/madlib-internal/PivotalR •  Part-of-speech tagger for Twitter via SQL http://vatsan.github.io/gp-ark-tweet-nlp/ •  Pandas via psql (interactive PostgreSQL terminal) https://github.com/vatsan/pandas_via_psql
  • 8. @ianhuston 8© Copyright 2014 Pivotal. All rights reserved. Typical Engagement Tech Setup Ÿ  Platform: –  Greenplum Analytics Database (GPDB) –  Pivotal HD Hadoop Distribution + HAWQ (SQL DB on Hadoop) Ÿ  Open Source Options (http://gopivotal.com): –  Greenplum Community Edition –  Pivotal HD Community Edition (HAWQ not included) –  MADlib in-database machine learning library (http://madlib.net) Ÿ  Where Python fits in: –  PL/Python running in-database, with nltk, scikit-learn etc –  IPython for exploratory analysis –  Pandas, Matplotlib etc.
  • 9. @ianhuston 9© Copyright 2014 Pivotal. All rights reserved. 1 Find Data Platforms •  Greenplum DB •  Pivotal HD •  Hadoop (other) •  SAS HPA •  AWS 2 Write Code Editing Tools •  Vi/Vim •  Emacs •  Smultron •  TextWrangler •  Eclipse •  Notepad++ •  IPython •  Sublime Languages •  SQL •  Bash scripting •  C •  C++ •  C# •  Java •  Python •  R 3 Run Code Interfaces •  pgAdminIII •  psql •  psycopg2 •  Terminal •  Cygwin •  Putty •  Winscp 4 Write Code for Big Data In-Database •  SQL •  PL/Python •  PL/Java •  PL/R •  PL/pgSQL Hadoop •  HAWQ •  Pig •  Hive •  Java 5 Implement Algorithms Libraries •  MADlib Java •  Mahout R •  (Too many to list!) Text •  OpenNLP •  NLTK •  GPText C++ •  opencv Python •  NumPy •  SciPy •  scikit-learn •  Pandas Programs •  Alpine Miner •  Rstudio •  MATLAB •  SAS •  Stata 6 Show Results Visualization •  python-matplotlib •  python-networkx •  D3.js •  Tableau •  GraphViz •  Gephi •  R (ggplot2, lattice, shiny) •  Excel 7 Collaborate Sharing Tools •  Chorus •  Confluence •  Socialcast •  Github •  Google Drive & Hangouts PIVOTAL DATA SCIENCE TOOLKIT A large and varied tool box!
  • 10. @ianhuston 10© Copyright 2014 Pivotal. All rights reserved. 10© Copyright 2013 Pivotal. All rights reserved. 10© Copyright 2013 Pivotal. All rights reserved. 10© Copyright 2013 Pivotal. All rights reserved. 10© Copyright 2014 Pivotal. All rights reserved. PL/Python
  • 11. @ianhuston 11© Copyright 2014 Pivotal. All rights reserved. MPP Architectural Overview Think of it as multiple PostGreSQL servers Workers Master
  • 12. @ianhuston 12© Copyright 2014 Pivotal. All rights reserved. Data Parallelism Ÿ  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks. Ÿ  Examples: –  Measure the height of each student in a classroom (explicitly parallelizable by student) –  MapReduce –  map() function in Python
  • 13. @ianhuston 13© Copyright 2014 Pivotal. All rights reserved. User-Defined Functions (UDFs) Ÿ  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions. Ÿ  Simple UDFs are SQL queries with calling arguments and return types. Definition: CREATE  FUNCTION  times2(INT)   RETURNS  INT   AS  $$          SELECT  2  *  $1   $$  LANGUAGE  sql;   Execution: SELECT  times2(1);    times2     -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐              2   (1  row)  
  • 14. @ianhuston 14© Copyright 2014 Pivotal. All rights reserved. Ÿ  The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster •  Data Parallelism: -  PL/X piggybacks on Greenplum’s MPP architecture •  Allows users to write Greenplum/ PostgreSQL functions in the R/Python/ Java, Perl, pgsql or C languages Standby Master … Master Host SQL Interconnect Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
  • 15. @ianhuston 15© Copyright 2014 Pivotal. All rights reserved. Intro to PL/Python Ÿ  Procedural languages need to be installed on each database used. Ÿ  Name in SQL is plpythonu, ‘u’ means untrusted so need to be superuser to install. Ÿ  Syntax is like normal Python function with function definition line replaced by SQL wrapper. Alternatively like a SQL User Defined Function with Python inside. CREATE  FUNCTION  pymax  (a  integer,  b  integer)      RETURNS  integer   AS  $$      if  a  >  b:          return  a      return  b   $$  LANGUAGE  plpythonu;   SQL wrapper SQL wrapper Normal Python
  • 16. @ianhuston 16© Copyright 2014 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyright 2014 Pivotal. All rights reserved. Examples
  • 17. @ianhuston 17© Copyright 2014 Pivotal. All rights reserved. Returning Results Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.) Ÿ  Composite types can be returned by creating a composite type in the database:   CREATE  TYPE  named_value  AS  (      name    text,      value    integer   );   Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table: CREATE  FUNCTION  make_pair  (name  text,  value  integer)      RETURNS  named_value   AS  $$      return  [  name,  value  ]      #  or  alternatively,  as  tuple:  return  (  name,  value  )      #  or  as  dict:  return  {  "name":  name,  "value":  value  }      #  or  as  an  object  with  attributes  .name  and  .value   $$  LANGUAGE  plpythonu;   Ÿ  For functions which return multiple rows, prefix “setof” before the return type
  • 18. @ianhuston 18© Copyright 2014 Pivotal. All rights reserved. Returning more results You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator: CREATE  FUNCTION  make_pair  (name  text)      RETURNS  SETOF  named_value   AS  $$      return  ([  name,  1  ],  [  name,  2  ],  [  name,  3])     $$  LANGUAGE  plpythonu;   Sequence Generator CREATE  FUNCTION  make_pair  (name  text)      RETURNS  SETOF  named_value    AS  $$      for  i  in  range(3):              yield  (name,  i)     $$  LANGUAGE  plpythonu;  
  • 19. @ianhuston 19© Copyright 2014 Pivotal. All rights reserved. Accessing Packages Ÿ  On Greenplum DB: To be available packages must be installed on the individual segment nodes. –  Can use “parallel ssh” tool gpssh to conda/pip install –  Currently Greenplum DB ships with Python 2.6 (!) Ÿ  Then just import as usual inside function:     CREATE  FUNCTION  make_pair  (name  text)      RETURNS  named_value   AS  $$      import  numpy  as  np      return  ((name,i)  for  i  in  np.arange(3))   $$  LANGUAGE  plpythonu;  
  • 20. @ianhuston 20© Copyright 2014 Pivotal. All rights reserved. Benefits of PL/Python Ÿ  Easy to bring your code to the data. Ÿ  When SQL falls short leverage your Python (or R/Java/C) experience quickly. Ÿ  Apply Python across terabytes of data with minimal overhead or additional requirements. Ÿ  Results are already in the database system, ready for further analysis or storage.
  • 21. @ianhuston 21© Copyright 2014 Pivotal. All rights reserved. 21© Copyright 2013 Pivotal. All rights reserved. 21© Copyright 2013 Pivotal. All rights reserved. 21© Copyright 2013 Pivotal. All rights reserved. 21© Copyright 2014 Pivotal. All rights reserved. MADlib
  • 22. @ianhuston 22© Copyright 2014 Pivotal. All rights reserved. Going Beyond Data Parallelism Ÿ  Data Parallel computation via PL/Python libraries only allow us to run ‘n’ models in parallel. Ÿ  This works great when we are building one model for each value of the group by column, but we need parallelized algorithms to be able to build a single model on all the available data Ÿ  For this, we use MADlib – an open source library of parallel in-database machine learning algorithms.
  • 23. @ianhuston 23© Copyright 2014 Pivotal. All rights reserved. MADlib: The Origin •  First mention of MAD analytics was at VLDB 2009 MAD Skills: New Analysis Practices for Big Data J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton (with help from: Noelle Sio, David Hubbard, James Marca) http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf •  MADlib project initiated in late 2010: Greenplum Analytics team and Prof. Joe Hellerstein UrbanDictionary mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills •  Open Source! https://github.com/madlib/madlib •  Works on Greenplum DB, PostgreSQL and also HAWQ & Impala •  Active development by Pivotal -  Latest Release: v1.4 (Nov 2013) •  Downloads and Docs: http://madlib.net/
  • 24. @ianhuston 24© Copyright 2014 Pivotal. All rights reserved. MADlib Executes Algorithms In-Place MADlib Advantages Ø  No Data Movement Ø  Use MPP architecture’s full compute power Ø  Use MPP architecture’s entire memory to process data sets SQL M SQL M SQL M Segment Processors Master Processor M MADlib User
  • 25. @ianhuston 25© Copyright 2014 Pivotal. All rights reserved. MADlib In-Database Functions Predictive Modeling Library Linear Systems •  Sparse and Dense Solvers Matrix Factorization •  Single Value Decomposition (SVD) •  Low-Rank Generalized Linear Models •  Linear Regression •  Logistic Regression •  Multinomial Logistic Regression •  Cox Proportional Hazards •  Regression •  Elastic Net Regularization •  Sandwich Estimators (Huber white, clustered, marginal effects) Machine Learning Algorithms •  Principal Component Analysis (PCA) •  Association Rules (Affinity Analysis, Market Basket) •  Topic Modeling (Parallel LDA) •  Decision Trees •  Ensemble Learners (Random Forests) •  Support Vector Machines •  Conditional Random Field (CRF) •  Clustering (K-means) •  Cross Validation Descriptive Statistics Sketch-based Estimators •  CountMin (Cormode- Muthukrishnan) •  FM (Flajolet-Martin) •  MFV (Most Frequent Values) Correlation Summary Support Modules Array Operations Sparse Vectors Random Sampling Probability Functions
  • 26. @ianhuston 26© Copyright 2014 Pivotal. All rights reserved. Architecture RDBMS Query Processing (Greenplum, PostgreSQL, …) Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …) RDBMS Built-in Functions User Interface High-level Abstraction Layer (iteration controller, ...) Functions for Inner Loops (for streaming algorithms) “Driver” Functions (outer loops of iterative algorithms, optimizer invocations) Python Python with templated SQL SQL, generated from specification C++
  • 27. @ianhuston 27© Copyright 2014 Pivotal. All rights reserved. How does it work ? : A Linear Regression Example Ÿ  Finding linear dependencies between variables –  y ≈ c0 + c1 · x1 + c2 · x2 ? # select y, x1, x2 from unm limit 6; y | x1 | x2 -------+------+----- 10.14 | 0 | 0.3 11.93 | 0.69 | 0.6 13.57 | 1.1 | 0.9 14.17 | 1.39 | 1.2 15.25 | 1.61 | 1.5 16.15 | 1.79 | 1.8 Design Matrix X Vector of dependent variables y
  • 28. @ianhuston 28© Copyright 2014 Pivotal. All rights reserved. Reminder: Linear-Regression Model •  •  If residuals i.i.d. Gaussians with standard deviation σ: –  max likelihood ⇔ min sum of squared residuals •  First-order conditions for the following quadratic objective (in c) yield the minimizer f(y | x) ∝ exp − 1 2σ2 · (y − xT c)2
  • 29. @ianhuston 29© Copyright 2014 Pivotal. All rights reserved. Linear Regression: Streaming Algorithm •  How to compute with a single table scan? XT X XT y -1 XTX XTy
  • 30. @ianhuston 30© Copyright 2014 Pivotal. All rights reserved. Linear Regression: Parallel Computation XT y XT 1 y1 XT 2 y2 Segment 1 Segment 2 XT y Master XT y = XT 1 XT 2 y1 y2 = XT i yi
  • 31. @ianhuston 31© Copyright 2014 Pivotal. All rights reserved. Demos Ÿ  We built demos to showcase our technology pipeline, using Python technology. Ÿ  Two use cases: –  Topic and Sentiment Analysis of Tweets –  London Road Traffic Disruption prediction
  • 32. @ianhuston 32© Copyright 2014 Pivotal. All rights reserved. Topic and Sentiment Analysis Pipeline Stored on HDFS Tweet Stream (gpfdist) Loaded as external tables into GPDB Parallel Parsing of JSON and extraction of fields using PL/ Python Topic Analysis through MADlib pLDA Sentiment Analysis through custom PL/Python functions D3.js
  • 33. @ianhuston 33© Copyright 2014 Pivotal. All rights reserved. Transport for London Traffic Disruption feed Pivotal Greenplum Database Deduplication Feature Creation Modelling & Machine Learning d3.js  &  NVD3   Interactive SVG figures Transport Disruption Prediction Pipeline
  • 34. @ianhuston 34© Copyright 2014 Pivotal. All rights reserved. Get in touch Feel free to contact me about PL/Python, or more generally about Data Science and opportunities available. @ianhuston ihuston @ gopivotal.com http://www.ianhuston.net
  • 35. BUILT FOR THE SPEED OF BUSINESS