2. Data Science as a Commodity:
How to use MADlib, R, and other Publicly
Available and Open Source Tools for Data
Science
Pivotal OSS Meetups
Sarah Aerni
Pivotal Senior Data Scientist
@itweetsarah
saerni@gopivotal.com
January 28, 2014
© Copyright 2014 Pivotal. All rights reserved.
3. What we will cover in today’s Meetup
• What is data science, big data, buzzword, buzzword?
• What are some examples of data science in action?
• What do I do at Pivotal? Who are our data scientists?
• Why is open source software important for data science?
• What do I do with loads of data?
• How can I create good models?
• What types of open source tools can I use to build models?
• How can I build a quick app?
• What can I do to get started analyzing text data?
• Which tools exist to create visualizations of my data that I can understand?
• What tools does our team use? For NLP? For optimization? For regression?
4. What we will not cover #notdatascience
5. Instead: Practical Data Science Tools #useful
– Kaushik Das
http://blog.gopivotal.com/p-o-v/the-eightfold-path-of-data-science
7. Instead: Practical Data Science Tools #useful
“At companies where there is no framework for operationalization of the models, PowerPoint is where models go to die!”
– Hulya Farinas
http://venturebeat.com/2013/12/03/how-to-revolutionize-healthcare-get-data-scientists-and-app-developers-together/
“The use of statistical and machine learning techniques on big multi-structured data — in a distributed computing environment — to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.”
– Annika Jimenez
http://blog.gopivotal.com/news-2/annika-jimenez-on-disruptive-data-science-at-the-strata-conference
8. DATA IS THE NEW CENTER OF GRAVITY
Data > Application!
“BIG DATA IS THE NEW NORMAL”
“‘BIG DATA’ BECOMES ‘DATA’ ONCE AGAIN”
9. What Can “Small Data” Scientists Bring on Their
“Big Data” Journey?
http://factspy.net/the-difference-between-geeks-vs-nerds/
10. What Can “Small Data” Scientists Bring on Their “Big Data” Journey?

Small Data
– Databases
– Flat files
– In-memory model building
– Command-line tools

Big Data
– MapReduce
– HDFS
– Cloud computing
– Distributed computing
– Command-line tools

Many tools and approaches are being adapted to big data technologies
13. Basic DS Tools: From Command-line to GUI
• Quick-and-dirty tricks using command-line tools
– Fast feedback (interactive)
– Fast to process
– Easy to write, hard to read
– Background processing (screen)
• Large volumes of data → automatically parallel environments (e.g. GPDB) may be faster
• Python and R
– RStudio
– IPython (IPython Notebook)
Ian Huston, Alex Kagoshima, Ronert Obst
14. Favorite Python and R packages and resources
Python
– NumPy
– SciPy
– scikit-learn (machine learning package)
– statsmodels
– pandas
– PyMC
– IPython (IPython Notebook)
– matplotlib
Ian Huston, Alex Kagoshima, Ronert Obst
15. Favorite Python and R packages, resources, and more
• R
– ggplot
– reshape
– plyr
– Shiny
– Good support for time series analyses
– RStudio (weave)
– foreach, parallel
– task views
– parboost
Ian Huston, Alex Kagoshima, Ronert Obst
16. What do I do at Pivotal?
A New Platform for a New Era: DATA-DRIVEN APPLICATION DEVELOPMENT
– App Fabric: “the new Middleware”
– Data Fabric: “the new Database”
– Cloud Fabric: “the new OS”
– ...etc.: “the new Hardware”
17. Pivotal Big Data Technology: HAWQ
Think of it as multiple PostgreSQL servers: a master plus segments/workers.
Rows are distributed across segments by a particular field (or randomly).
Download database version at http://www.gopivotal.com/products/pivotal-greenplum-database
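The row-distribution idea above can be sketched in a few lines: hash a distribution column and take it modulo the segment count, so every row lands deterministically on one segment. This is a simplified stdlib illustration of the concept, not HAWQ's actual placement logic:

```python
import hashlib

NUM_SEGMENTS = 4

def segment_for(value, num_segments=NUM_SEGMENTS):
    """Pick a segment by hashing the distribution-column value.
    A stable hash (not Python's built-in hash()) keeps placement
    deterministic across runs."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % num_segments

rows = [{"id": i, "price": 100 + i} for i in range(10)]

# Group rows by their target segment, as a distributed load would.
segments = {s: [] for s in range(NUM_SEGMENTS)}
for row in rows:
    segments[segment_for(row["id"])].append(row)

# Every row lands on exactly one segment.
assert sum(len(v) for v in segments.values()) == len(rows)
```

Because the hash is stable, a query that filters on the distribution column can be routed to a single segment, while full scans run on all segments in parallel.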
18. Performance Through Parallelism
• Automatic parallelization
– Load and query like any database
– Tables automatically distributed across nodes
• Analytics-oriented query optimization
• Scalable MPP architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
Download database version at http://www.gopivotal.com/products/pivotal-greenplum-database
19. Data Science Tools for Big Data
COMMERCIAL
OPEN SOURCE (OR FREE)
PL/R, PL/Python, PL/Java
23. Making sense of your “big data”
• Large volumes of data may be difficult to understand
– ~100 tables
– Tens of thousands of columns
• How do you build models that use all the data? Score all the data?
• Where do you focus your effort?
– Getting a rapid grasp of relevant fields is important
– Scanning lots of data is slow; creating models with huge numbers of features is possible, but it is generally better to understand your data first
– Look for columns with little or no variation, or only null values
• These functions exist in MADlib
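A quick way to spot the low-value columns described above (all null, or no variation) is to profile each column for null fraction and distinct values. This is a stdlib-only sketch of the idea; in practice MADlib performs this kind of summarization in-database over the full table:

```python
def profile_columns(rows):
    """Given rows as dicts, report per-column null fraction and
    distinct non-null values, flagging columns with little signal."""
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        distinct = set(non_null)
        report[col] = {
            "null_fraction": 1 - len(non_null) / len(values),
            "distinct": len(distinct),
            # All-null or constant columns add nothing to a model.
            "low_value": len(distinct) <= 1,
        }
    return report

rows = [
    {"tax": 1200, "notes": None, "region": "A"},
    {"tax": 1500, "notes": None, "region": "A"},
    {"tax": 900,  "notes": None, "region": "A"},
]
report = profile_columns(rows)
assert report["notes"]["low_value"] and report["region"]["low_value"]
assert not report["tax"]["low_value"]
```

Running such a profile before modeling narrows tens of thousands of columns down to the fields worth a data scientist's attention.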
24. MADlib In-Database Functions

Predictive Modeling Library
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank

Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
Linear Systems
• Sparse and Dense Solvers

Descriptive Statistics
Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary

Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
25. MADlib in Action: Regression on Billions of Rows
• Input Data
– 10s of millions of rows from data collected at multiple drill testing sites
– Sensor data for drills during operation, including rate of penetration, depth of penetration, weight on drill bit, and more
• Data Massaging and Review
– Rapid summarization of many columns of data to identify outliers and missing data and remove them from analysis
– Used window functions to construct a moving average (smoothing) of all the features and the dependent variable
• Model
– Linear regression on the complete dataset
– K-means clustering to determine similarities of sites
Rashmi Raghu
Drilling into the San Andreas Fault at Parkfield, California. Credit: Stephen H. Hickman, USGS
26. Linear Regression: Streaming Algorithm
• Finding linear dependencies between variables
• How to compute with a single scan?
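The single-scan trick is that ordinary least squares only needs a handful of running sums (sufficient statistics), not the raw rows. Here is a minimal sketch of the idea for one predictor; MADlib's implementation generalizes this to the matrix form XᵀX and Xᵀy shown on the next slide:

```python
def streaming_linreg(pairs):
    """Fit y = a + b*x in one pass by accumulating sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in pairs:          # single scan over the data
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    # Closed-form OLS from the accumulated statistics.
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Data on an exact line y = 2x + 1 is recovered exactly.
a, b = streaming_linreg([(x, 2 * x + 1) for x in range(100)])
assert abs(a - 1) < 1e-9 and abs(b - 2) < 1e-9
```

Since only the sums are kept, memory is constant regardless of row count, which is what makes the algorithm practical on billions of rows.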
29. Linear Regression: Parallel Computation
X1ᵀy1 (Segment 1) + X2ᵀy2 (Segment 2) = Xᵀy (Master)
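The additive structure above is easy to demonstrate: each segment computes its partial product over its own rows with no communication, and the master only adds the partials. A stdlib sketch of the idea, here for the Xᵀy vector with two features:

```python
def partial_xty(rows):
    """Compute one segment's contribution to X'y.
    Each row is ([x1, x2], y)."""
    acc = [0.0, 0.0]
    for x, y in rows:
        acc[0] += x[0] * y
        acc[1] += x[1] * y
    return acc

data = [([1.0, i], 3.0 * i) for i in range(10)]
seg1, seg2 = data[:5], data[5:]   # rows distributed across 2 segments

# The master sums the per-segment partials ...
p1, p2 = partial_xty(seg1), partial_xty(seg2)
combined = [p1[0] + p2[0], p1[1] + p2[1]]

# ... which equals the single-node computation over all rows.
assert combined == partial_xty(data)
```

The same decomposition applies to XᵀX, so the whole regression reduces to one parallel scan plus a small merge on the master.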
30. Performing a linear regression on 10 million rows in seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
31. Calling MADlib Functions: Fast Training, Scoring
• MADlib allows users to easily create models without moving data out of the system
– Model generation
– Model validation
– Scoring (evaluation of) new data
• All the data can be used in one model
• Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
• Open source lets you tweak and extend methods, or build your own

MADlib model function:
SELECT madlib.linregr_train( 'houses',          -- table containing training data
                             'houses_linregr',  -- table in which to save results
                             'price',           -- column containing dependent variable
                             'ARRAY[1, tax, bath, size]');  -- features included in the model
32. Calling MADlib Functions: Fast Training, Scoring (continued)

Adding a grouping column creates multiple output models, one for each value of bedroom:
SELECT madlib.linregr_train( 'houses',          -- table containing training data
                             'houses_linregr',  -- table in which to save results
                             'price',           -- column containing dependent variable
                             'ARRAY[1, tax, bath, size]',  -- features included in the model
                             'bedroom');        -- grouping column
33. Calling MADlib Functions: Fast Training, Scoring (continued)

SELECT madlib.linregr_train( 'houses',
                             'houses_linregr',
                             'price',
                             'ARRAY[1, tax, bath, size]');

MADlib model scoring function:
SELECT houses.*,                                -- table with data to be scored
       madlib.linregr_predict(ARRAY[1, tax, bath, size],
                              m.coef) AS predict
FROM houses, houses_linregr m;                  -- houses_linregr: table containing model
34. PivotalR: Bringing MADlib and HAWQ to a familiar R interface
• Challenge: want to harness the familiarity of R's interface and the performance & scalability benefits of in-DB analytics
• Simple solution: translate R code into SQL

PivotalR:
d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ tax + bath + size, data=d)

SQL code:
SELECT madlib.linregr_train( 'houses',
                             'houses_linregr',
                             'price',
                             'ARRAY[1, tax, bath, size]');

http://gopivotal.github.io/PivotalR/
Woo Jung
35. PivotalR: Bringing MADlib and HAWQ to a familiar R interface (continued)

PivotalR:
# Build a regression model with a different
# intercept term for each state
# (state=1 as baseline).
# Note that PivotalR supports automated
# indicator coding a la as.factor()!
d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ as.factor(state)
                                  + tax
                                  + bath
                                  + size,
                            data=d)

http://gopivotal.github.io/PivotalR/
Woo Jung
36. PivotalR Design Overview
• Call MADlib's in-DB machine learning functions directly from R
• Syntax is analogous to native R functions
• PivotalR translates R to SQL and sends it to the database via RPostgreSQL; no data lives on the R side, only SQL to execute goes out and computation results come back
• Data doesn't need to leave the database: all heavy lifting, including model estimation & computation, is done in the database (with MADlib)
http://gopivotal.github.io/PivotalR/
Woo Jung
37. PivotalR: Current Features
• MADlib functionality
– Linear Regression
– Logistic Regression
– Elastic Net
– ARIMA
– Marginal Effects
– Cross Validation
– Bagging
– summary on model objects
• Automated indicator variable coding: as.factor
• predict
• And more... (SQL wrapper)
– Operators: + - * / %% %/% ^ == != < <= > >= & | !
– dim, names, $, [, [[, $<-, [<-, [[<-, merge, sort, by, is.na
– c, mean, sum, sd, var, min, max, length, colMeans, colSums
– db.data.frame, as.db.data.frame, preview, content
– db.connect, db.disconnect, db.list, db.objects, db.existsObject, delete
http://gopivotal.github.io/PivotalR/
40. Shiny Showcase: Example Web Apps in R
• Users can choose input parameters with sliders, drop-downs, and text fields.
• HTML/JavaScript knowledge not required.
http://www.rstudio.com/shiny/
45. PyMADlib
• Python wrapper for MADlib
http://nbviewer.ipython.org/gist/vatsan/5275846
47. Procedural Languages in Big Data Science
• HAWQ & PL/X can take advantage of "data parallel" tasks by performing analyses in parallel (embarrassingly parallel tasks)
• Little or no effort is required to break up the problem into a number of parallel tasks, and there is no dependency (or communication) between those parallel tasks
• Examples of 'data parallel' problems:
– Counting words in documents
– Genome-Wide Association Study
– Studying network anomalies
http://gopivotal.github.io/gp-r/
[Diagram: SQL & R submitted to master servers; the network interconnect fans work out to segment servers, each processing its own documents (Doc1...DocM → Stem1...StemM → Count1...CountM)]
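Word counting, the first "data parallel" example above, shows the pattern in miniature: each worker counts its own share of documents with no communication, and the results merge by simple addition. A stdlib illustration of the pattern (not the PL/X machinery itself):

```python
from collections import Counter

def count_words(docs):
    """One worker's task: count words in its share of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

docs = ["big data is data", "data science tools", "big tools"]

# Split the corpus across two independent "segments".
part1, part2 = count_words(docs[:2]), count_words(docs[2:])

# Merging the partial counts gives the same answer as one big count.
merged = part1 + part2
assert merged == count_words(docs)
assert merged["data"] == 3
```

The merge step is associative and commutative, which is exactly the property that lets the database run the per-segment work in any order and combine results on the master.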
48. Structure of input table for PL/R function

Column            | Description
Network ID        | ID of the network. 300K in total.
Topology          | Array of integers defining the topology tree.
Terminal readings | Array of readings from network terminal points over (say) a week.

• Topology: hubs connected to multiple terminal points
• Using historical readings, solve a linear program to establish baseline behavior, for example number of shipments
• Detecting anomalies within sub-networks on future observations
[Diagram: hub 0 connected to terminal points A, B, C, D]
Vivek Ramamurthy
49. Performance Analysis

Number of networks | Time/network (ms) | Total time (seconds)
500                | 6.604             | 3.30
1,000              | 3.637             | 3.64
5,000              | 2.822             | 14.11
10,000             | 2.356             | 23.56
50,000             | 2.160             | 108.02
100,000            | 2.142             | 214.20
150,000            | 2.162             | 324.29
200,000            | 2.142             | 428.48
250,000            | 2.138             | 534.69
300,000            | 2.132             | 639.85

[Chart: execution time (seconds) vs. number of networks (in thousands)]
Vivek Ramamurthy
52. Performance Analysis

R package used              | optim     | quadprog | Rsymphony | Rglpk
Single network in R (time)  | ~60 s     | 6.3 s    | 0.145 s   | 0.181 s
300K networks in PL/R (time)| ~84 hrs   | 5.87 hrs | 10.7 min  | 14.6 min
Time per network in PL/R    | 1005.2 ms | 70.44 ms | 2.13 ms   | 2.92 ms

COIN-OR: Computational Infrastructure for Operations Research
http://www.coin-or.org/
– Libraries for linear and non-linear programming, integer programming
– SYMPHONY: callable library in COIN-OR for solving mixed integer linear programs
GLPK: GNU Linear Programming Kit
– Used for large-scale LPs, MIPs and related problems
http://www.gnu.org/software/glpk/
Vivek Ramamurthy
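The two performance tables are internally consistent: total time is per-network time multiplied by network count, and the Rsymphony per-network cost matches the earlier scaling table. A quick arithmetic check using figures taken from the slides:

```python
# From the scaling table: 300,000 networks at 2.132 ms each.
total_seconds = 300_000 * 2.132 / 1000
assert abs(total_seconds - 639.85) < 0.5   # table reports 639.85 s

# From the package table: Rsymphony at 2.13 ms per network
# gives roughly 10.7 minutes for 300K networks.
minutes = 300_000 * 2.13 / 1000 / 60
assert abs(minutes - 10.7) < 0.1
```

The near-constant per-network time from 50K networks onward is the linear-scalability claim from the earlier MPP slides showing up in practice.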
53. Natural language processing

Data sources:
– Text sources: documents, books, emails
– Speech: phone logs, conversations

NLP processing pipeline (common tasks/tools in NLP):
– Sentence detection
– Tokenization
– Morphological stemming
– Stop word removal
– Word-sense disambiguation
– Part-of-speech tagging
– Syntactic parsing
– Semantic role labeling
– Entity recognition
– Reference resolution
– Event processing

Applications:
– Word clouds
– Topic modeling
– Sentiment analysis
– Machine translation
– Document classification
– Document summarization
– Language generation
– Search
– Question answering
– Information extraction
– …
Niels Kasch
57. Open source tools for common NLP tasks

WORD CLOUDS
– Relevant NLP tools: tokenization; stemming/lemmatization; stop word removal
– Open source software: GPText, Apache UIMA, OpenNLP (Java), NLTK (Python), WordNet, PyTagCloud

TOPIC MODELING / TEXT CLASSIFICATION
– Relevant NLP tools: tokenization; stemming/lemmatization; stop word removal; language detection
– Open source software: MADlib (PLDA), gensim (LSA & LDA package for Python), https://code.google.com/p/language-detection/

INFORMATION EXTRACTION
– Relevant NLP tools: sentence detection; tokenization; language detection; syntactic parsing; entity extraction; relationship extraction
– Open source software: GPText and MADlib, OpenNLP, NLTK, Stanford CoreNLP (incl. POS tagger, NER, parser, etc.)
Niels Kasch
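The preprocessing steps that recur in every row above (tokenization, stop word removal, crude stemming) can be sketched without any of the listed libraries; in practice NLTK, OpenNLP, or GPText do each step far more carefully. A toy stdlib version, where the stop word list and suffix rules are illustrative and not taken from any of those tools:

```python
import re

STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in", "are"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Naive suffix-stripping stemmer (illustrative only)."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

tokens = preprocess("The drills are drilling the data of the sensors")
assert tokens == ["drill", "drill", "data", "sensor"]
```

Note how "drills" and "drilling" collapse to the same stem, which is the point of stemming before building word clouds or topic models: surface variants of a word should count as one term.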
59. Topic Analysis – MADlib pLDA

Natural Language Processing – GPText:
Social Media → Filter relevant content → Align Data → Tokenizer → Stemming, frequency filtering → Prepare dataset for Topic Modeling

MADlib Topic Model → Topic Graph, Topic composition, Topic Clouds
Srivatsan Ramanujam
60. Is there more? What’s next?
blog.gopivotal.com/tag/data-science
blog.gopivotal.com/tag/data-science-tech