SlideShare ist ein Scribd-Unternehmen logo
1 von 36
Downloaden Sie, um offline zu lesen
Generalized Linear Models
in Spark MLlib and SparkR
Xiangrui Meng
joint with Joseph Bradley, Eric Liang, Yanbo Liang (MiningLamp), DB Tsai (Netflix), et al.
2016/02/17 - Spark Summit East
About me
• Software Engineer at Databricks
• Spark PMC member and MLlib/PySpark maintainer
• Ph.D. from Stanford on randomized algorithms for large-
scale linear regression problems
2
Outline
• Generalized linear models (GLMs)
• linear regression / logistic regression / general form
• accelerated failure time (AFT) model for survival analysis
• intercept / regularization / weights
• GLMs in MLlib and SparkR
• demo: R formula in Spark
• Implementing GLMs
• gradient descent / L-BFGS / OWL-QN
• weighted least squares / iteratively re-weighted least squares (IRLS)
• performance tips
3
Generalized linear models
Linear regression
5
https://en.wikipedia.org/wiki/Linear_regression
inference / prediction
Linear least squares
• m observations:
• x: explanatory variables, y: dependent variable
• assumes linear relationship between x and y
• minimizes the sum of the squares of the errors
6
Linear least squares
• the oldest linear model, trace back to Gauss
• the simplest and the most studied linear model
• has analytic solutions
• easy to solve
• easy to inspect
• sensitive to outliers
7
Logistic regression
8
https://en.wikipedia.org/wiki/Logistic_regression
Logistic regression
• classification with binary response:
• true/false, clicked/not clicked, liked/disliked
• uses logistic function to indicate the likelihood
• maximizes the sum of the log-likelihoods, i.e.,
9
Logistic regression
• one of the simplest binary classification models
• widely used in industry
• relatively easy to solve
• easy to interpret
10
Multinomial logistic regression
• classification with multiclass response:
• uses softmax function to indicate likelihood
• maximizes the sum of log-likelihoods
11
Generalized linear models (GLMs)
• Both linear least squares and logistic regression are
special cases of generalized linear models.
• A GLM is specified by the following:
• a distribution of the response (from the exponential family),
• a link function g such that
• maximizes the sum of log-likelihoods
12
Distributions and link functions
13
Model Distribution Link
linear least squares normal identity
logistic regression binomial logit
multinomial logic regression multinomial generalized logit
Poisson regression Poisson log
gamma regression gamma inverse
Accelerated failure time (AFT) model
• m observations:
• y: survival time, c: censor variable (alive or dead)
• assumes the effect of an explanatory variable is to
accelerate or decelerate the life time by some constant
• uses maximum likelihood estimation while treating
censored and uncensored observations differently
14
AFT model for survival analysis
• one popular parametric model for survival analysis
• widely used for lifetime estimation and churn analysis
• could be solved under the same framework as GLMs
15
Intercept, regularization, and weights
In practice, a linear model is often more complex





where w describes instance weights, beta_0 is the
intercept term to adjust bias, and sigma regularized beta
with a constant lambda > 0 to avoid overfitting.
16
Types of regularization
• Ridge (L2):
• easy to solve (strongly convex)
• Lasso (L1):
• enhance model sparsity
• harder to solver (though still convex)
• Elastic-Net:
• Others: group lasso, nonconvex, etc
17
GLMs in MLlib and SparkR
GLMs in Spark MLlib
Linear models in MLlib are implemented as ML pipeline
estimators. They accept the following params:
• featuresCol: a vector column containing features (x)
• labelCol: a double column containing responses (y)
• weightCol: a double column containing weights (w)
• regType: regularization type, “none”, “l1”, “l2”, “elastic-net”
• regParam: regularization constant
• fitIntercept: whether to fit an intercept term
• …
19
Fit a linear model in MLlib
from pyspark.ml.classification import LogisticRegression
# Load training data
training = sqlContext.read.parquet(“path/to/training“)
lr = LogisticRegression(
weightCol=“weight”, fitIntercept=False, maxIter=10,
regParam=0.3, elasticNetParam=0.8)
# Fit the model
model = lr.fit(training)
20
Make predictions and evaluate models
from pyspark.ml.evaluation import BinaryClassificationEvaluator
test = sqlContext.read.parquet(“path/to/test“)
# make predictions by calling transform
predictions = model.tranform(test)
# create a binary classification evaluator
evaluator = BinaryClassificationEvaluator(
metricName=“areaUnderROC”)
evaulator.evaluate(predictions)
21
GLMs in SparkR
In Python/Scala/Java, we keep the APIs about the same
for consistency. But in SparkR, we make the APIs similar
to existing ones in R (or R packages).
# Create the DataFrame
df <- read.df(sqlContext, “path/to/training”)
# Fit a Gaussian GLM model
model <- glm(y ~ x1 + x2, data = df, family = "gaussian")
22
R formula in SparkR
23
• R provides model formula to express linear models.
• We support the following R formula operators in SparkR:
• `~` separate target and terms
• `+` concat terms, "+ 0" means removing intercept
• `-` remove a term, "- 1" means removing intercept
• `:` interaction (multiplication for numeric values, or binarized categorical
values)
• `.` all columns except target
• For example, “y ~ x + z + x:z -1” means using x, z, and their
interaction (x:z) to predict y without intercept (-1).
Demo: GLMs in Spark
… using Databricks Community Edition!
Implementing GLMs
Row-based distributed storage
26
w x y
w x y
partition 1
partition 2
Gradient descent methods
• Stochastic gradient descent (SGD):
• trade-offs on the merge scheme and convergence
• Mini-batch SGD:
• hard to sample mini-batches efficiently
• communication overhead on merging gradients
• Batch gradient descent:
• slow convergence
27
Quasi-Newton methods
• Newton’s method converges much than GD, but it
requires second-order information:
• L-BFGS works for smooth objectives. It approximates the
inverse Hessian using only first-order information.
• OWL-QN works for objectives with L1 regularization.
• MLlib calls L-BFGS/OWL-QN implemented in breeze.
28
Direct methods for linear least squares
• Linear least squares has an analytic solution:
• The solution could be computed directly or through QR
factorization, both of which are implemented in Spark.
• requires only a single pass
• efficient when the number of features is small (<4000)
• provides R-like model summary statistics
29
Iteratively re-weighted least squares (IRLS)
• Generalized linear models with exponential family can
be solved via iteratively re-weighted least squares (IRLS).
• linearizes the objective at the current solution
• solves the weighted linear least squares problem
• repeat above steps until convergence
• efficient when the number of features is small (<4000)
• provides R-like model summary statistics
• This is the implementation in R.
30
Verification using R
Besides normal tests, we also verify our implementation using R.
/*

df <- as.data.frame(cbind(A, b))

for (formula in c(b ~ . -1, b ~ .)) {

model <- lm(formula, data=df, weights=w)

print(as.vector(coef(model)))

}



[1] -3.727121 3.009983

[1] 18.08 6.08 -0.60

*/
val expected = Seq(Vectors.dense(0.0, -3.727121, 3.009983),

Vectors.dense(18.08, 6.08, -0.60))
31
Standardization
To match the result in both R and glmnet, the most
popular R package for GLMs, we provide options to
standardize features and labels before training:







where delta is the stddev of labels, and sigma_j is the
stddev of the j-th feature column.
32
Performance tips
• Utilize sparsity.
• Use tree aggregation and torrent broadcast.
• Watch numerical issues, e.g., log(1+exp(x)).
• Do not change input data. Scaling could be applied after
each iteration and intercept could be derived later.
33
Future directions
• easy handling of categorical features and labels
• better R formula support
• more model summary statistics
• feature parity in different languages
• model parallelism
• vector-free L-BFGS with 2D partitioning (WIP)
• using matrix kernels
34
Other GLM implementations on Spark
• CoCoA+: communication-efficient optimization
• LIBLINEAR for Spark: a Spark port of LIBLINEAR
• sparkGLM: an R-like GLM package for Spark
• TFOCS for Spark: first-order conic solvers for Spark
• General-purpose packages that implement GLMs
• aerosolve, DistML, sparkling-water, thunder, zen, etc
• … and more on Spark Packages
35
Thank you.
• MLlibuserguideandroadmapforSpark2.0
• GLMsonWikipedia
• DatabricksCommunityEdition,blogposts,andcareers

Weitere ähnliche Inhalte

Was ist angesagt?

Multinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkMultinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkDB Tsai
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with SparkBarak Gitsis
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Spark Summit
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...Databricks
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiDatabricks
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...Accumulo Summit
 
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Spark Summit
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalArvind Surve
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Spark Summit
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLSpark Summit
 
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...Spark Summit
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat DetectionJen Aman
 
Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Ram Sriharsha
 

Was ist angesagt? (20)

Multinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkMultinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache Spark
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with Spark
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...
 
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016
 

Andere mochten auch

Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Spark Summit
 
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu VatsBuilding a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu VatsSpark Summit
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Spark Summit
 
Not Your Father's Database by Vida Ha
Not Your Father's Database by Vida HaNot Your Father's Database by Vida Ha
Not Your Father's Database by Vida HaSpark Summit
 
Time Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy RyzaTime Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy RyzaSpark Summit
 
Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...
Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...
Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...Spark Summit
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkMila, Université de Montréal
 
Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...
Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...
Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...Spark Summit
 
2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache SparkDB Tsai
 
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...Spark Summit
 
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...Spark Summit
 
Computational Advertising in Yelp Local Ads
Computational Advertising in Yelp Local AdsComputational Advertising in Yelp Local Ads
Computational Advertising in Yelp Local Adssoupsranjan
 
Exploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog RetrievalExploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog RetrievalRunwei Qiang
 
(2016 07-19) providing click predictions in real-time at scale
(2016 07-19) providing click predictions in real-time at scale(2016 07-19) providing click predictions in real-time at scale
(2016 07-19) providing click predictions in real-time at scaleLawrence Evans
 
Data Scientist Workbench 入門
Data Scientist Workbench 入門Data Scientist Workbench 入門
Data Scientist Workbench 入門soh kaijima
 
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...Spark Summit
 
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganMulti Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganSpark Summit
 
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingSpark Summit
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 

Andere mochten auch (20)

Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
 
Training
TrainingTraining
Training
 
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu VatsBuilding a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
 
Not Your Father's Database by Vida Ha
Not Your Father's Database by Vida HaNot Your Father's Database by Vida Ha
Not Your Father's Database by Vida Ha
 
Time Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy RyzaTime Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy Ryza
 
Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...
Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...
Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using spark
 
Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...
Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...
Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...
 
2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark
 
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
 
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
 
Computational Advertising in Yelp Local Ads
Computational Advertising in Yelp Local AdsComputational Advertising in Yelp Local Ads
Computational Advertising in Yelp Local Ads
 
Exploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog RetrievalExploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog Retrieval
 
(2016 07-19) providing click predictions in real-time at scale
(2016 07-19) providing click predictions in real-time at scale(2016 07-19) providing click predictions in real-time at scale
(2016 07-19) providing click predictions in real-time at scale
 
Data Scientist Workbench 入門
Data Scientist Workbench 入門Data Scientist Workbench 入門
Data Scientist Workbench 入門
 
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...
 
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganMulti Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
 
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 

Ähnlich wie Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng

Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlibXiangrui Meng
 
Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...
Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...
Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...WithTheBest
 
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...
Forecasting Default Probabilities  in Emerging Markets and   Dynamical Regula...Forecasting Default Probabilities  in Emerging Markets and   Dynamical Regula...
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...SSA KPI
 
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...GlobalLogic Ukraine
 
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonGraph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonChristopher Conlan
 
PR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation LearningPR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation LearningSungchul Kim
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsMark Peng
 
# Can we trust ai. the dilemma of model adjustment
# Can we trust ai. the dilemma of model adjustment# Can we trust ai. the dilemma of model adjustment
# Can we trust ai. the dilemma of model adjustmentTerence Huang
 
H2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas NykodymH2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas NykodymSri Ambati
 
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...Jihun Yun
 
Lecture slides stats1.13.l23.air
Lecture slides stats1.13.l23.airLecture slides stats1.13.l23.air
Lecture slides stats1.13.l23.airatutor_te
 
Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...Dalei Li
 
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...Philip Goddard
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Sparkfelixcss
 

Ähnlich wie Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng (20)

Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlib
 
Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...
Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...
Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Bo...
 
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...
Forecasting Default Probabilities  in Emerging Markets and   Dynamical Regula...Forecasting Default Probabilities  in Emerging Markets and   Dynamical Regula...
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...
 
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
 
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonGraph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
 
Group Project
Group ProjectGroup Project
Group Project
 
PR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation LearningPR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation Learning
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle Competitions
 
# Can we trust ai. the dilemma of model adjustment
# Can we trust ai. the dilemma of model adjustment# Can we trust ai. the dilemma of model adjustment
# Can we trust ai. the dilemma of model adjustment
 
H2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas NykodymH2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas Nykodym
 
Module-2_ML.pdf
Module-2_ML.pdfModule-2_ML.pdf
Module-2_ML.pdf
 
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
 
Machine Learning - Supervised Learning
Machine Learning - Supervised LearningMachine Learning - Supervised Learning
Machine Learning - Supervised Learning
 
Machine Learning - Principles
Machine Learning - PrinciplesMachine Learning - Principles
Machine Learning - Principles
 
Lecture slides stats1.13.l23.air
Lecture slides stats1.13.l23.airLecture slides stats1.13.l23.air
Lecture slides stats1.13.l23.air
 
Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...
 
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Spark
 
Java 8
Java 8Java 8
Java 8
 
MatlabIntro (1).ppt
MatlabIntro (1).pptMatlabIntro (1).ppt
MatlabIntro (1).ppt
 

Mehr von Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Kürzlich hochgeladen

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 

Kürzlich hochgeladen (20)

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 

Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng

  • 1. Generalized Linear Models in Spark MLlib and SparkR Xiangrui Meng joint with Joseph Bradley, Eric Liang, Yanbo Liang (MiningLamp), DB Tsai (Netflix), et al. 2016/02/17 - Spark Summit East
  • 2. About me • Software Engineer at Databricks • Spark PMC member and MLlib/PySpark maintainer • Ph.D. from Stanford on randomized algorithms for large- scale linear regression problems 2
  • 3. Outline • Generalized linear models (GLMs) • linear regression / logistic regression / general form • accelerated failure time (AFT) model for survival analysis • intercept / regularization / weights • GLMs in MLlib and SparkR • demo: R formula in Spark • Implementing GLMs • gradient descent / L-BFGS / OWL-QN • weighted least squares / iteratively re-weighted least squares (IRLS) • performance tips 3
  • 6. Linear least squares • m observations: • x: explanatory variables, y: dependent variable • assumes linear relationship between x and y • minimizes the sum of the squares of the errors 6
  • 7. Linear least squares • the oldest linear model, trace back to Gauss • the simplest and the most studied linear model • has analytic solutions • easy to solve • easy to inspect • sensitive to outliers 7
  • 9. Logistic regression • classification with binary response: • true/false, clicked/not clicked, liked/disliked • uses logistic function to indicate the likelihood • maximizes the sum of the log-likelihoods, i.e., 9
  • 10. Logistic regression • one of the simplest binary classification models • widely used in industry • relatively easy to solve • easy to interpret 10
  • 11. Multinomial logistic regression • classification with multiclass response: • uses softmax function to indicate likelihood • maximizes the sum of log-likelihoods 11
  • 12. Generalized linear models (GLMs) • Both linear least squares and logistic regression are special cases of generalized linear models. • A GLM is specified by the following: • a distribution of the response (from the exponential family), • a link function g such that • maximizes the sum of log-likelihoods 12
  • 13. Distributions and link functions 13 Model Distribution Link linear least squares normal identity logistic regression binomial logit multinomial logic regression multinomial generalized logit Poisson regression Poisson log gamma regression gamma inverse
  • 14. Accelerated failure time (AFT) model • m observations: • y: survival time, c: censor variable (alive or dead) • assumes the effect of an explanatory variable is to accelerate or decelerate the life time by some constant • uses maximum likelihood estimation while treating censored and uncensored observations differently 14
  • 15. AFT model for survival analysis • one popular parametric model for survival analysis • widely used for lifetime estimation and churn analysis • could be solved under the same framework as GLMs 15
  • 16. Intercept, regularization, and weights In practice, a linear model is often more complex
 
 
 where w describes instance weights, beta_0 is the intercept term to adjust bias, and sigma regularized beta with a constant lambda > 0 to avoid overfitting. 16
  • 17. Types of regularization • Ridge (L2): • easy to solve (strongly convex) • Lasso (L1): • enhance model sparsity • harder to solver (though still convex) • Elastic-Net: • Others: group lasso, nonconvex, etc 17
  • 18. GLMs in MLlib and SparkR
  • 19. GLMs in Spark MLlib Linear models in MLlib are implemented as ML pipeline estimators. They accept the following params: • featuresCol: a vector column containing features (x) • labelCol: a double column containing responses (y) • weightCol: a double column containing weights (w) • regType: regularization type, “none”, “l1”, “l2”, “elastic-net” • regParam: regularization constant • fitIntercept: whether to fit an intercept term • … 19
  • 20. Fit a linear model in MLlib from pyspark.ml.classification import LogisticRegression # Load training data training = sqlContext.read.parquet(“path/to/training“) lr = LogisticRegression( weightCol=“weight”, fitIntercept=False, maxIter=10, regParam=0.3, elasticNetParam=0.8) # Fit the model model = lr.fit(training) 20
  • 21. Make predictions and evaluate models from pyspark.ml.evaluation import BinaryClassificationEvaluator test = sqlContext.read.parquet(“path/to/test“) # make predictions by calling transform predictions = model.tranform(test) # create a binary classification evaluator evaluator = BinaryClassificationEvaluator( metricName=“areaUnderROC”) evaulator.evaluate(predictions) 21
  • 22. GLMs in SparkR In Python/Scala/Java, we keep the APIs about the same for consistency. But in SparkR, we make the APIs similar to existing ones in R (or R packages). # Create the DataFrame df <- read.df(sqlContext, “path/to/training”) # Fit a Gaussian GLM model model <- glm(y ~ x1 + x2, data = df, family = "gaussian") 22
  • 23. R formula in SparkR 23 • R provides model formula to express linear models. • We support the following R formula operators in SparkR: • `~` separate target and terms • `+` concat terms, "+ 0" means removing intercept • `-` remove a term, "- 1" means removing intercept • `:` interaction (multiplication for numeric values, or binarized categorical values) • `.` all columns except target • For example, “y ~ x + z + x:z -1” means using x, z, and their interaction (x:z) to predict y without intercept (-1).
  • 24. Demo: GLMs in Spark … using Databricks Community Edition!
  • 26. Row-based distributed storage 26 w x y w x y partition 1 partition 2
  • 27. Gradient descent methods • Stochastic gradient descent (SGD): • trade-offs on the merge scheme and convergence • Mini-batch SGD: • hard to sample mini-batches efficiently • communication overhead on merging gradients • Batch gradient descent: • slow convergence 27
  • 28. Quasi-Newton methods • Newton’s method converges much than GD, but it requires second-order information: • L-BFGS works for smooth objectives. It approximates the inverse Hessian using only first-order information. • OWL-QN works for objectives with L1 regularization. • MLlib calls L-BFGS/OWL-QN implemented in breeze. 28
  • 29. Direct methods for linear least squares • Linear least squares has an analytic solution: • The solution could be computed directly or through QR factorization, both of which are implemented in Spark. • requires only a single pass • efficient when the number of features is small (<4000) • provides R-like model summary statistics 29
  • 30. Iteratively re-weighted least squares (IRLS) • Generalized linear models with exponential family can be solved via iteratively re-weighted least squares (IRLS). • linearizes the objective at the current solution • solves the weighted linear least squares problem • repeat above steps until convergence • efficient when the number of features is small (<4000) • provides R-like model summary statistics • This is the implementation in R. 30
  • 31. Verification using R Besides normal tests, we also verify our implementation using R. /*
 df <- as.data.frame(cbind(A, b))
 for (formula in c(b ~ . -1, b ~ .)) {
 model <- lm(formula, data=df, weights=w)
 print(as.vector(coef(model)))
 }
 
 [1] -3.727121 3.009983
 [1] 18.08 6.08 -0.60
 */ val expected = Seq(Vectors.dense(0.0, -3.727121, 3.009983),
 Vectors.dense(18.08, 6.08, -0.60)) 31
  • 32. Standardization To match the result in both R and glmnet, the most popular R package for GLMs, we provide options to standardize features and labels before training:
 
 
 
 where delta is the stddev of labels, and sigma_j is the stddev of the j-th feature column. 32
  • 33. Performance tips • Utilize sparsity. • Use tree aggregation and torrent broadcast. • Watch numerical issues, e.g., log(1+exp(x)). • Do not change input data. Scaling could be applied after each iteration and intercept could be derived later. 33
  • 34. Future directions • easy handling of categorical features and labels • better R formula support • more model summary statistics • feature parity in different languages • model parallelism • vector-free L-BFGS with 2D partitioning (WIP) • using matrix kernels 34
  • 35. Other GLM implementations on Spark • CoCoA+: communication-efficient optimization • LIBLINEAR for Spark: a Spark port of LIBLINEAR • sparkGLM: an R-like GLM package for Spark • TFOCS for Spark: first-order conic solvers for Spark • General-purpose packages that implement GLMs • aerosolve, DistML, sparkling-water, thunder, zen, etc • … and more on Spark Packages 35
  • 36. Thank you. • MLlibuserguideandroadmapforSpark2.0 • GLMsonWikipedia • DatabricksCommunityEdition,blogposts,andcareers