SlideShare ist ein Scribd-Unternehmen logo
1 von 33
Downloaden Sie, um offline zu lesen
About Me 
• Postdoc in AMPLab 
• Led initial development of MLlib 
• Technical Advisor for Databricks 
• Assistant Professor at UCLA 
• Research interests include scalability and ease-of- 
use issues in statistical machine learning
MLlib: Spark’s Machine 
Learning Library 
Ameet Talwalkar 
AMPCAMP 5 
November 20, 2014
History and Overview 
Example Applications 
Ongoing Development
MLbase and MLlib 
MLOpt 
MLI 
MLlib 
Apache Spark 
MLbase aims to 
simplify development 
and deployment of 
scalable ML pipelines 
MLlib: Spark’s core ML library 
MLI, Pipelines: APIs to simplify ML development 
• Tables, Matrices, Optimization, ML Pipelines 
Experimental 
Testbeds 
Production 
MLOpt: Declarative layer to automate hyperparameter tuning 
Code 
Pipelines
History of MLlib 
Initial Release 
• Developed by MLbase team in AMPLab (11 contributors) 
• Scala, Java 
• Shipped with Spark v0.8 (Sep 2013) 
15 months later… 
• 80+ contributors from various organization 
• Scala, Java, Python 
• Latest release part of Spark v1.1 (Sep 2014)
What’s in MLlib? 
• Alternating Least Squares 
• Lasso 
• Ridge Regression 
• Logistic Regression 
• Decision Trees 
• Naïve Bayes 
• Support Vector Machines 
• K-Means 
• Gradient descent 
• L-BFGS 
• Random data generation 
• Linear algebra 
• Feature transformations 
• Statistics: testing, correlation 
• Evaluation metrics 
Collaborative Filtering 
for Recommendation 
Prediction 
Clustering 
Optimization 
Many Utilities
Benefits of MLlib 
• Part of Spark 
• Integrated data analysis workflow 
• Free performance gains 
Apache Spark 
SparkSQL 
Spark 
Streaming MLlib GraphX
Benefits of MLlib 
• Part of Spark 
• Integrated data analysis workflow 
• Free performance gains 
• Scalable 
• Python, Scala, Java APIs 
• Broad coverage of applications & algorithms 
• Rapid improvements in speed & robustness
History and Overview 
Example Applications 
Use Cases 
Distributed ML Challenges 
Code Examples 
Ongoing Development
Clustering with K-Means 
Given data points Find meaningful clusters
Clustering with K-Means 
Choose cluster centers Assign points to clusters
Clustering with K-Means 
Assign Choose cluster centers points to clusters
Clustering with K-Means 
Choose cluster centers Assign points to clusters
Clustering with K-Means 
Choose cluster centers Assign points to clusters
Clustering with K-Means 
Data distributed by instance (point/row) 
Smart initialization 
Limited communication 
(# clusters << # instances)
K-Means: Scala 
// Load and parse data. 
val data = sc.textFile("kmeans_data.txt") 
val parsedData = data.map { x => 
Vectors.dense(x.split(‘ ').map(_.toDouble)) 
}.cache() 
// Cluster data into 5 classes using KMeans. 
val clusters = KMeans.train( 
parsedData, k = 5, numIterations = 20) 
// Evaluate clustering error. 
val cost = clusters.computeCost(parsedData) 
println("Sum of squared errors = " + cost)
K-Means: Python 
# Load and parse data. 
data = sc.textFile("kmeans_data.txt") 
parsedData = data.map(lambda line: 
array([float(x) for x in line.split(' ‘)])).cache() 
# Cluster data into 5 classes using KMeans. 
clusters = KMeans.train(parsedData, k = 5, maxIterations = 20) 
# Evaluate clustering error. 
def error(point): 
center = clusters.centers[clusters.predict(point)] 
return sqrt(sum([x**2 for x in (point - center)])) 
cost = parsedData.map(lambda point: error(point))  
.reduce(lambda x, y: x + y) 
print("Sum of squared error = " + str(cost))
Recommendation 
? 
Goal: Recommend 
movies to users
Recommendation 
Collaborative filtering 
Goal: Recommend 
movies to users 
Challenges: 
• Defining similarity 
• Dimensionality 
Millions of Users / Items 
• Sparsity
Recommendation 
Solution: Assume ratings are 
determined by a small 
number of factors. 
≈ x 
25M Users, 100K Movies 
! 2.5 trillion ratings 
With 10 factors/user 
! 250M parameters
Recommendation with Alternating Least 
Squares (ALS) 
Algorithm 
Alternating update of 
user/movie factors 
= x 
? 
?
Recommendation with Alternating Least 
Squares (ALS) 
Algorithm 
Alternating update of 
user/movie factors 
= x 
Can update factors 
in parallel 
Must be careful about 
communication
Recommendation with Alternating Least 
Squares (ALS) 
// Load and parse the data 
val data = sc.textFile("mllib/data/als/test.data") 
val ratings = data.map(_.split(',') match { 
case Array(user, item, rate) => 
Rating(user.toInt, item.toInt, rate.toDouble) 
}) 
// Build the recommendation model using ALS 
val model = ALS.train( 
ratings, rank = 10, numIterations = 20, regularizer = 0.01) 
// Evaluate the model on rating data 
val usersProducts = ratings.map { case Rating(user, product, rate) => 
(user, product) 
} 
val predictions = model.predict(usersProducts)
ALS: Today’s ML Exercise 
• Load 1M/10M ratings from MovieLens 
• Specify YOUR ratings on examples 
• Split examples into training/validation 
• Fit a model (Python or Scala) 
• Improve model via parameter tuning 
• Get YOUR recommendations
History and Overview 
Example Applications 
Ongoing Development 
Performance 
New APIs
Performance 
50 
ALS on Amazon Reviews on 16 nodes 
minutes) 
40 
(30 
Runtime 20 
10 
0 
0M 200M 400M 600M 800M MLlib 
Mahout 
Number of Ratings 
On a dataset with 660M users, 2.4M items, and 3.5B ratings 
MLlib runs in 40 minutes with 50 nodes
Performance 
Steady performance gains 
Decision Trees 
ALS 
K-Means 
Logistic 
Regression 
Ridge 
Regression 
~3X speedups on average 
Speedup 
(Spark 1.0 vs. 1.1)
Algorithms 
In Spark 1.2 
• Random Forests: ensembles of Decision Trees 
• Boosting 
Under development 
• Topic modeling 
• (many others) 
Many others!
ML Pipelines 
Typical ML workflow 
Training 
Data 
Feature 
Extraction 
Model 
Training 
Model 
Testing 
Test 
Data
ML Pipelines 
Typical ML workflow is complex. 
Training 
Data 
Feature 
Extraction 
Model 
Training 
Model 
Testing 
Test 
Data 
Feature 
Extraction 
Other 
Training 
Data 
Other Model 
Training
ML Pipelines 
Typical ML workflow is complex. 
Pipelines in 1.2 (alpha release) 
• Easy workflow construction 
• Standardized interface for model tuning 
• Testing & failing early 
Inspired by MLbase / Pipelines Project (see Evan’s talk) 
Collaboration with Databricks 
MLbase / MLOpt aims to autotune these pipelines
Datasets 
Further Integration 
with SparkSQL 
ML pipelines require Datasets 
• Handle many data types (features) 
• Keep metadata about features 
• Select subsets of features for different parts of pipeline 
• Join groups of features 
ML Dataset = SchemaRDD 
Inspired by MLbase / MLI API
MLlib Programming 
Guide 
spark.apache.org/docs/latest/mllib-guide. 
html 
Databricks training info 
databricks.com/spark-training 
Spark user lists & 
community 
spark.apache.org/community.html 
edX MOOC on Scalable 
Machine Learning 
www.edx.org/course/uc-berkeleyx/uc-berkeleyx- 
cs190-1x-scalable-machine-6066 
4-day BIDS minicourse 
in January 
Resources

Weitere ähnliche Inhalte

Was ist angesagt?

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Machine Learning with Spark MLlib
Machine Learning with Spark MLlibMachine Learning with Spark MLlib
Machine Learning with Spark MLlibTodd McGrath
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Cloudera, Inc.
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Simplilearn
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming AnalyticsGuido Schmutz
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaObjectRocket
 
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data AnalyticsSupersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analyticsmason_s
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceSijie Guo
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBaseAnil Gupta
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflowDatabricks
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 

Was ist angesagt? (20)

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Apache Solr
Apache SolrApache Solr
Apache Solr
 
Machine Learning with Spark MLlib
Machine Learning with Spark MLlibMachine Learning with Spark MLlib
Machine Learning with Spark MLlib
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and Kibana
 
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data AnalyticsSupersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage Service
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Yaml
YamlYaml
Yaml
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflow
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 

Andere mochten auch

Crab: A Python Framework for Building Recommender Systems
Crab: A Python Framework for Building Recommender Systems Crab: A Python Framework for Building Recommender Systems
Crab: A Python Framework for Building Recommender Systems Marcel Caraciolo
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Varad Meru
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine LearningCarol McDonald
 
Collaborative Filtering using KNN
Collaborative Filtering using KNNCollaborative Filtering using KNN
Collaborative Filtering using KNNŞeyda Hatipoğlu
 
Recommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionRecommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionWill Johnson
 
Collaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro AnalyticsCollaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro AnalyticsNavisro Analytics
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Spark Summit
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibIMC Institute
 

Andere mochten auch (8)

Crab: A Python Framework for Building Recommender Systems
Crab: A Python Framework for Building Recommender Systems Crab: A Python Framework for Building Recommender Systems
Crab: A Python Framework for Building Recommender Systems
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
Collaborative Filtering using KNN
Collaborative Filtering using KNNCollaborative Filtering using KNN
Collaborative Filtering using KNN
 
Recommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionRecommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS Function
 
Collaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro AnalyticsCollaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro Analytics
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlib
 

Ähnlich wie MLlib: Spark's Machine Learning Library

A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDatabricks
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondDataWorks Summit
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondXiangrui Meng
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
What’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackWhat’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackTuri, Inc.
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyDatabricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkDatabricks
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsDatabricks
 
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Databricks
 
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Karthik Murugesan
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Spark Summit
 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDatabricks
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 

Ähnlich wie MLlib: Spark's Machine Learning Library (20)

A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
Apache Spark MLlib
Apache Spark MLlib Apache Spark MLlib
Apache Spark MLlib
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
What’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackWhat’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics Stack
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new Directions
 
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
 
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui Meng
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 

Mehr von jeykottalam

AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Introjeykottalam
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Concurrency Control for Parallel Machine Learning
Concurrency Control for Parallel Machine LearningConcurrency Control for Parallel Machine Learning
Concurrency Control for Parallel Machine Learningjeykottalam
 
SparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at ScaleSparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at Scalejeykottalam
 
SampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS StackSampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS Stackjeykottalam
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelinesjeykottalam
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascentjeykottalam
 
The BDAS Open Source Community
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Communityjeykottalam
 

Mehr von jeykottalam (8)

AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Intro
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Concurrency Control for Parallel Machine Learning
Concurrency Control for Parallel Machine LearningConcurrency Control for Parallel Machine Learning
Concurrency Control for Parallel Machine Learning
 
SparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at ScaleSparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at Scale
 
SampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS StackSampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS Stack
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascent
 
The BDAS Open Source Community
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Community
 

MLlib: Spark's Machine Learning Library

  • 1. About Me • Postdoc in AMPLab • Led initial development of MLlib • Technical Advisor for Databricks • Assistant Professor at UCLA • Research interests include scalability and ease-of- use issues in statistical machine learning
  • 2. MLlib: Spark’s Machine Learning Library Ameet Talwalkar AMPCAMP 5 November 20, 2014
  • 3. History and Overview Example Applications Ongoing Development
  • 4. MLbase and MLlib MLOpt MLI MLlib Apache Spark MLbase aims to simplify development and deployment of scalable ML pipelines MLlib: Spark’s core ML library MLI, Pipelines: APIs to simplify ML development • Tables, Matrices, Optimization, ML Pipelines Experimental Testbeds Production MLOpt: Declarative layer to automate hyperparameter tuning Code Pipelines
  • 5. History of MLlib Initial Release • Developed by MLbase team in AMPLab (11 contributors) • Scala, Java • Shipped with Spark v0.8 (Sep 2013) 15 months later… • 80+ contributors from various organization • Scala, Java, Python • Latest release part of Spark v1.1 (Sep 2014)
  • 6. What’s in MLlib? • Alternating Least Squares • Lasso • Ridge Regression • Logistic Regression • Decision Trees • Naïve Bayes • Support Vector Machines • K-Means • Gradient descent • L-BFGS • Random data generation • Linear algebra • Feature transformations • Statistics: testing, correlation • Evaluation metrics Collaborative Filtering for Recommendation Prediction Clustering Optimization Many Utilities
  • 7. Benefits of MLlib • Part of Spark • Integrated data analysis workflow • Free performance gains Apache Spark SparkSQL Spark Streaming MLlib GraphX
  • 8. Benefits of MLlib • Part of Spark • Integrated data analysis workflow • Free performance gains • Scalable • Python, Scala, Java APIs • Broad coverage of applications & algorithms • Rapid improvements in speed & robustness
  • 9. History and Overview Example Applications Use Cases Distributed ML Challenges Code Examples Ongoing Development
  • 10. Clustering with K-Means Given data points Find meaningful clusters
  • 11. Clustering with K-Means Choose cluster centers Assign points to clusters
  • 12. Clustering with K-Means Assign Choose cluster centers points to clusters
  • 13. Clustering with K-Means Choose cluster centers Assign points to clusters
  • 14. Clustering with K-Means Choose cluster centers Assign points to clusters
  • 15. Clustering with K-Means Data distributed by instance (point/row) Smart initialization Limited communication (# clusters << # instances)
  • 16. K-Means: Scala // Load and parse data. val data = sc.textFile("kmeans_data.txt") val parsedData = data.map { x => Vectors.dense(x.split(‘ ').map(_.toDouble)) }.cache() // Cluster data into 5 classes using KMeans. val clusters = KMeans.train( parsedData, k = 5, numIterations = 20) // Evaluate clustering error. val cost = clusters.computeCost(parsedData) println("Sum of squared errors = " + cost)
  • 17. K-Means: Python # Load and parse data. data = sc.textFile("kmeans_data.txt") parsedData = data.map(lambda line: array([float(x) for x in line.split(' ‘)])).cache() # Cluster data into 5 classes using KMeans. clusters = KMeans.train(parsedData, k = 5, maxIterations = 20) # Evaluate clustering error. def error(point): center = clusters.centers[clusters.predict(point)] return sqrt(sum([x**2 for x in (point - center)])) cost = parsedData.map(lambda point: error(point)) .reduce(lambda x, y: x + y) print("Sum of squared error = " + str(cost))
  • 18. Recommendation ? Goal: Recommend movies to users
  • 19. Recommendation Collaborative filtering Goal: Recommend movies to users Challenges: • Defining similarity • Dimensionality Millions of Users / Items • Sparsity
  • 20. Recommendation Solution: Assume ratings are determined by a small number of factors. ≈ x 25M Users, 100K Movies ! 2.5 trillion ratings With 10 factors/user ! 250M parameters
  • 21. Recommendation with Alternating Least Squares (ALS) Algorithm Alternating update of user/movie factors = x ? ?
  • 22. Recommendation with Alternating Least Squares (ALS) Algorithm Alternating update of user/movie factors = x Can update factors in parallel Must be careful about communication
  • 23. Recommendation with Alternating Least Squares (ALS) // Load and parse the data val data = sc.textFile("mllib/data/als/test.data") val ratings = data.map(_.split(',') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) }) // Build the recommendation model using ALS val model = ALS.train( ratings, rank = 10, numIterations = 20, regularizer = 0.01) // Evaluate the model on rating data val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) } val predictions = model.predict(usersProducts)
  • 24. ALS: Today’s ML Exercise • Load 1M/10M ratings from MovieLens • Specify YOUR ratings on examples • Split examples into training/validation • Fit a model (Python or Scala) • Improve model via parameter tuning • Get YOUR recommendations
  • 25. History and Overview Example Applications Ongoing Development Performance New APIs
  • 26. Performance 50 ALS on Amazon Reviews on 16 nodes minutes) 40 (30 Runtime 20 10 0 0M 200M 400M 600M 800M MLlib Mahout Number of Ratings On a dataset with 660M users, 2.4M items, and 3.5B ratings MLlib runs in 40 minutes with 50 nodes
  • 27. Performance Steady performance gains Decision Trees ALS K-Means Logistic Regression Ridge Regression ~3X speedups on average Speedup (Spark 1.0 vs. 1.1)
  • 28. Algorithms In Spark 1.2 • Random Forests: ensembles of Decision Trees • Boosting Under development • Topic modeling • (many others) Many others!
  • 29. ML Pipelines Typical ML workflow Training Data Feature Extraction Model Training Model Testing Test Data
  • 30. ML Pipelines Typical ML workflow is complex. Training Data Feature Extraction Model Training Model Testing Test Data Feature Extraction Other Training Data Other Model Training
  • 31. ML Pipelines Typical ML workflow is complex. Pipelines in 1.2 (alpha release) • Easy workflow construction • Standardized interface for model tuning • Testing & failing early Inspired by MLbase / Pipelines Project (see Evan’s talk) Collaboration with Databricks MLbase / MLOpt aims to autotune these pipelines
  • 32. Datasets Further Integration with SparkSQL ML pipelines require Datasets • Handle many data types (features) • Keep metadata about features • Select subsets of features for different parts of pipeline • Join groups of features ML Dataset = SchemaRDD Inspired by MLbase / MLI API
  • 33. MLlib Programming Guide spark.apache.org/docs/latest/mllib-guide. html Databricks training info databricks.com/spark-training Spark user lists & community spark.apache.org/community.html edX MOOC on Scalable Machine Learning www.edx.org/course/uc-berkeleyx/uc-berkeleyx- cs190-1x-scalable-machine-6066 4-day BIDS minicourse in January Resources