SlideShare ist ein Scribd-Unternehmen logo
1 von 64
Spark MLlib
• Experience
Vpon Data Engineer
TWM, Keywear, Nielsen
• Bryan’s notes for data analysis
http://bryannotes.blogspot.tw
• Spark.TW
• Linikedin
https://tw.linkedin.com/pub/bryan-yang/7b/763/a79
ABOUT ME
Agenda
• Introduction to Machine Learning
• Why MLlib
• Basic Statistic
• K-means
• Logistic Regression
• ALS
• Demo
Introduction to Machine
Learning
Definition
• Machine learning is a study of computer algorithms
that improve automatically through experience.
Terminology
• Observation
- The basic unit of data.
• Features
- Features is that can be used to describe each observation
in a quantitative manner.
• Feature vector
- is an n-dimensional vector of numerical features that
represents some object.
• Training/ Testing/ Evaluation set
- Set of data to discover potential predictive relationships
Learning(Training)
• Features:
1. Color: Red/ Green
2. Type: Fruit
3. Shape: nearly Circle
etc…
Learning(Training)
ID Color Type Shape
is
apple (
Label)
1 Red Fruit Cirle Y
2 Red Fruit Cirle Y
3 Black Logo
nearly
Cirle
N
4 Blue N/A Cirle N
Categories of Machine Learning
• Classification: predict class from observations.
• Clustering: group observations into meaningful
groups.
• Regression: predict value from observations.
Cluster the observations with no Labels
https://en.wikipedia.org/wiki/Cluster_analysis
Cut the observations
http://stats.stackexchange.com/questions/159957/how-to-do-one-vs-one-classification-for-logistic-regression
Find a model to describe the
observations
https://commons.wikimedia.org/wiki/File:Linear_regression.svg
Machine Learning
Pipelines
http://www.nltk.org/book/ch06.html
What is MLlib
What is MLlib
• MLlib is an Apache Spark component focusing on
machine learning:
• MLlib is Spark’s core ML library
• Developed by MLbase team in AMPLab
• 80+ contributions from various organization
• Support Scala, Python, and Java APIs
Algorithms in MLlib
• Statistics: Description, correlation
• Clustering: k-means
• Collaborative filtering: ALS
• Classification: SVMs, naive Bayes, decision tree.
• Regression: linear regression, logistic regression
• Dimensionality: SVD, PCA
Why Mllib
• Scalability
• Performance
• user-friendly documentation and APIs
• Cost of maintenance
Performance
Data Type
• Dense vector
• Sparse vector
• Labeled point
Dense & Sparse
• Raw Data:
ID
A
B C D E F
1 1 0 0 0 0 3
2 0 1 0 1 0 2
3 1 1 1 0 1 1
Dense vs Sparse
• Training Set:
- number of example: 12 million
- number of features: 500
- sparsity: 10%
• Not only save storage, but also received a 4x speed
up
Dense Sparse
Storge 47GB 7GB
Time 240s 58s
Labeled Point
• Dummy variable (1,0)
• Categorical variable (0, 1, 2, …)
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
DEMO
Descriptive Statistics
• Supported function:
- count
- max
- min
- mean
- variance
…
• Supported data types
- Dense
- Sparse
- Labeled Point
Example
from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors
import numpy as np
## example data(2 x 2 matrix at least)
data= np.array([[1.0,2.0,3.0,4.0,5.0],[1.0,2.0,3.0,4.0,5.0]])
## to RDD
distData = sc.parallelize(data)
## Compute Statistic Value
summary = Statistics.colStats(distData)
print "Duration Statistics:"
print " Mean: {}".format(round(summary.mean()[0],3))
print " St. deviation: {}".format(round(sqrt(summary.variance()[0]),3))
print " Max value: {}".format(round(summary.max()[0],3))
print " Min value: {}".format(round(summary.min()[0],3))
print " Total value count: {}".format(summary.count())
print " Number of non-zero values: {}".format(summary.numNonzeros()[0])
DEMO
Quiz 1
• Which statement is wrong purpose of descriptive statistic?
1. Find the extreme value of the data.
2. Understanding the data properties preliminary.
3. Describe the full properties of the data.
Correlation
https://commons.wikimedia.org/wiki/File:Linear_Correlation_Examples
Example
• Data Set
from pyspark.mllib.stat import Statistics
## 選擇統計方法
corrType='pearson'
## 建立測試資料及
group1 = sc.parallelize([1,3,4,3,7,5,6,8])
group2 = sc.parallelize([1,2,3,4,5,6,7,8])
corr = Statistics.corr(group1,group2, corrType)
print corr
# 0.87
A B C D E F G H
Case1 1 3 4 3 7 5 6 8
Case2 1 2 3 4 5 6 7 8
Shape is important
http://2012books.lardbucket.org/books/beginning-psychology/
DEMO
Quiz 2
• Which situation below can not use pearson
correlation?
1. When two variables have no correlation
2. When two variables have correlation of indices
3. When count of data rows > 30
4. When variables are normal distribution.
Clustering Model
K-Means
K-means
• K-means clustering aims to partition n observations
into k clusters in which each observation belongs to
the cluster with the nearest mean, serving as a
prototype of the cluster.
Clustering with K-Means
Clustering with K-Means
Clustering with K-Means
Clustering with K-Means
Clustering with K-Means
Clustering with K-Means
Summary
https://en.wikipedia.org/wiki/K-means_clustering
K-Means: Python
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
# Load and parse the data
data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
runs=10, initializationMode="random")
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
DEMO
• Which data is not suitable for K-means analysis?
Quiz 3
Classification Model
Logistic Regression
linear regression
https://commons.wikimedia.org/wiki/File:Linear_regression.svg
When outcome is only 1/0
http://www.slideshare.net/21_venkat/logistic-regression-17406472
Hypotheses function
• hypotheses:
• sigmoid function:
• Maximum Likelihood estimation
Cost Function
if y = 0if y = 1 regularized
Sample Code
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array
# Load and parse the data
def parsePoint(line):
values = [float(x) for x in line.split(' ')]
return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)
# Build the model
model = LogisticRegressionWithSGD.train(parsedData)
# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() /
float(parsedData.count())
print("Training Error = " + str(trainErr))
Quiz 4
• What situation is not suitable for logistic regression?
1. data with multiple label point
2. data with no label
3. data with more the 1,000 features
4. data with dummy features
DEMO
Recommendation Model
Alternating Least Squares
Definition
• Collaborative Filtering(CF) is a subset of algorithms
that exploit other users and items along with their
ratings(selection, purchase information could be
also used) and target user history to recommend an
item that target user does not have ratings for.
• Fundamental assumption behind this approach is
that other users preference over the items could be
used recommending an item to the user who did not
see the item or purchase before.
Definition
• CF differs itself from content-based methods in the
sense that user or the item itself does not play a
role in recommeniation but rather how(rating) and
which users(user) rated a particular item.
(Preference of users is shared through a group of
users who selected, purchased or rate items
similarly)
Recommendation
Recommendation
Recommendation
Recommendation
ALS: Python
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))
DEMO
Quiz 5
• Search for other recommendation algorithms, point
out the difference with ALS.
- Item Base
- User Base
- Content Base
Reference
• Machine Learning Library (MLlib) Guide
http://spark.apache.org/docs/1.4.1/mllib-guide.html
• MLlib: Spark's Machine Learning Library
http://www.slideshare.net/jeykottalam/mllib
• Recent Developments in Spark MLlib and Beyond
http://www.slideshare.net/Hadoop_Summit
• Introduction to Machine Learning
http://www.slideshare.net/rahuldausa/introduction-to-machine-
learning

Weitere ähnliche Inhalte

Was ist angesagt?

Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDatabricks
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKKriangkrai Chaonithi
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseDatabricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Consolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsConsolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsDatabricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibTaras Matyashovsky
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySparkSpark Summit
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
Apples and Oranges - Comparing Kafka Streams and Flink with Bill Bejeck
Apples and Oranges - Comparing Kafka Streams and Flink with Bill BejeckApples and Oranges - Comparing Kafka Streams and Flink with Bill Bejeck
Apples and Oranges - Comparing Kafka Streams and Flink with Bill BejeckHostedbyConfluent
 
Introducing Change Data Capture with Debezium
Introducing Change Data Capture with DebeziumIntroducing Change Data Capture with Debezium
Introducing Change Data Capture with DebeziumChengKuan Gan
 
Battle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveBattle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveYingjun Wu
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...Databricks
 

Was ist angesagt? (20)

Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Consolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsConsolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest Airports
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Apples and Oranges - Comparing Kafka Streams and Flink with Bill Bejeck
Apples and Oranges - Comparing Kafka Streams and Flink with Bill BejeckApples and Oranges - Comparing Kafka Streams and Flink with Bill Bejeck
Apples and Oranges - Comparing Kafka Streams and Flink with Bill Bejeck
 
Introducing Change Data Capture with Debezium
Introducing Change Data Capture with DebeziumIntroducing Change Data Capture with Debezium
Introducing Change Data Capture with Debezium
 
Battle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveBattle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWave
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
 

Andere mochten auch

Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for TrainingBryan Yang
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regressionnszakir
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSpark Summit
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Choosing Regression Models
Choosing Regression ModelsChoosing Regression Models
Choosing Regression ModelsStephen Senn
 
Build your ETL job using Jenkins - step by step
Build your ETL job using Jenkins - step by stepBuild your ETL job using Jenkins - step by step
Build your ETL job using Jenkins - step by stepBryan Yang
 
Skiena algorithm 2007 lecture10 graph data strctures
Skiena algorithm 2007 lecture10 graph data strcturesSkiena algorithm 2007 lecture10 graph data strctures
Skiena algorithm 2007 lecture10 graph data strctureszukun
 
Building your bi system-HadoopCon Taiwan 2015
Building your bi system-HadoopCon Taiwan 2015Building your bi system-HadoopCon Taiwan 2015
Building your bi system-HadoopCon Taiwan 2015Bryan Yang
 
Data Scientist's Daily Life
Data Scientist's Daily LifeData Scientist's Daily Life
Data Scientist's Daily LifeBryan Yang
 
Anomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingAnomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingKeira Zhou
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 
DSP 資料科學計畫簡介
DSP 資料科學計畫簡介DSP 資料科學計畫簡介
DSP 資料科學計畫簡介codefortomorrow
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnDatabricks
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibIMC Institute
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 

Andere mochten auch (17)

Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regression
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya Hristakeva
 
02 newton-raphson
02 newton-raphson02 newton-raphson
02 newton-raphson
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Choosing Regression Models
Choosing Regression ModelsChoosing Regression Models
Choosing Regression Models
 
Build your ETL job using Jenkins - step by step
Build your ETL job using Jenkins - step by stepBuild your ETL job using Jenkins - step by step
Build your ETL job using Jenkins - step by step
 
Skiena algorithm 2007 lecture10 graph data strctures
Skiena algorithm 2007 lecture10 graph data strcturesSkiena algorithm 2007 lecture10 graph data strctures
Skiena algorithm 2007 lecture10 graph data strctures
 
Building your bi system-HadoopCon Taiwan 2015
Building your bi system-HadoopCon Taiwan 2015Building your bi system-HadoopCon Taiwan 2015
Building your bi system-HadoopCon Taiwan 2015
 
Data Scientist's Daily Life
Data Scientist's Daily LifeData Scientist's Daily Life
Data Scientist's Daily Life
 
Anomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingAnomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark Streaming
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
DSP 資料科學計畫簡介
DSP 資料科學計畫簡介DSP 資料科學計畫簡介
DSP 資料科學計畫簡介
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlib
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 

Ähnlich wie Spark MLlib - Training Material

Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningmy6305874
 
Nose Dive into Apache Spark ML
Nose Dive into Apache Spark MLNose Dive into Apache Spark ML
Nose Dive into Apache Spark MLAhmet Bulut
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with sparkModern Data Stack France
 
More on Pandas.pptx
More on Pandas.pptxMore on Pandas.pptx
More on Pandas.pptxVirajPathania1
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation SystemsRobin Reni
 
Python-for-Data-Analysis.pdf
Python-for-Data-Analysis.pdfPython-for-Data-Analysis.pdf
Python-for-Data-Analysis.pdfssuser598883
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxSandeep Singh
 
Python for Data Analysis.pdf
Python for Data Analysis.pdfPython for Data Analysis.pdf
Python for Data Analysis.pdfJulioRecaldeLara1
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxtangadhurai
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsYalçĹn Yenigßn
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineMichael Gerke
 
background.pptx
background.pptxbackground.pptx
background.pptxKabileshCm
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for EveryoneAly Abdelkareem
 
The ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxThe ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxRuby Shrestha
 

Ähnlich wie Spark MLlib - Training Material (20)

Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
 
Nose Dive into Apache Spark ML
Nose Dive into Apache Spark MLNose Dive into Apache Spark ML
Nose Dive into Apache Spark ML
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
More on Pandas.pptx
More on Pandas.pptxMore on Pandas.pptx
More on Pandas.pptx
 
Python for data analysis
Python for data analysisPython for data analysis
Python for data analysis
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation Systems
 
Python-for-Data-Analysis.pdf
Python-for-Data-Analysis.pdfPython-for-Data-Analysis.pdf
Python-for-Data-Analysis.pdf
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 
Python for Data Analysis.pdf
Python for Data Analysis.pdfPython for Data Analysis.pdf
Python for Data Analysis.pdf
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning Applications
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning Pipeline
 
background.pptx
background.pptxbackground.pptx
background.pptx
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for Everyone
 
Ember
EmberEmber
Ember
 
The ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxThe ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptx
 

Mehr von Bryan Yang

敏捷開發心法
敏捷開發心法敏捷開發心法
敏捷開發心法Bryan Yang
 
Data pipeline essential
Data pipeline essentialData pipeline essential
Data pipeline essentialBryan Yang
 
Docker 101
Docker 101Docker 101
Docker 101Bryan Yang
 
資料分析的快樂就是如此樸實無華且枯燥
資料分析的快樂就是如此樸實無華且枯燥資料分析的快樂就是如此樸實無華且枯燥
資料分析的快樂就是如此樸實無華且枯燥Bryan Yang
 
Data pipeline 101
Data pipeline 101Data pipeline 101
Data pipeline 101Bryan Yang
 
Building a data driven business
Building a data driven businessBuilding a data driven business
Building a data driven businessBryan Yang
 
產業數據力-以傳統零售業為例
產業數據力-以傳統零售業為例產業數據力-以傳統零售業為例
產業數據力-以傳統零售業為例Bryan Yang
 
Serverless ETL
Serverless ETLServerless ETL
Serverless ETLBryan Yang
 
敏捷開發心法
敏捷開發心法敏捷開發心法
敏捷開發心法Bryan Yang
 
Introduction to docker
Introduction to dockerIntroduction to docker
Introduction to dockerBryan Yang
 

Mehr von Bryan Yang (10)

敏捷開發心法
敏捷開發心法敏捷開發心法
敏捷開發心法
 
Data pipeline essential
Data pipeline essentialData pipeline essential
Data pipeline essential
 
Docker 101
Docker 101Docker 101
Docker 101
 
資料分析的快樂就是如此樸實無華且枯燥
資料分析的快樂就是如此樸實無華且枯燥資料分析的快樂就是如此樸實無華且枯燥
資料分析的快樂就是如此樸實無華且枯燥
 
Data pipeline 101
Data pipeline 101Data pipeline 101
Data pipeline 101
 
Building a data driven business
Building a data driven businessBuilding a data driven business
Building a data driven business
 
產業數據力-以傳統零售業為例
產業數據力-以傳統零售業為例產業數據力-以傳統零售業為例
產業數據力-以傳統零售業為例
 
Serverless ETL
Serverless ETLServerless ETL
Serverless ETL
 
敏捷開發心法
敏捷開發心法敏捷開發心法
敏捷開發心法
 
Introduction to docker
Introduction to dockerIntroduction to docker
Introduction to docker
 

KĂźrzlich hochgeladen

DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectBoston Institute of Analytics
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 

KĂźrzlich hochgeladen (20)

DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 

Spark MLlib - Training Material

  • 2. • Experience Vpon Data Engineer TWM, Keywear, Nielsen • Bryan’s notes for data analysis http://bryannotes.blogspot.tw • Spark.TW • Linikedin https://tw.linkedin.com/pub/bryan-yang/7b/763/a79 ABOUT ME
  • 3. Agenda • Introduction to Machine Learning • Why MLlib • Basic Statistic • K-means • Logistic Regression • ALS • Demo
  • 5. Definition • Machine learning is a study of computer algorithms that improve automatically through experience.
  • 6. Terminology • Observation - The basic unit of data. • Features - Features is that can be used to describe each observation in a quantitative manner. • Feature vector - is an n-dimensional vector of numerical features that represents some object. • Training/ Testing/ Evaluation set - Set of data to discover potential predictive relationships
  • 7. Learning(Training) • Features: 1. Color: Red/ Green 2. Type: Fruit 3. Shape: nearly Circle etc…
  • 8. Learning(Training) ID Color Type Shape is apple ( Label) 1 Red Fruit Cirle Y 2 Red Fruit Cirle Y 3 Black Logo nearly Cirle N 4 Blue N/A Cirle N
  • 9. Categories of Machine Learning • Classification: predict class from observations. • Clustering: group observations into meaningful groups. • Regression: predict value from observations.
  • 10. Cluster the observations with no Labels https://en.wikipedia.org/wiki/Cluster_analysis
  • 12. Find a model to describe the observations https://commons.wikimedia.org/wiki/File:Linear_regression.svg
  • 15. What is MLlib • MLlib is an Apache Spark component focusing on machine learning: • MLlib is Spark’s core ML library • Developed by MLbase team in AMPLab • 80+ contributions from various organization • Support Scala, Python, and Java APIs
  • 16. Algorithms in MLlib • Statistics: Description, correlation • Clustering: k-means • Collaborative filtering: ALS • Classification: SVMs, naive Bayes, decision tree. • Regression: linear regression, logistic regression • Dimensionality: SVD, PCA
  • 17. Why Mllib • Scalability • Performance • user-friendly documentation and APIs • Cost of maintenance
  • 19. Data Type • Dense vector • Sparse vector • Labeled point
  • 20. Dense & Sparse • Raw Data: ID A B C D E F 1 1 0 0 0 0 3 2 0 1 0 1 0 2 3 1 1 1 0 1 1
  • 21. Dense vs Sparse • Training Set: - number of example: 12 million - number of features: 500 - sparsity: 10% • Not only save storage, but also received a 4x speed up Dense Sparse Storge 47GB 7GB Time 240s 58s
  • 22. Labeled Point • Dummy variable (1,0) • Categorical variable (0, 1, 2, …) from pyspark.mllib.linalg import SparseVector from pyspark.mllib.regression import LabeledPoint # Create a labeled point with a positive label and a dense feature vector. pos = LabeledPoint(1.0, [1.0, 0.0, 3.0]) # Create a labeled point with a negative label and a sparse feature vector. neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
  • 23. DEMO
  • 24. Descriptive Statistics • Supported function: - count - max - min - mean - variance … • Supported data types - Dense - Sparse - Labeled Point
  • 25. Example from pyspark.mllib.stat import Statistics from pyspark.mllib.linalg import Vectors import numpy as np ## example data2 x 2 matrix at least) data= np.array([[1.0,2.0,3.0,4.0,5.0],[1.0,2.0,3.0,4.0,5.0]]) ## to RDD distData = sc.parallelize(data) ## Compute Statistic Value summary = Statistics.colStats(distData) print "Duration Statistics:" print " Mean: {}".format(round(summary.mean()[0],3)) print " St. deviation: {}".format(round(sqrt(summary.variance()[0]),3)) print " Max value: {}".format(round(summary.max()[0],3)) print " Min value: {}".format(round(summary.min()[0],3)) print " Total value count: {}".format(summary.count()) print " Number of non-zero values: {}".format(summary.numNonzeros()[0])
  • 26. DEMO
  • 27. Quiz 1 • Which statement is wrong purpose of descriptive statistic? 1. Find the extreme value of the data. 2. Understanding the data properties preliminary. 3. Describe the full properties of the data.
  • 29. Example • Data Set from pyspark.mllib.stat import Statistics ## 選擇統計方法 corrType='pearson' ## 建立測試資料及 group1 = sc.parallelize([1,3,4,3,7,5,6,8]) group2 = sc.parallelize([1,2,3,4,5,6,7,8]) corr = Statistics.corr(group1,group2, corrType) print corr # 0.87 A B C D E F G H Case1 1 3 4 3 7 5 6 8 Case2 1 2 3 4 5 6 7 8
  • 31. DEMO
  • 32. Quiz 2 • Which situation below can not use pearson correlation? 1. When two variables have no correlation 2. When two variables have correlation of indices 3. When count of data rows > 30 4. When variables are normal distribution.
  • 34. K-means • K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
  • 42. K-Means: Python from pyspark.mllib.clustering import KMeans, KMeansModel from numpy import array from math import sqrt # Load and parse the data data = sc.textFile("data/mllib/kmeans_data.txt") parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])) # Build the model (cluster the data) clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode="random") # Evaluate clustering by computing Within Set Sum of Squared Errors def error(point): center = clusters.centers[clusters.predict(point)] return sqrt(sum([x**2 for x in (point - center)])) WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y) print("Within Set Sum of Squared Error = " + str(WSSSE))
  • 43. DEMO
  • 44. • Which data is not suitable for K-means analysis? Quiz 3
  • 47. When outcome is only 1/0 http://www.slideshare.net/21_venkat/logistic-regression-17406472
  • 49. • Maximum Likelihood estimation Cost Function if y = 0if y = 1 regularized
  • 50. Sample Code from pyspark.mllib.classification import LogisticRegressionWithSGD from pyspark.mllib.regression import LabeledPoint from numpy import array # Load and parse the data def parsePoint(line): values = [float(x) for x in line.split(' ')] return LabeledPoint(values[0], values[1:]) data = sc.textFile("data/mllib/sample_svm_data.txt") parsedData = data.map(parsePoint) # Build the model model = LogisticRegressionWithSGD.train(parsedData) # Evaluating the model on training data labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features))) trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count()) print("Training Error = " + str(trainErr))
  • 51. Quiz 4 • What situation is not suitable for logistic regression? 1. data with multiple label point 2. data with no label 3. data with more the 1,000 features 4. data with dummy features
  • 52. DEMO
  • 54. Definition • Collaborative Filtering(CF) is a subset of algorithms that exploit other users and items along with their ratings(selection, purchase information could be also used) and target user history to recommend an item that target user does not have ratings for. • Fundamental assumption behind this approach is that other users preference over the items could be used recommending an item to the user who did not see the item or purchase before.
  • 55. Definition • CF differs itself from content-based methods in the sense that user or the item itself does not play a role in recommeniation but rather how(rating) and which users(user) rated a particular item. (Preference of users is shared through a group of users who selected, purchased or rate items similarly)
  • 60.
  • 61. ALS: Python from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating # Load and parse the data data = sc.textFile("data/mllib/als/test.data") ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))) # Build the recommendation model using Alternating Least Squares rank = 10 numIterations = 20 model = ALS.train(ratings, rank, numIterations) # Evaluate the model on training data testdata = ratings.map(lambda p: (p[0], p[1])) predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2])) ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions) MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean() print("Mean Squared Error = " + str(MSE))
  • 62. DEMO
  • 63. Quiz 5 • Search for other recommendation algorithms, point out the difference with ALS. - Item Base - User Base - Content Base
  • 64. Reference • Machine Learning Library (MLlib) Guide http://spark.apache.org/docs/1.4.1/mllib-guide.html • MLlib: Spark's Machine Learning Library http://www.slideshare.net/jeykottalam/mllib • Recent Developments in Spark MLlib and Beyond http://www.slideshare.net/Hadoop_Summit • Introduction to Machine Learning http://www.slideshare.net/rahuldausa/introduction-to-machine- learning