Learning Spark ch11 - Machine Learning with MLlib
1. Chapter 11: Machine Learning with MLlib
Learning Spark
by Holden Karau et al.
2. Overview: Machine Learning with MLlib
System Requirements
Machine Learning Basics
Data Types
Algorithms
Feature Extraction
Statistics
Classification and Regression
Clustering
Collaborative Filtering and Recommendation
Dimensionality Reduction
Model Evaluation
Tips and Performance Considerations
Pipeline API
Conclusion
3. 11.1 Overview
MLlib’s design and philosophy are simple: it lets you
invoke various algorithms on distributed datasets,
representing all data as RDDs.
It contains only parallel algorithms that run well on
clusters
In Spark 1.0 and 1.1, MLlib’s interface is relatively
low-level
In Spark 1.2, MLlib gains an additional pipeline API
for building machine learning pipelines.
4. 11.2 System Requirements
MLlib requires some linear algebra libraries to be
installed on your machines:
the gfortran runtime library
To use MLlib in Python, you will need NumPy:
install the python-numpy or numpy package through your package
manager on Linux,
or use a third-party scientific Python distribution like
Anaconda.
5. Edx and Coursera Courses
Introduction to Big Data with Apache Spark
Spark Fundamentals I
Functional Programming Principles in Scala
6. 11.3 Machine Learning Basics
Machine learning algorithms attempt to make
predictions or decisions based on training data.
All learning algorithms require defining a set of features
for each item
Most algorithms are defined only for numerical features
specifically, a vector of numbers representing the value for each
feature
Once data is represented as feature vectors, most
machine learning algorithms optimize a well-defined
mathematical function based on these vectors
Finally, most learning algorithms have multiple
parameters that can affect results
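The ideas above can be sketched in plain Python (not MLlib itself): each item becomes a (label, feature vector) pair, and a linear model then scores items with a dot product over those vectors. The spam features and weight values here are invented for illustration.

```python
# Illustrative sketch (plain Python, not MLlib): representing spam/non-spam
# emails as numeric feature vectors. Features and values are invented.

# Each item is a (label, feature_vector) pair, analogous to MLlib's
# LabeledPoint: label 1.0 = spam, 0.0 = not spam.
training_data = [
    (1.0, [5.0, 1.0, 0.0]),  # features: [num_links, contains_"free", sender_known]
    (0.0, [0.0, 0.0, 1.0]),
    (1.0, [3.0, 1.0, 0.0]),
]

# Most algorithms then optimize a mathematical function over these
# vectors; e.g. a linear model scores an item by a dot product.
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

weights = [0.5, 2.0, -1.0]  # hypothetical learned weights
score = dot(weights, training_data[0][1])  # 0.5*5 + 2.0*1 - 1.0*0 = 4.5
```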
8. 11.4 Data Types
MLlib contains a few specific data types, located in:
the org.apache.spark.mllib package (Java/Scala)
pyspark.mllib (Python)
The main ones are:
Vector
LabeledPoint
Rating
Various Model classes
9. 11.5 Algorithms
Feature Extraction
mllib.feature package
Statistics
mllib.stat.Statistics class.
Classification and Regression
use the LabeledPoint class (which resides in the mllib.regression package)
Clustering
K-means, as well as a variant called K-means||
Collaborative Filtering and Recommendation
mllib.recommendation.ALS class
Dimensionality Reduction
Model Evaluation
10. 11.5.1 Feature Extraction
TF-IDF (Term Frequency–Inverse Document Frequency)
computes two statistics for each term in each document:
the term frequency (TF)
the inverse document frequency (IDF)
MLlib has two algorithms that compute TF-IDF: HashingTF and IDF
Scaling
Normalization
Word2Vec
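As a concrete illustration of the two statistics, here is a plain-Python sketch of textbook TF-IDF. This is not the MLlib HashingTF/IDF API: the toy documents are invented, and MLlib's IDF uses a slightly smoothed variant of the formula below.

```python
import math

# Toy corpus (invented) for illustrating TF-IDF by hand.
docs = [
    ["spark", "mllib", "spark"],
    ["spark", "sql"],
    ["python", "numpy"],
]

def tf(term, doc):
    # term frequency: occurrences of the term in the document
    return doc.count(term)

def idf(term, docs):
    # inverse document frequency: log(N / number of docs containing term)
    n_t = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_t)

# "spark" appears twice in doc 0 but in only 2 of 3 documents,
# so its TF-IDF weight there is 2 * log(3/2).
tfidf = tf("spark", docs[0]) * idf("spark", docs)
```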
11. 11.5.2 Statistics
MLlib offers several widely used statistic functions
that work directly on RDDs
Statistics.colStats(rdd)
Statistics.corr(rdd, method)
Statistics.corr(rdd1, rdd2, method)
Statistics.chiSqTest(rdd)
Apart from these methods, RDDs containing
numeric data offer several basic statistics such as
mean(), stdev(), and sum()
RDDs support sample() and sampleByKey() to build
simple and stratified samples of data.
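A plain-Python sketch of what these statistics compute on a single machine; MLlib evaluates the same quantities in parallel over RDDs. The sample values are invented.

```python
import math

# Two toy numeric columns (invented) for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mean(v):
    return sum(v) / len(v)

def stdev(v):
    # population standard deviation, matching what RDD.stdev() computes
    m = mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / len(v))

def pearson_corr(a, b):
    # Pearson correlation, the default 'method' of Statistics.corr
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (math.sqrt(sum((x - ma) ** 2 for x in a)) *
                  math.sqrt(sum((y - mb) ** 2 for y in b)))
```

Since ys is an exact multiple of xs, their Pearson correlation is 1.0.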
12. 11.5.3 Classification and Regression
Classification and regression are two common forms
of supervised learning
The difference between them:
in classification, the variable is discrete
in regression, the variable predicted is continuous
MLlib includes a variety of methods:
Linear regression
Logistic regression
Support Vector Machines
Naive Bayes
Decision trees and random forests
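To make the supervised-learning setup concrete, here is a minimal plain-Python sketch of logistic regression trained by batch gradient descent, the same kind of model MLlib's logistic regression fits at scale. The data points (with a leading 1.0 bias feature), learning rate, and epoch count are all invented for illustration.

```python
import math

# Toy labeled data: (feature_vector, label); first feature is a bias term.
data = [([1.0, 0.0, 1.0], 0.0), ([1.0, 1.0, 3.0], 1.0),
        ([1.0, 0.5, 0.5], 0.0), ([1.0, 2.0, 2.5], 1.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    # probability that x belongs to class 1
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

w = [0.0, 0.0, 0.0]
lr = 0.5
for _ in range(500):                 # full-batch gradient-descent epochs
    grad = [0.0, 0.0, 0.0]
    for x, y in data:
        err = predict(w, x) - y      # gradient of log loss w.r.t. the score
        for j in range(len(w)):
            grad[j] += err * x[j]
    for j in range(len(w)):
        w[j] -= lr * grad[j] / len(data)
```

After training, points from class 1 should get probabilities above 0.5 and points from class 0 below it.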
13. 11.5.4 Clustering
Clustering is the unsupervised learning task that
involves grouping objects into clusters of high
similarity
MLlib includes the popular K-means algorithm for
clustering, as well as a variant called K-means||
K-means|| is similar to the K-means++ initialization
procedure often used in single-node settings.
To invoke K-means:
create a mllib.clustering.KMeans object (in Java/Scala)
or call KMeans.train (in Python).
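The algorithm's inner loop can be sketched in plain Python: repeatedly assign each point to its closest center, then recompute each center as the mean of its assigned points. The toy points and initial centers are invented; MLlib runs these same steps in parallel over an RDD.

```python
# Toy 2-D points forming two obvious clusters (invented data).
points = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (10.0, 10.0)]
centers = [(0.0, 0.5), (9.5, 9.5)]   # k=2, assumed already initialized

def closest(p, centers):
    # index of the nearest center by squared Euclidean distance
    return min(range(len(centers)),
               key=lambda i: (p[0] - centers[i][0]) ** 2 +
                             (p[1] - centers[i][1]) ** 2)

for _ in range(10):                  # iterate assign/update steps
    clusters = {i: [] for i in range(len(centers))}
    for p in points:
        clusters[closest(p, centers)].append(p)
    centers = [
        (sum(p[0] for p in ps) / len(ps), sum(p[1] for p in ps) / len(ps))
        for ps in clusters.values()
    ]
```

(A real implementation must also handle clusters that become empty; the toy data here never triggers that case.)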
14. 11.5.5 Collaborative Filtering and
Recommendation
Collaborative filtering:
is a technique for recommender systems that learns from users' past
interactions with items
is attractive because it needs only this interaction data, not
predefined features for users or items
MLlib includes an implementation of Alternating
Least Squares (ALS)
It is located in the mllib.recommendation.ALS class.
To use ALS, you need to give it an RDD of
mllib.recommendation.Rating objects
there are two variants of ALS: for explicit ratings (the default)
and for implicit ratings
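Once ALS has been trained, prediction reduces to a dot product between latent factor vectors: each user and each product gets a small feature vector, and the predicted rating is their inner product. A plain-Python sketch with invented users, products, and factor values:

```python
# Invented latent factors of the kind ALS learns (2 features per entity).
user_features = {
    "alice": [1.2, 0.4],
    "bob":   [0.1, 1.5],
}
product_features = {
    "spark_book":   [1.0, 0.2],
    "cooking_book": [0.0, 1.1],
}

def predict_rating(user, product):
    # predicted rating = dot product of user and product factor vectors
    u = user_features[user]
    p = product_features[product]
    return sum(ui * pi for ui, pi in zip(u, p))

# alice's latent taste aligns with spark_book's factors:
r = predict_rating("alice", "spark_book")   # 1.2*1.0 + 0.4*0.2 = 1.28
```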
15. 11.5.6 Dimensionality Reduction
Principal component analysis (PCA)
the mapping to the lower-dimensional space is done such that the
variance of the data in the lower-dimensional representation is
maximized.
PCA is currently available only in Java and Scala (as of MLlib 1.2).
Singular value decomposition (SVD)
The SVD factorizes an m × n matrix A into three matrices, A ≈ UΣVᵀ:
U is an orthonormal matrix, whose columns are called left singular
vectors.
Σ is a diagonal matrix with nonnegative entries in descending
order; its diagonal entries are called singular values.
V is an orthonormal matrix, whose columns are called right singular
vectors.
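A small plain-Python illustration of these definitions: power iteration on AᵀA recovers the top right singular vector v, and σ = ‖Av‖ gives the corresponding singular value. The 2×2 matrix is invented; MLlib computes the full factorization on distributed matrices.

```python
import math

# Invented 2x2 matrix whose singular values are obviously 3 and 1.
A = [[3.0, 0.0],
     [0.0, 1.0]]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(M, N):
    Nt = transpose(N)
    return [[sum(a * b for a, b in zip(row, col)) for col in Nt] for row in M]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

AtA = matmul(transpose(A), A)   # eigenvectors of A^T A are right singular vectors
v = [1.0, 1.0]
for _ in range(50):             # power iteration toward the top eigenvector
    w = matvec(AtA, v)
    v = [x / norm(w) for x in w]

sigma = norm(matvec(A, v))      # top singular value = ||A v||
```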
16. 11.5.7 Model Evaluation
In Spark 1.2, MLlib contains an experimental set of
model evaluation functions, though only in Java and
Scala.
In future versions of Spark, the pipeline API is
expected to include evaluation functions in all
languages.
17. 11.6 Tips and Performance Considerations
Preparing Features
Scale your input features.
Featurize text correctly.
Label classes correctly.
Configuring Algorithms
Caching RDDs to Reuse
if the data does not fit in memory, try persist(StorageLevel.DISK_ONLY).
Recognizing Sparsity
Level of Parallelism
18. 11.7 Pipeline API
Starting in Spark 1.2
This API is similar to the pipeline API in scikit-learn.
a pipeline is a series of algorithms (either feature
transformation or model fitting) that transform a dataset.
Each stage of the pipeline may have parameters
The pipeline API uses a uniform representation of
datasets throughout, which is SchemaRDDs from
Spark SQL
The pipeline API is still experimental at the time of
writing
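The pipeline concept can be sketched in a few lines of plain Python: a list of stages applied in order, each possibly parameterized. The Pipeline class and stage functions here are invented illustrations, not the Spark API, which builds its stages over SchemaRDDs.

```python
# Two toy feature-transformation stages (invented names).
def tokenize(dataset):
    return [doc.lower().split() for doc in dataset]

def remove_short_words(dataset, min_len=3):      # min_len is a stage parameter
    return [[w for w in doc if len(w) >= min_len] for doc in dataset]

class Pipeline:
    """A series of algorithms that transform a dataset, run in order."""
    def __init__(self, stages):
        self.stages = stages

    def run(self, dataset):
        for stage in self.stages:
            dataset = stage(dataset)
        return dataset

pipe = Pipeline([tokenize, remove_short_words])
result = pipe.run(["Spark MLlib is fun", "ML on RDDs"])
```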
20. 11.8 Conclusion
The library ties directly to Spark's other APIs,
letting you work on RDDs and get back results you
can use in other Spark functions.
MLlib is one of the most actively developed parts of
Spark, so it is still evolving.
Editor's note
Spark is a "computational engine" that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster.
First, all libraries and higher-level components in the stack benefit from improvements at the lower layers.
Second, the costs associated with running the stack are minimized, because instead of running 5–10 independent software systems, an organization needs to run only one.
Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly combine different processing models.