This document provides an overview of Apache Spark's MLlib machine learning library. It discusses machine learning concepts and terminology, and the types of machine learning techniques supported by MLlib, such as classification, regression, clustering, collaborative filtering, and dimensionality reduction. It covers MLlib's algorithms, data types, and feature extraction and preprocessing capabilities. It also provides tips for using MLlib, such as preparing features, configuring algorithms, caching data, and avoiding overfitting. Finally, it introduces ML Pipelines for constructing machine learning workflows in Spark.
2. Apache Spark and Big Data
1) History and market overview
2) Installation
3) MLlib and Machine Learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Core, SQL, GraphX, Streaming
6) Spark’s distributed programming model
7) Deployment
3. Table of contents
● Machine Learning Introduction
● Spark ML Support - MLlib
● Machine Learning Techniques
● Tips & Considerations
● ML Pipelines
● Q & A
4. Machine Learning
● Subfield of Artificial Intelligence (AI)
● Construction & study of systems that can learn from data
● Computers act without being explicitly programmed
● Can be seen as building blocks to make computers
behave more intelligently
6. Terminology
● Features
o each item is described by a number of features
● Samples
o a sample is an item to process
o document, picture, row in db, graph, ...
● Feature vector
o n-dimensional vector of numerical features representing some sample
● Labelled data
o data with known classification results
8. Categories
● Supervised learning
o labelled data are available
● Unsupervised learning
o No labelled data is available
● Semi-supervised learning
o mix of Supervised and Unsupervised learning
o usually small part of data is labelled
● Reinforcement learning
o model continuously learns and relearns based on actions and the
effects/rewards of those actions
o reward feedback
9. Applications
● Speech recognition
● Effective web search
● Recommendation systems
● Computer vision
● Information retrieval
● Spam filtering
● Computational finance
● Fraud detection
● Medical diagnosis
● Stock market analysis
● Structural health monitoring
● ...
11. Benefits of MLlib
● Part of Spark
● Integrated workflow
● Scala, Java & Python API
● Broad coverage of applications & algorithms
● Rapid improvements in speed & robustness
● Ongoing development & Large community
● Easy to use, well documented
14. Data Types
● Vector
o both dense and sparse vectors
● LabeledPoint
o labelled data point for supervised learning
● Rating
o rating of a product by a user, used for recommendation
● Various Models
o result of a training algorithm
o used for predicting unknown data
● Matrices
15. Feature Extraction & Basic Statistics
● Several classes for common operations
● Scaling, normalization, statistical summary, correlation, …
● Numeric RDD operations, sampling, …
● Random generators
● Word extraction (TF-IDF)
o generating feature vectors from text documents/web pages
16. Classification
● Classify samples into predefined categories
● Supervised learning
● Binary classification (SVMs, logistic regression)
● Multiclass Classification (decision trees, naive Bayes)
● Spam vs. non-spam, fruit vs. logo, ...
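To make the idea concrete, here is a minimal multinomial naive Bayes classifier for a toy spam vs. non-spam task, written in plain Python. This is an illustrative sketch of the technique, not the MLlib API; the corpus and smoothing parameter are made up.

```python
import math
from collections import Counter

# Toy corpus: (tokens, label) pairs, label 1 = spam, 0 = non-spam.
train = [
    (["win", "money", "now"], 1),
    (["free", "money", "offer"], 1),
    (["meeting", "tomorrow", "agenda"], 0),
    (["lunch", "tomorrow"], 0),
]

def fit_nb(data, alpha=1.0):
    """Count words per class; alpha is the Laplace smoothing constant."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    vocab = set()
    for tokens, y in data:
        class_counts[y] += 1
        word_counts[y].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab, alpha

def predict_nb(model, tokens):
    """Pick the class with the highest log-posterior for the tokens."""
    word_counts, class_counts, vocab, alpha = model
    n = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for y in class_counts:
        lp = math.log(class_counts[y] / n)          # log prior
        total = sum(word_counts[y].values())
        for t in tokens:                            # log likelihoods
            lp += math.log((word_counts[y][t] + alpha) /
                           (total + alpha * len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

model = fit_nb(train)
print(predict_nb(model, ["free", "money"]))      # spam-like message
print(predict_nb(model, ["meeting", "agenda"]))  # non-spam-like message
```

Laplace smoothing keeps unseen words from driving a class probability to zero, which matters on a corpus this small.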
17. Regression
● Predict value from observations, many techniques
● Predicted values are continuous
● Supervised learning
● Linear least squares, Lasso, ridge regression, decision trees
● House prices, stock exchange, power consumption, height of person, ...
18. Linear Regression Example
● The run method trains the model
● Parameters are set with the setters setNumIterations and setIntercept
● Stochastic Gradient Descent (SGD) is used to minimize the loss function
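The SGD update underlying this kind of regression can be sketched in plain Python. This is a conceptual illustration, not the MLlib API; the toy data, step size, and iteration count are arbitrary choices.

```python
import random

# Synthetic 1-D data from y = 2x + 1 (noise-free, for clarity).
data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]

def train_sgd(data, num_iterations=2000, step=0.05):
    """Fit y = w*x + b by stochastic gradient descent on squared error."""
    w, b = 0.0, 0.0
    rng = random.Random(42)          # fixed seed for reproducibility
    for _ in range(num_iterations):
        x, y = rng.choice(data)      # one random sample per step
        err = (w * x + b) - y        # gradient of 0.5*err^2 w.r.t. prediction
        w -= step * err * x
        b -= step * err
    return w, b

w, b = train_sgd(data)
print(f"w={w:.2f}, b={b:.2f}")       # should approach w=2, b=1
```

Each step nudges the parameters against the gradient of one sample's squared error, which is why iteration count and step size (the knobs the setters expose) matter so much in practice.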
19. Clustering
● Grouping objects into groups (~ clusters) of high similarity
● Unsupervised learning -> groups are not predefined
● Number of clusters must be defined
● K-means, Gaussian Mixture Model (EM algorithm), Power Iteration Clustering (PIC), Latent Dirichlet Allocation (LDA)
20. Collaborative Filtering
● Used for recommender systems
● Creates and analyses a matrix of ratings, predicts missing entries
● Explicit (given rating) vs implicit (views, clicks, likes, shares, ...) feedback
● Alternating least squares (ALS)
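The alternating idea can be shown with a rank-1 toy version in plain Python: fix the item factors and solve for each user factor by least squares, then swap. This is a sketch of the technique, not MLlib's ALS (which uses higher rank and regularization); the rating matrix is invented.

```python
# Rank-1 alternating least squares on a tiny user x item rating matrix.
# None marks a missing entry that we want to predict.
R = [
    [5.0, 4.0, None],
    [4.0, None, 2.0],
    [None, 2.0, 1.0],
]
n_users, n_items = len(R), len(R[0])
u = [1.0] * n_users   # user factors
v = [1.0] * n_items   # item factors

for _ in range(20):
    # Fix v, solve each user factor by least squares over observed ratings.
    for i in range(n_users):
        num = sum(R[i][j] * v[j] for j in range(n_items) if R[i][j] is not None)
        den = sum(v[j] ** 2 for j in range(n_items) if R[i][j] is not None)
        u[i] = num / den
    # Fix u, solve each item factor the same way.
    for j in range(n_items):
        num = sum(R[i][j] * u[i] for i in range(n_users) if R[i][j] is not None)
        den = sum(u[i] ** 2 for i in range(n_users) if R[i][j] is not None)
        v[j] = num / den

# Predicted rating for user 0, item 2 (the missing entry).
pred = u[0] * v[2]
print(round(pred, 2))
```

Each half-step is an ordinary least-squares solve, which is what makes ALS easy to parallelize across users and items.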
21. Dimensionality Reduction
● Process of reducing the number of variables under consideration
● Performance needs, removing non-informative dimensions, plotting, ...
● Principal Component Analysis (PCA) - ignoring non-informative dims
● Singular Value Decomposition (SVD)
o factorizes matrix into 3 descriptive matrices
o storage save, noise reduction
22. Tips
● Preparing features
o each algorithm is only as good as input features
o probably the most important step in ML
o correct scaling, labeling for each algorithm
● Algorithm configuration
o performance greatly varies according to params
● Caching RDD for reuse
o most of the algorithms are iterative
o input dataset should be cached (cache() method) before passing it to an MLlib algorithm
● Recognizing sparsity
23. Overfitting
● Model is overtrained to the training data
● Model describes random errors or noise instead of underlying relationship
● Results in poor predictive performance
24. Data Partitioning
● Supervised learning
● Partitioning labelled data
● Labelled data
o Training set
set of samples used for learning
experiments with algorithm parameters
o Test set
testing fitted model
must not tune model any further
● Common split - 70/30 (training/test)
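A 70/30 split like the one above can be sketched in a few lines of plain Python (illustrative; the samples and seed are invented, and MLlib offers its own RDD sampling utilities for this):

```python
import random

# 70/30 split of labelled samples into a training and a test set.
samples = [(float(i), i % 2) for i in range(100)]   # (feature, label) pairs

rng = random.Random(0)                # fixed seed for reproducibility
shuffled = samples[:]
rng.shuffle(shuffled)                 # shuffle so the split is random

cut = int(len(shuffled) * 0.7)
training_set = shuffled[:cut]         # used for learning and parameter tuning
test_set = shuffled[cut:]             # used once, to evaluate the fitted model

print(len(training_set), len(test_set))
```

Shuffling before cutting matters: if the data is ordered (by time, by label), a plain prefix split gives the model a biased view of the distribution.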
29. Pipeline API
● Pipeline is a series of algorithms (feature transformation, model fitting, ...)
● Easy workflow construction
● Distribution of parameters into each stage
● Makes MLlib easier to use
● Uses uniform dataset representation - SchemaRDD from SparkSQL
○ multiple named columns (similar to SQL table)
31. Conclusion
● What is Machine Learning
● Machine Learning Use Cases & Techniques
● Spark’s Machine Learning library - MLlib
● Tips for using MLlib and Spark
"Reinforcement learning (RL) and supervised learning are usually portrayed as distinct methods of learning from experience. RL methods are often applied to problems involving sequential dynamics and optimization of a scalar performance objective, with online exploration of the effects of actions. Supervised learning methods, on the other hand, are frequently used for problems involving static input-output mappings and minimization of a vector error signal, with no explicit dependence on how training examples are gathered. As discussed by Barto and Dietterich (this volume), the key feature distinguishing RL and supervised learning is whether training information from the environment serves as an evaluation signal or as an error signal…"
“Term Frequency—Inverse Document Frequency, or TF-IDF, is a simple way to generate feature vectors from text documents (e.g. web pages). It computes two statistics for each term in each document: the term frequency, TF, which is the number of times the term occurs in that document, and the inverse document frequency, IDF, which measures how (in)frequently a term occurs across the whole document corpus. The product of these values, TF \times IDF, shows how relevant a term is to a specific document (i.e. if it is common in that specific document but rare in the whole corpus).”
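The two statistics described above can be computed directly for a toy corpus in plain Python. This uses the common idf = log(N/df) variant; real implementations often smooth the formula, and the corpus here is invented.

```python
import math

# Tiny corpus of tokenized "documents".
docs = [
    ["spark", "mllib", "machine", "learning"],
    ["spark", "streaming", "spark", "sql"],
    ["machine", "learning", "with", "python"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                       # TF: occurrences in this doc
    df = sum(1 for d in docs if term in d)     # docs containing the term
    idf = math.log(len(docs) / df)             # IDF: rarity across the corpus
    return tf * idf

# "spark" occurs in 2 of 3 docs; "streaming" in only 1, so "streaming"
# scores higher within doc 1 even though "spark" appears there twice.
print(tf_idf("spark", docs[1], docs))
print(tf_idf("streaming", docs[1], docs))
```

This captures the intuition in the quote: a term common in one document but rare in the corpus gets the highest weight.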
logistic regression -> data are labeled 1 or 0 -> classification
A large number of procedures have been developed for parameter estimation and inference in linear regression. These methods differ in computational simplicity of algorithms, presence of a closed-form solution, robustness with respect to heavy-tailed distributions, and theoretical assumptions needed to validate desirable statistical properties such as consistency and asymptotic efficiency.
http://www.datasciencecentral.com/profiles/blogs/10-types-of-regressions-which-one-to-use
http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Linear least squares is a standard statistical estimation method: it fits model parameters by minimizing the sum of squared residuals between the model's predictions and the observed values.
lasso (least absolute shrinkage and selection operator) - a version of least squares with an L1 penalty that can shrink some coefficients to exactly zero, effectively selecting features
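Linear least squares is one of the methods with a closed-form solution mentioned above; for a single feature the normal equations reduce to slope = cov(x, y) / var(x). A plain-Python sketch (the data here is invented, roughly y = 2x):

```python
# Closed-form simple linear regression (ordinary least squares):
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.1, 8.0, 9.9]     # roughly y = 2x

mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
var = sum((x - mx) ** 2 for x in xs) / len(xs)

slope = cov / var
intercept = my - slope * mx
print(round(slope, 3), round(intercept, 3))
```

With one feature the closed form is trivial; iterative methods such as SGD or L-BFGS become attractive when the feature count is large and forming the normal equations is expensive.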
http://people.apache.org/~pwendell/spark-1.3.0-snapshot1-docs/mllib-optimization.html
Limited-memory BFGS (L-BFGS or LM-BFGS) is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning.[1][2]
SGD is a great general-purpose optimization algorithm, and it is easy to implement. I would generally use it first, before trying something more complicated. I believe SGD is just as good as, if not superior, to L-BFGS in the not highly varying (and sometimes even convex) optimization surfaces common in current NLP models. (I would nonetheless be interested in a controlled comparison between SGD and the L-BFGS using the Berkeley cache-flushing trick.)
https://github.com/apache/spark/blob/master/docs/mllib-clustering.md
The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.
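The E/M alternation described above can be sketched for a 1-D mixture of two unit-variance Gaussians in plain Python (a simplification: real GMM implementations also estimate variances, and the data and initial guesses here are invented):

```python
import math

# Samples placed by hand around two means, roughly 0 and 5.
data = [-0.3, 0.1, 0.4, -0.2, 4.6, 5.1, 5.3, 4.9]

mu = [1.0, 4.0]           # initial mean guesses
w = [0.5, 0.5]            # mixing weights

def pdf(x, m):
    """Density of a unit-variance Gaussian with mean m."""
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

for _ in range(30):
    # E step: responsibility of each component for each point.
    resp = []
    for x in data:
        p = [w[k] * pdf(x, mu[k]) for k in range(2)]
        s = p[0] + p[1]
        resp.append([p[0] / s, p[1] / s])
    # M step: re-estimate means and weights from the responsibilities.
    for k in range(2):
        total = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / total
        w[k] = total / len(data)

print([round(m, 2) for m in mu])
```

The responsibilities computed in the E step are exactly the "distribution of the latent variables" the quote refers to.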
Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering vertices of a graph given pairwise similarties as edge properties, described in Lin and Cohen, Power Iteration Clustering. It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via power iteration and uses it to cluster vertices.
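A toy version of this idea in plain Python: two tightly linked pairs of vertices joined by one weak edge. A few iterations of v <- W v (W = row-normalized affinities) pull each pair's values together while the pairs stay apart, so thresholding v recovers the clusters. This is a sketch of the intuition, not the full algorithm (which has a principled stopping rule and uses k-means on the vector); the affinity matrix is invented.

```python
# Affinity matrix: vertices {0,1} and {2,3} are strongly similar,
# with one weak 0-2 link; diagonal holds self-similarity.
A = [
    [1.0, 1.0, 0.1, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
n = len(A)
deg = [sum(row) for row in A]          # degrees for row normalization

v = [0.1, 0.2, 0.3, 0.4]               # arbitrary non-uniform start
for _ in range(10):                    # stop early, before v flattens out
    v = [sum(A[i][j] * v[j] for j in range(n)) / deg[i] for i in range(n)]

# Within each cluster the values have nearly equalized, while the weak
# link keeps the two clusters apart; split at the mean.
mean = sum(v) / n
labels = [0 if x < mean else 1 for x in v]
print(labels)
```

Run to convergence the vector would become constant (it is the top eigenvector of a row-stochastic matrix), which is why stopping early to obtain a pseudo-eigenvector is the essential trick.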
Latent Dirichlet allocation (LDA) is a topic model which infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm as follows:
Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.
LDA takes in a collection of documents as vectors of word counts. It learns clustering using expectation-maximization on the likelihood function. After fitting on the documents, LDA provides:
Topics: Inferred topics, each of which is a probability distribution over terms (words).
Topic distributions for documents: For each document in the training set, LDA gives a probability distribution over topics.