Parallel Machine Learning
Janani Chakkaradhari
Information Technology for Business Intelligence
Technische Universität Berlin
February 13, 2014
Abstract
Scalability has been an essential factor for any computational algorithm when considering its performance. In this Big Data era, gathering large amounts of data has become easy, but analyzing Big Data with the existing Machine Learning (ML) algorithms is often infeasible and causes them to perform poorly. This is due to the fact that the computational logic of these algorithms was originally designed in a sequential way. MapReduce [1] has become the solution for handling billions of data records efficiently. In this report we discuss the basic building block for the computation behind ML algorithms, two different attempts to parallelize machine learning algorithms using MapReduce, and the overhead involved in parallelizing ML algorithms.

1 Introduction

The significance of Machine Learning (ML) algorithms is widely known, and their application in various domains brings benefits to business as well as to the research community. Traditional ML algorithms were built on the assumption that the data fits in memory. On the other hand, the current distributed infrastructure of Information Systems (IS) allows the computerized society to easily access and generate data in almost every action of day-to-day life. This perpetual increase of data degrades the performance of ML algorithms that had been proven to produce fast and prominent results on smaller datasets, which in turn becomes the cause of the “curse of modularity” [9].
With the advent of the MapReduce programming model, voluminous data is handled efficiently in parallel, as it follows a divide-and-conquer methodology for execution. “Learning can become limited by computation time and not by data volume with help of MapReduce and large clusters of machines” [8], which implies that ML algorithms have to be redesigned in order to be executed on a parallel architecture.
Thus, parallelizing ML algorithms using the MapReduce model results in increased speed of computation, and earlier works on this topic have demonstrated such performance gains. This report presents a gentle background study on the use of Linear Algebra in ML in section 2, followed by an overview of a novel approach for parallelizing the Stochastic Gradient Descent algorithm for Matrix Factorization [2] in section 3, and a brief summary of declarative ML, an attempt to provide a declarative way of executing some ML algorithms and linear algebra primitives on Hadoop using a system called SystemML [3], in section 4.

2 Computational Engine for Machine Learning

Mathematics and computer science are like the two tracks of a train: they always go together to ensure a good journey for real-world users. Linear algebra has a prominent role in ML. Transforming a problem space into linear functions is one of the elementary approaches used in predictive algorithms, and matrices are used as a means of representing linear functions. In other words, the interaction between two entities of a system can be represented in two-dimensional form known as a matrix. The elements inside the matrix represent the magnitude of those interactions between two finite sets of objects, also known as dyadic data [4]. Analyzing a system using matrix techniques allows one to predict the effect of individual interactions on the overall system. Some of the eminent applications in ML based on linear algebra are listed below; a small numerical illustration follows the list.
• Singular Value Decomposition (SVD) is one of the most famous methods, with applications in image compression, determining oscillations or damage in structures such as bridges during the design phase, and many more.
• Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are used as a feature extraction step before classification.
• Eigenvalues and eigenvectors have proven results in the PageRank algorithm.
• Analyses based on dyads, such as topic modeling, keyword search and recommender systems, are based on the Non-Negative Matrix Factorization technique [6].
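As a small, self-contained illustration of this matrix view (an independent sketch, not taken from the cited papers; the interaction values are made up), a truncated SVD yields a low-rank approximation of a dyadic interaction matrix:

```python
import numpy as np

# A tiny user-by-item interaction matrix (dyadic data); the values are illustrative.
A = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 5.0, 1.0, 0.0],
              [1.0, 0.0, 5.0, 4.0],
              [0.0, 1.0, 4.0, 5.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                    # keep only the two largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))                  # rank-2 approximation of the interactions
```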

3 Large Scale Matrix Factorization with DSGD

This section gives an overview of the Distributed Stochastic Gradient Descent algorithm, together with a brief review of optimizing Matrix Factorization using Stochastic Gradient Descent and a quick introduction to the functional use of Matrix Factorization and Stochastic Gradient Descent.

3.1 Matrix Factorization

Matrix Factorization is mainly used to extract interaction structure from dyadic data
[6]. The interaction structure includes the following [4]
• Co-occurrence
• Strength of preference or the association
• Word clustering, word sense disambiguation and thesaurus construction in text-based information retrieval
• Modeling of preference and consumption behavior
• The dyad in computer vision applications represents the feature observed at a
particular image location.

3.2 Stochastic Gradient Descent (SGD)

Gradient descent has fruitful applications in optimization problems. It is predominantly used to minimize the cost function of ML algorithms such as linear regression, where the weight (parameter) vector is determined by minimizing the average of the sum of squared errors between the predictions and the actual values in the training set [7].
One main drawback of gradient descent is that it requires the entire training data set for computing the average squared error in each step of updating the parameter vector, and it repeats this process until the parameter vector converges. This slows down the algorithm. It is therefore also termed Batch Gradient Descent.
In contrast, Stochastic Gradient Descent takes a single, randomly chosen training example at a time and updates the parameter vector with respect to that example in each step, repeating the process until convergence. This eliminates the need to look at the entire data set in each step; the algorithm scans the training set once per pass and repeats such passes as needed. A minimal sketch contrasting the two update rules is shown below.
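The following minimal sketch (illustrative only; the learning rate, epoch count and variable names are assumptions, not taken from [7]) contrasts the batch and stochastic update rules for linear regression with squared error:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    """One update per pass: the gradient averages over ALL training examples."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)    # requires a full scan of the training set
        w -= lr * grad
    return w

def stochastic_gradient_descent(X, y, lr=0.01, epochs=100, seed=0):
    """One update per example: each step uses a single randomly chosen example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):    # random order over the training set
            grad = (X[i] @ w - y[i]) * X[i]  # gradient of the squared error at one point
            w -= lr * grad
    return w
```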

3.3 Stochastic Gradient Descent for Matrix Factorization

Matrix Factorization helps to reconstruct the original matrix from a partially observed matrix using an approximation technique. For example, in the Netflix recommendation problem [5], the rows represent users and the columns represent movies. The matrix is partially filled with the ratings users have given to movies. By considering the existing rating values, Matrix Factorization tries to find the missing values. In its simplest form, this is done by associating each user and each movie with some numbers (factors) such that the product of these factors is as close as possible to the original rating.
The discrepancy between the original input matrix and the product of the factors is the cost function, and we try to reduce this cost function to obtain the most appropriate factors. One way to do this is to employ the Stochastic Gradient Descent algorithm, which usually produces strong performance in sequential execution. Since the SGD approximation would otherwise end up with noisy values, the cost function here includes regularization and other information along with the prediction error. SGD tries to minimize the sum of all losses over the entire matrix. SGD works as follows [2] (a minimal sketch follows the list):
• Step 1: Take a random entry from the training set
• Step 2: Evaluate the loss function
• Step 3: Update the parameter space
• Step 4: Repeat Steps 1 to 3 for all the entries in the matrix
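The sketch below applies these four steps to a set of observed (user, movie, rating) triples. It is a simplification under assumed choices of rank, learning rate and regularization constant; it follows the general regularized squared-loss formulation of [2] but is not their implementation.

```python
import numpy as np

def sgd_matrix_factorization(ratings, n_users, n_movies, rank=10,
                             lr=0.01, reg=0.05, epochs=20, seed=0):
    """ratings: list of (user, movie, rating) triples, i.e. the observed entries."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((n_users, rank))    # user factors
    H = 0.1 * rng.standard_normal((n_movies, rank))   # movie factors
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):     # Step 1: pick a random entry
            u, m, r = ratings[idx]
            err = r - W[u] @ H[m]                     # Step 2: evaluate the local loss
            grad_w = err * H[m] - reg * W[u]          # Step 3: regularized updates
            grad_h = err * W[u] - reg * H[m]
            W[u] += lr * grad_w
            H[m] += lr * grad_h
    return W, H                                       # Step 4: repeated for all entries
```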
We cannot run this algorithm in parallel using MapReduce naively. The reason is the following: each mapper runs SGD on a subset of the large matrix. It reads the current row and column factors of that subset, evaluates the local loss function and updates the parameters (i.e. the rows and columns) of the corresponding matrix subset. If SGD runs in parallel, the algorithm could be executing at the same time on another subset of the matrix which is dependent (the same column but a different row). This leads the second mapper to read values that are being updated by the first mapper at the same time. This conflict is what prevents the algorithm from running on a parallel architecture as-is.

As described by Gemulla [2], not all the subsets of the matrix are dependent. In most cases the subsets are completely independent of each other, so it is possible to run SGD by locking the rows and columns of a subset. This idea forms the basis for parallelized SGD.

3.4 Distributed SGD for Matrix Factorization (DSGD)

DSGD utilizes the concept of independent rows and columns. Suppose we have d nodes in the cluster; we split the input matrix (the training set of known ratings) into d × d smaller matrices and distribute these blocks over the d nodes such that each node holds the blocks of an entire block-row, as shown in Figure 1.

Figure 1: Example Stratum of 3 Cluster nodes
The interchangeable set of sub-matrices, called a stratum, basically represents a partition of the underlying matrix dataset. In the paper [2], the stratification is performed by permutation, so that d nodes have d! possible independent block combinations. For example, 3 nodes have 6 possible strata, and these 6 strata form a single sequence of strata. The DSGD algorithm works as follows, assuming there are d nodes available, Z is the training-set input matrix, and W and H are the parameter factors of the input matrix:
• Step 1: Divide the input matrix Z into d × d blocks and distribute them over the cluster. The parameters W and H are equally distributed into d blocks over the rows and columns, respectively, so that W has d × 1 and H has 1 × d block dimensions. Compute the strata sequence for the input blocks using permutations. For each stratum in the sequence, do Steps 2 and 3.
• Step 2: Select a stratum whose blocks are independent, for example the blocks along the diagonal (the red boxes in Figure 1), from the sequence of strata (all possible combinations of strata).
• Step 3: Run SGD on the selected blocks in parallel to find the local minimum of the loss function. Sum up the local losses computed at each block and update the corresponding factor matrices W and H.
This is how DSGD runs the SGD algorithm in a distributed manner within a stratum. DSGD outperforms the ALS (Alternating Least Squares) method for matrix factorization [2], since DSGD avoids averaging over loss functions when executed in parallel, which makes the algorithm simpler and more versatile. A minimal sketch of the stratum-scheduling idea is given below.
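The Python sketch below illustrates only the stratum-scheduling idea described above. It is not the implementation from [2]: the helper `sgd_on_block` is hypothetical, and enumerating all d! strata per pass follows the simplified description in this report rather than the exact schedule of the original paper.

```python
from itertools import permutations

def strata(d):
    """Each stratum assigns node i the block in column perm[i]; blocks within one
    stratum share no rows and no columns, so their SGD updates cannot conflict."""
    return [list(enumerate(perm)) for perm in permutations(range(d))]

def dsgd_epoch(Z_blocks, W_blocks, H_blocks, sgd_on_block):
    """One pass over the sequence of strata. Within a stratum, the per-block SGD
    calls are independent and would run on different nodes in parallel; here they
    run sequentially for illustration. sgd_on_block updates W_blocks[i] and
    H_blocks[j] in place and returns the local loss of block (i, j)."""
    d = len(W_blocks)
    total_loss = 0.0
    for stratum in strata(d):
        for i, j in stratum:                 # independent blocks of this stratum
            total_loss += sgd_on_block(Z_blocks[i][j], W_blocks[i], H_blocks[j])
    return total_loss
```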

4 Declarative Machine Learning: SystemML

The overhead in parallelizing ML algorithms can be easily understood from the simple SGD algorithm discussed in the previous section. This makes a very clear argument that researchers have to carefully analyze each sequentially powerful ML algorithm in order to make it parallel and executable in the MapReduce programming model. The cost of implementing these algorithms as MapReduce jobs is high, and for better performance the same algorithm sometimes has to be hand-tuned, so there is little room for systematic optimization of the MapReduce jobs. For example, in the matrix multiplication problem, the order of execution of the multiplications has a large performance impact [3]. Researchers from the IBM Almaden and Watson research centers have proposed a new approach for handling the parallelization of ML algorithms that also takes optimization into account, called SystemML.
SystemML is analogous to HiveQL, developed by Facebook for executing data-warehouse queries on large clusters, where the queries are converted into MapReduce jobs that are executed on Hadoop by the HiveQL engine. Similarly, SystemML provides a declarative platform for expressing ML algorithms and linear algebra primitives and converts this abstract representation into executable MapReduce jobs on Hadoop.

4.1 Application areas of SystemML

In SystemML, ML algorithms are expressed in a high-level language called DML (Declarative Machine learning Language), which is comparable to R. DML supports operations such as matrix transpose, matrix multiplication, and iterative algorithms using “for” and “while” constructs, and so on. This lets the user focus on writing scripts that answer what to compute rather than how to express the computation. SystemML is highly scalable and tunes performance efficiently. It is used in different fields such as predictive modeling, recommender systems, and search analysis.

4.2 System Architecture of SystemML

SystemML takes a DML script as input, passes it through different components [3], and produces a parsed representation of the initial script. It supports built-in data types for representing matrices and scalars. The first step in SystemML is identifying the statement blocks, based on the constructs that break the sequential flow of the DML program; a rough sketch of this splitting is given below. For each statement block, SystemML then performs the HOP and LOP analyses described in the following subsections.
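As a purely illustrative sketch of this first step (the splitting rule and the DML-like script lines are assumptions, not SystemML's actual parser), consecutive simple statements can be grouped until a control-flow construct breaks the sequential flow:

```python
def split_statement_blocks(script_lines):
    """Group consecutive simple statements into blocks; control-flow constructs
    ('for', 'while', 'if') break the sequential flow and form their own blocks."""
    blocks, current = [], []
    for line in script_lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith(("for", "while", "if")):
            if current:
                blocks.append(current)
                current = []
            blocks.append([line])          # the construct itself marks a block boundary
        else:
            current.append(line)
    if current:
        blocks.append(current)
    return blocks

script = ["A = read('in.mtx')", "B = t(A) %*% A",
          "for (i in 1:10) { ... }", "write(B, 'out.mtx')"]
print(split_statement_blocks(script))      # three blocks, split at the 'for' construct
```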

4.3 High level Operator (HOP)

The HOP component consumes and produces the following:
Input: Parsed statement blocks
Action: The computation in each statement block instantiates one HOP Dag (Directed Acyclic Graph). A HOP Dag represents the basic operations on matrices and scalars, such as an operation or a transformation.
Optimizations: Algebraic rewrites, selection of the physical representation for intermediate matrices, and cost-based optimizations
Output: High-level execution plan (HOP Dags) representing the dataflow

4.4 Low level Operator (LOP)

LOP component analysis follows HOP analysis, and the corresponding input and output are as follows:
Input: High-level execution plan (HOP Dags)
Action: HOP Dags are converted into low-level physical plans (LOP Dags) that can be executed as MapReduce jobs. HOP Dags are parsed from bottom to top, and each HOP Dag is converted into one or more LOP Dags. The input and output format of each LOP is key-value pairs. Since a single computation leads to multiple LOPs, SystemML tries to combine these LOPs into a single MapReduce job. This is implemented using a novel algorithm named piggybacking, which reduces the number of scans performed on the input data during the execution of MR jobs. This is described in section 4.6.
Output: Low-level execution plan (LOP Dags)

4.5 Runtime

The runtime makes sure that the input matrices are represented as key-value pairs by disregarding the cells without a value, which reduces the size of the input matrix representation, since the matrices are inherently sparse. SystemML collects local sparsity information by applying a blocking operation to the input matrix: the input matrix is divided into smaller matrices called blocks, and each block is represented by a block id and its cell values, along with a parameter indicating whether the block is dense or sparse. The block size has a major impact on the number of key-value pairs generated by the runtime [3]. A small sketch of this blocked representation is given below.
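The sketch below shows one way such a blocked key-value representation could look; the block size, the sparsity threshold and the payload layout are assumptions for illustration, not SystemML's actual format.

```python
import numpy as np

def to_blocks(matrix, block_size):
    """Convert a matrix into (block_id, payload) key-value pairs. Each payload
    records whether the block is dense or sparse; sparse blocks keep only the
    coordinates and values of their non-zero cells."""
    pairs = []
    n_rows, n_cols = matrix.shape
    for r in range(0, n_rows, block_size):
        for c in range(0, n_cols, block_size):
            block = matrix[r:r + block_size, c:c + block_size]
            if np.count_nonzero(block) < block.size / 2:   # assumed sparsity threshold
                cells = {(int(i), int(j)): float(block[i, j])
                         for i, j in zip(*np.nonzero(block))}
                payload = ("sparse", cells)
            else:
                payload = ("dense", block.copy())
            pairs.append(((r // block_size, c // block_size), payload))
    return pairs
```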
The Generic MapReduce job (G-MR) is the main execution engine in SystemML; it is instantiated by the piggybacking algorithm (multiple LOPs inside a single MR job). The Control Module coordinates the execution of the MapReduce jobs and is involved in computations such as arithmetic operations, predicate evaluations and so on. Multiple optimizations are performed in the runtime component, decided dynamically based on data characteristics.

4.6 Piggybacking

This algorithm packages multiple LOPs into a single MapReduce job by considering the execution location of each LOP at runtime. The execution location identifies whether a LOP operation can be executed in Map or Reduce alone, or whether it requires both Map and Reduce for complete execution. Figure 2 shows the list of different LOP operations and their corresponding execution locations. For example, the group LOP has to be executed in both the Map and the Reduce phase, so it is marked as MapAndReduce.
We consider the example in Figure 3 to lay out the logic behind the piggybacking algorithm. The left part of the diagram represents the LOP Dag for the multiplication of matrix W with its transpose. LOP Dags are parsed in a bottom-up fashion. The algorithm starts by sorting the LOP operations in topological order; the result of this sort is shown in the center of the diagram. The algorithm works iteratively, creating a new MR job at the beginning of each iteration. The order of assigning each LOP to the MR job is as follows: it first assigns the LOPs that require only a Map phase (indicated by the Map or Reduce location in Figure 2), then the LOPs that need both Map and Reduce phases, and finally the LOPs that require only a Reduce phase. The algorithm makes sure that another descendant LOP with execution location MapAndReduce is not assigned to the same job.

Figure 2: Execution locations of LOP from [3]

Figure 3: Example Piggybacking
In our example, since the Data W and Transform LOPs span only a Map or Reduce operation, they are assigned to the Map phase of the first MR job. mmcj is the first LOP that spans both Map and Reduce phases, so it is assigned to both phases of the first MR job. Since the first MR job already has a LOP with location MapAndReduce, the Group LOP, which has the same execution location, cannot be assigned to the first MR job. Hence the iteration ends, and the next iteration starts by instantiating a second MR job. Finally, the Group and Aggregation operations are assigned to this second MR job, which completes the piggybacking algorithm in this example. A minimal sketch of this greedy packing is shown below.
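The sketch below captures the greedy packing described in this walkthrough. The LOP names, location labels and dependency map are written out by hand for this example; this is not the SystemML implementation.

```python
def piggyback(lops, deps):
    """Greedily pack topologically sorted LOPs into MapReduce jobs: a job may hold
    many Map-only LOPs but at most one LOP with location 'map_and_reduce', and a
    LOP is assigned only once all of its predecessors have been assigned."""
    jobs, done = [], set()
    remaining = list(lops)
    while remaining:
        job, has_mr = [], False
        for name, loc in list(remaining):
            assigned_names = done | {n for n, _ in job}
            if not deps.get(name, set()) <= assigned_names:
                continue                     # some predecessor is still unassigned
            if loc == "map_and_reduce":
                if has_mr:
                    continue                 # MapAndReduce slot of this job is taken
                has_mr = True
            job.append((name, loc))
            remaining.remove((name, loc))
        if not job:
            raise ValueError("unsatisfiable LOP dependencies")
        done |= {n for n, _ in job}
        jobs.append(job)
    return jobs

lops = [("data W", "map_or_reduce"), ("transform", "map_or_reduce"),
        ("mmcj", "map_and_reduce"), ("group", "map_and_reduce"),
        ("aggregate", "reduce")]
deps = {"transform": {"data W"}, "mmcj": {"data W", "transform"},
        "group": {"mmcj"}, "aggregate": {"group"}}
print(piggyback(lops, deps))                 # two jobs, matching the walkthrough above
```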

5 Conclusion

In this report we have seen the requirements for and the importance of research work on the parallelization of ML algorithms, and the role of Linear Algebra, a branch of mathematics, in ML algorithms. The difficulty of parallelizing ML algorithms was illustrated through the novel approach employed by the DSGD algorithm, an effort to parallelize SGD on large clusters. We also discussed SystemML, which provides users in different fields with an easier, declarative platform for executing ML algorithms.
Even though SystemML is concise and provides a user-friendly platform for executing limited forms of ML algorithms and some linear algebra primitives such as matrix multiplication, arithmetic operations and matrix factorization, DML does not support the more complex features of the object-oriented paradigm. It also does not support data structures such as arrays and lists, which are frequently used in most ML algorithms, whereas R provides a comprehensive set of flexible constructs for statistical and ML algorithms. On the other hand, Apache Mahout also provides a complete set of Hadoop-based ML algorithm packages, but it still needs to be hand-tuned for different data sets and is more complex from the user's perspective.

References
[1] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on
large clusters. Communications of the ACM, 51(1):107–113, 2008.
[2] Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. Large-scale
matrix factorization with distributed stochastic gradient descent. In Proceedings
of the 17th ACM SIGKDD international conference on Knowledge discovery and
data mining, pages 69–77. ACM, 2011.
[3] Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivakumar
Vaithyanathan. Systemml: Declarative machine learning on mapreduce. In Data
Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 231–
242. IEEE, 2011.
[4] Thomas Hofmann, Jan Puzicha, and Michael I Jordan. Learning from dyadic data.
Advances in neural information processing systems, pages 466–472, 1999.
[5] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques
for recommender systems. Computer, 42(8):30–37, 2009.
[6] Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, and Yi-Min Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on
mapreduce. In Proceedings of the 19th international conference on World wide
web, pages 681–690. ACM, 2010.
[7] Andrew Ng. CS229 lecture notes. CS229 Lecture Notes, 1(1):1–3, 2000.
[8] Vijay Narayanan and Milind Bhandarkar. Modeling with Hadoop. Tutorial at KDD 2011, 2011.
[9] Charles Parker. Unexpected challenges in large scale machine learning. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, pages 1–6. ACM, 2012.

8

Weitere ähnliche Inhalte

Was ist angesagt?

Image segmentation by modified map ml estimations
Image segmentation by modified map ml estimationsImage segmentation by modified map ml estimations
Image segmentation by modified map ml estimationsijesajournal
 
Adapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model SelectionAdapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model SelectionIJECEIAES
 
facility layout paper
 facility layout paper facility layout paper
facility layout paperSaurabh Tiwary
 
Architecture neural network deep optimizing based on self organizing feature ...
Architecture neural network deep optimizing based on self organizing feature ...Architecture neural network deep optimizing based on self organizing feature ...
Architecture neural network deep optimizing based on self organizing feature ...journalBEEI
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET Journal
 
Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...ijcsit
 
Probabilistic model based image segmentation
Probabilistic model based image segmentationProbabilistic model based image segmentation
Probabilistic model based image segmentationijma
 
Detection of leaf diseases and classification using digital image processing
Detection of leaf diseases and classification using digital image processingDetection of leaf diseases and classification using digital image processing
Detection of leaf diseases and classification using digital image processingNaeem Shehzad
 
Co-Simulation Interfacing Capabilities in Device-Level Power Electronic Circu...
Co-Simulation Interfacing Capabilities in Device-Level Power Electronic Circu...Co-Simulation Interfacing Capabilities in Device-Level Power Electronic Circu...
Co-Simulation Interfacing Capabilities in Device-Level Power Electronic Circu...IJPEDS-IAES
 
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...cscpconf
 
Fault detection based on novel fuzzy modelling
Fault detection based on novel fuzzy modelling Fault detection based on novel fuzzy modelling
Fault detection based on novel fuzzy modelling csijjournal
 
llorma_jmlr copy
llorma_jmlr copyllorma_jmlr copy
llorma_jmlr copyGuy Lebanon
 

Was ist angesagt? (18)

N41049093
N41049093N41049093
N41049093
 
Image segmentation by modified map ml estimations
Image segmentation by modified map ml estimationsImage segmentation by modified map ml estimations
Image segmentation by modified map ml estimations
 
Adapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model SelectionAdapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model Selection
 
facility layout paper
 facility layout paper facility layout paper
facility layout paper
 
Architecture neural network deep optimizing based on self organizing feature ...
Architecture neural network deep optimizing based on self organizing feature ...Architecture neural network deep optimizing based on self organizing feature ...
Architecture neural network deep optimizing based on self organizing feature ...
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...
 
D010332630
D010332630D010332630
D010332630
 
Probabilistic model based image segmentation
Probabilistic model based image segmentationProbabilistic model based image segmentation
Probabilistic model based image segmentation
 
Detection of leaf diseases and classification using digital image processing
Detection of leaf diseases and classification using digital image processingDetection of leaf diseases and classification using digital image processing
Detection of leaf diseases and classification using digital image processing
 
Data reduction
Data reductionData reduction
Data reduction
 
Co-Simulation Interfacing Capabilities in Device-Level Power Electronic Circu...
Co-Simulation Interfacing Capabilities in Device-Level Power Electronic Circu...Co-Simulation Interfacing Capabilities in Device-Level Power Electronic Circu...
Co-Simulation Interfacing Capabilities in Device-Level Power Electronic Circu...
 
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
 
Fault detection based on novel fuzzy modelling
Fault detection based on novel fuzzy modelling Fault detection based on novel fuzzy modelling
Fault detection based on novel fuzzy modelling
 
llorma_jmlr copy
llorma_jmlr copyllorma_jmlr copy
llorma_jmlr copy
 
ssc_icml13
ssc_icml13ssc_icml13
ssc_icml13
 
lcr
lcrlcr
lcr
 
Plant Layout Algorithm
Plant Layout AlgorithmPlant Layout Algorithm
Plant Layout Algorithm
 

Ähnlich wie Parallel Machine Learning

mapReduce for machine learning
mapReduce for machine learning mapReduce for machine learning
mapReduce for machine learning Pranya Prabhakar
 
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningA Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningVenkata Karthik Gullapalli
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4Exploring optimizations for dynamic PageRank algorithm based on GPU : V4
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4Subhajit Sahu
 
Implementing Merge Sort
Implementing Merge SortImplementing Merge Sort
Implementing Merge Sortsmita gupta
 
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...csandit
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesFarzad Nozarian
 
Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspectiveপল্লব রায়
 
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEM
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEMGRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEM
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEMIJCSEA Journal
 
Comparative study of optimization algorithms on convolutional network for aut...
Comparative study of optimization algorithms on convolutional network for aut...Comparative study of optimization algorithms on convolutional network for aut...
Comparative study of optimization algorithms on convolutional network for aut...IJECEIAES
 
Operation's research models
Operation's research modelsOperation's research models
Operation's research modelsAbhinav Kp
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsIJDKP
 
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTION
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTIONDECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTION
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTIONcscpconf
 
BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...
BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...
BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...IJCI JOURNAL
 
A fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming dataA fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming dataAlexander Decker
 
PAGE: A Partition Aware Engine for Parallel Graph Computation
PAGE: A Partition Aware Engine for Parallel Graph ComputationPAGE: A Partition Aware Engine for Parallel Graph Computation
PAGE: A Partition Aware Engine for Parallel Graph Computation1crore projects
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmIJMIT JOURNAL
 
Decision Tree Clustering : A Columnstores Tuple Reconstruction
Decision Tree Clustering : A Columnstores Tuple ReconstructionDecision Tree Clustering : A Columnstores Tuple Reconstruction
Decision Tree Clustering : A Columnstores Tuple Reconstructioncsandit
 

Ähnlich wie Parallel Machine Learning (20)

mapReduce for machine learning
mapReduce for machine learning mapReduce for machine learning
mapReduce for machine learning
 
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningA Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4Exploring optimizations for dynamic PageRank algorithm based on GPU : V4
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4
 
Implementing Merge Sort
Implementing Merge SortImplementing Merge Sort
Implementing Merge Sort
 
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
 
Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspective
 
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEM
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEMGRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEM
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEM
 
Comparative study of optimization algorithms on convolutional network for aut...
Comparative study of optimization algorithms on convolutional network for aut...Comparative study of optimization algorithms on convolutional network for aut...
Comparative study of optimization algorithms on convolutional network for aut...
 
Operation's research models
Operation's research modelsOperation's research models
Operation's research models
 
Aggreagate awareness
Aggreagate awarenessAggreagate awareness
Aggreagate awareness
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTION
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTIONDECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTION
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTION
 
BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...
BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...
BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...
 
A fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming dataA fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming data
 
PAGE: A Partition Aware Engine for Parallel Graph Computation
PAGE: A Partition Aware Engine for Parallel Graph ComputationPAGE: A Partition Aware Engine for Parallel Graph Computation
PAGE: A Partition Aware Engine for Parallel Graph Computation
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
 
Y34147151
Y34147151Y34147151
Y34147151
 
Decision Tree Clustering : A Columnstores Tuple Reconstruction
Decision Tree Clustering : A Columnstores Tuple ReconstructionDecision Tree Clustering : A Columnstores Tuple Reconstruction
Decision Tree Clustering : A Columnstores Tuple Reconstruction
 

Kürzlich hochgeladen

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Kürzlich hochgeladen (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Parallel Machine Learning

  • 1. Parallel Machine Learning Janani Chakkaradhari Information Technology for Business Intelligence Technische Universit¨ t Berlin a February 13, 2014 Abstract Scalability has been an essential factor for any kind of computational algorithm while considering its performance. In this Big Data era, gathering of large amounts of data is becoming easy. Data analysis on Big Data is not feasible using the existing Machine Learning (ML) algorithms and it perceives them to perform poorly. This is due to the fact that the computational logic for these algorithms is previously designed in sequential way. MapReduce [1] becomes the solution for handling billions of data efficiently. In this report we discuss the basic building block for the computation behind ML algorithms, two different attempts to parallelize machine learning algorithms using MapReduce and a brief description on the overhead in parallelization of ML algorithms. 1 Introduction The significance of Machine Learning algorithms are widely known and its acquaintance in various applications brings in much more benefits in business as well as in research community. In traditional ML algorithms, the computational methods were built by thinking the data fits in memory. On the other hand, the current distributed infrastructure of Information Systems (IS) facilitates the computerized society to easily access and also generate data in almost every action involved in their day to-day life. This perpetual increase of data leads to degrade in performance of ML algorithms which had been proved to produce fast and prominent results with smaller datasets which in turn becomes the cause for “curse of modularity” [9]. With the advent of MapReduce programming model, data voluminous is handled efficiently in parallel as it follows divide and conquer methodology for execution. “Learning can become limited by computation time and not by data volume with help of MapReduce and large clusters of machines” [8] and this imposes the fact that ML algorithms has to be re-modified in order to be executed in parallel architecture. Thus parallelization of ML algorithms using MapReduce model would results in increase in speed of computation. Earlier works on this topic had been proved to produce increased performance. This report presents a gentle background study on the exploitation of Linear Algebra in ML in section 2, followed by an overview of one of the novel approach for parallelization of Stochastic Gradient Descent algorithm for Matrix Factorization [2] in section 3, and a brief summary on declarative ML which is an attempt to provide a declarative way of executing some of the ML algorithms and linear algebra primitives on Hadoop using a system called SystemML [3] in section 4. 1
  • 2. 2 Computational Engine for Machine Learning Mathematics and computer science are like the tracks of a train, they always go together to make sure a good journey for real world users. Linear algebra has prominent role in ML. Transforming problem space into linear functions is one of the elementary approaches used in predictive algorithms. Matrices are used as means of representing linear functions. In other words, the interaction between two entities of a system can be represented in two dimensional form known as matrix. The elements inside the matrix represents the magnitude of those interactions between two finite set of objects also known as dyadic data [4]. Analysis of the system using matrix technique allows one to predict the effect of individual interactions on the overall system. Some of the eminent applications in ML based on linear algebra are listed below, • Singular Value Decomposition (SVD) is one of famous method for its applications in image compression, determining oscillations or damages in structures like bridge during the design phase and many more. • Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are used as a feature extraction step before classification. • Eigen value and Eigen vectors has its proven results in PageRank algorithm. • Analysis based on dyads such as topic modeling, keyword search and recommender systems are based on Non Negative Matrix Factorization technique [6]. 3 Large Scale Matrix Factorization with DSGD In this section, an overview of Distributed Stochastic Gradient Descent algorithm is described with a brief review on optimization of Matrix Factorization using Stochastic Gradient Descent and a quick introduction to functional usage of Matrix Factorization and Stochastic Gradient Descent. 3.1 Matrix Factorization Matrix Factorization is mainly used to extract interaction structure from dyadic data [6]. The interaction structure includes the following [4] • Co-occurrence • Strength of preference or the association • Word clustering, word sense disambiguation and thesaurus construction in text based information retrieval • Modeling of preference and consumption behavior • The dyad in computer vision applications represents the feature observed at a particular image location. 2
  • 3. 3.2 Stochastic Gradient Descent (SGD) Gradient descent has fruitful applications in optimization problems. It predominantly helps in minimizing the cost function of ML algorithms such as linear regression where the weight vector or the parameter vector is determined by minimizing the average of sum of square errors between the predictions minus the actual values in the training set [7]. One main drawback of gradient descent is that it requires all the training data set for computing the average square error in each step of updating parameter vector and repeats this process until the parameter vector converges. This slows down the speed of algorithm. It is also termed as Batch Gradient Descent. In contrast, Stochastic Gradient Descent takes single training data at a time randomly and updates the parameter vector with respect to that training data in each step and repeats the process until it converges. So this eliminates the need to look at the entire data set in each step and scans the entire training set for repetition of the algorithm. 3.3 Stochastic Gradient Descent for Matrix Factorization Matrix Factorization helps to reconstruct the original matrix from the partially observed matrix using some approximation technique. For example in the Netflix matrix problem of recommendation [5], the rows represent the user and columns represents the movie. The matrix is partially filled with user ratings given to the movies. By considering the existing rating values, Matrix Factorization tries to find the missing values. In simplest form, this can be done by associating each user and each movie some numbers (factors) such that the product of these two numbers would be close as possible as the original rating. The discrepancies between the original input matrix and product of the factors here is the cost function. We would try to reduce this cost function to get the most appropriate factors. One way to do this, is by employing Stochastic Gradient Descent algorithm and SGD usually produces greater performance results in sequential execution. Since SGD approximation would end up with noisy values the cost function in here includes regularization and other informations along with prediction error. SGD tries to minimize sum of all losses in the entire matrix. SGD works as follows [2], • Step 1: Takes a random entry from the training set • Step 2: Evaluate loss function • Step 3: Update parameter spaces • Step 4: Repeat Step 1 to 3 for all the entries in the matrix We can not run this algorithm in parallel using MapReduce. The reason is the following, each mapper runs SGD on the subsets of large matrix. It reads current row and current column of the subset, evaluates local loss function and updates the parameters (i.e. the rows and columns) of the corresponding matrix subset. As we considered SGD runs in parallel, it could be possible for the algorithm to be executed on another subset of the matrix which is dependent (the same column but different row). This deliberately leads the second mapper to read the values that are updated by the first mapper at the same time. So this makes the algorithm not to run in parallel architecture. 3
  • 4. As described by Gemulla [2], not all the subsets are dependents in the matrix. In Most of the cases the subsets are completely independent to each other so that it could be possible to run SGD by locking the rows and columns of that subset. This idea forms the basis for parallelized SGD. 3.4 Distributed SGD for Matrix Factorization (DSGD) DSGD utilizes the concept of independent rows and columns. Suppose if we have d number of nodes in the cluster, we split the input matrix (the training set of known ratings) into d ¢ d smaller matrices and distribute the smaller matrix into the d blocks such that the each node has the blocks of entire row as shown in the Figure 1. Figure 1: Example Stratum of 3 Cluster nodes The interchangeable sub matrices is called stratum basically represents a partition of the underlying matrix dataset. In the paper [2], the stratification is performed by permutation such that d nodes has the possible independent block combinationsd!. For example 3 nodes have 6 possible stratums and this 6 stratums forms a single sequence of stratra. The DSGD algorithm works as follows, Assuming there are d nodes available, Z is training set input matrix, W and H are the parameter factors of the input matrix. • Step 1: Divide the input matrix to Z into dd and distribute it over the clusters. H and W parameters are equally distributed on d blocks on rows and columns such that W with d ¢ 1 and H with 1 ¢ d dimensions. Compute the strata sequence for the input blocks using permutations. For each stratum in the strata, do step 2 and step 3 • Step 2: Select a stratum that are independent, for example the blocks along the diagonal the red boxes as shown in the figure from the sequence of strata (all possible combinations of stratum). • Step 3: Run SGD on the selected blocks in parallel to find the local minimum for loss function. Sum up the results of local losses computed at each block and update the corresponding factor matrices W and H This is how DSGD runs SGD algorithm in a distributed manner within a stratum. DSGD outperforms ALS (Alternating Least Squares) method for matrix factorization [2]. Since DSGD avoid averaging over loss functions when executed in parallel which makes the algorithm simpler and versatile 4
  • 5. 4 Declarative Machine Learning: SystemML The overhead in parallelizing ML algorithms can be easily understood by simple SGD algorithm as we discussed in previous section. This makes a very clear argument that the researchers have to carefully analyze each sequentially powerful ML algorithm to make it parallel and to be executed in MapReduce programming model. The cost of implementing as MapReduce jobs is high and also for better performance sometimes the same algorithm has to be hand tuned. Hence there is no space for the discussion of optimization in MapReduce jobs. For example in case of matrix multiplication problem, the order execution of multiplication has higher performance impact [3]. Researchers from IBM Almaden and Watson research center has proposed a new approach for handling parallelization of ML algorithms which also considers optimization into account and it is called SystemML. SystemML is analogous to HiveQL developed by Facebook for executing data warehouse queries on large clusters where the queries are converted to MapReduce jobs which will be executed on Hadoop by the HiveQL engine. Similarly SystemML provides a declarative platform for expressing ML algorithms and linear algebra primitives and converts the abstract representation into executable MapReduce jobs on Hadoop. 4.1 Application areas of SystemML In SystemML, ML algorithms are expressed in High Level Language called Declarative Machine Learning (DML) which is comparable to R. DML supports operations such as transpose of a matrix, matrix multiplication, iterative algorithms using “for” and “while” constructs and soon. So this makes user to focus on writing scripts that answers to what constructs to use for computation rather than how to express computation. SystemML is highly scalable and efficiently tunes the performance. It is used in different fields such as predictive modeling, recommender systems, and search analysis. 4.2 System Architecture of SystemML SystemML takes the DML script as input and passes through the different components [3] and results in parsed representation of the initial script. It supports built in data types for representing matrices and scalars. The first step in SystemML is Identifying the statement blocks based on the constructs that breaks the sequential flow of DML program. For each statement block it does the following, 4.3 High level Operator (HOP) HOP component analysis consumes and results in the following input and output. Input: Parsed statement blocks Action: The computation in each statement block instantiates one HOP Dag (Directed Acyclic Graph). HOP Dag represents the basic operations on Matrices and scalar such as an operation or transformation. Optimizations: Algebraic rewrites, selection of physical representation for intermediate matrices and cost based optimizations Output: High level execution plan (HOP Dags) representing dataflow 5
4.4 Low level Operator (LOP)

The LOP component follows the HOP component, with the corresponding input and output:

Input: High-level execution plan (HOP Dags)
Action: HOP Dags are converted into low-level physical plans (LOP Dags) that can be executed as MapReduce jobs. HOP Dags are parsed from bottom to top, and each HOP Dag is converted into one or more LOPs. The input and output format of each LOP is key-value pairs. Since a single computation can lead to multiple LOPs, SystemML tries to combine these LOPs into a single MapReduce job. This is implemented using a novel algorithm named piggybacking, which reduces the number of scans performed on the input data during the execution of MR jobs; it is described in Section 4.6.
Output: Low-level execution plan (LOP Dags)

4.5 Runtime

The runtime ensures that the input matrices are represented as key-value pairs while disregarding cells without a value; since the matrices are inherently sparse, this reduces the size of the input matrix representation. SystemML collects local sparsity information by applying a blocking operation to the input matrix: the matrix is divided into smaller matrices called blocks, and each block is represented by a block id and its cell values, together with a parameter indicating whether the block is dense or sparse. The block size has a major impact on the number of key-value pairs generated by the runtime [3]. A minimal sketch of this blocked representation follows this section.

The Generic MapReduce job (GM-R), instantiated by the piggybacking algorithm (multiple LOPs inside a single MR job), is the main execution engine in SystemML. A control module coordinates the execution of the MapReduce jobs and is involved in computations such as arithmetic operations and predicate evaluations. Multiple optimizations are performed in the runtime component and are decided dynamically based on data characteristics.
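The following is a minimal Python sketch of the blocked key-value representation described above: the matrix is cut into fixed-size blocks, empty blocks are dropped, and each remaining block is keyed by its block index together with a dense/sparse flag. The block size and the density threshold of 0.3 are illustrative assumptions, not SystemML's actual defaults.

# Sketch of emitting (block_id, block) key-value pairs with a per-block
# dense/sparse decision; empty blocks are never emitted.
import numpy as np

def to_blocks(M, block_size):
    """Yield ((block_row, block_col), (format, payload)) for non-empty blocks of M."""
    n, m = M.shape
    for bi in range(0, n, block_size):
        for bj in range(0, m, block_size):
            block = M[bi:bi + block_size, bj:bj + block_size]
            nnz = np.count_nonzero(block)
            if nnz == 0:
                continue                          # cells without values are dropped
            dense = nnz / block.size > 0.3        # illustrative density threshold
            key = (bi // block_size, bj // block_size)
            if dense:
                payload = block                   # store the full block
            else:
                rows, cols = np.nonzero(block)    # store only (i, j, value) triples
                payload = list(zip(rows.tolist(), cols.tolist(),
                                   block[rows, cols].tolist()))
            yield key, ("dense" if dense else "sparse", payload)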
4.6 Piggybacking

The piggybacking algorithm packages multiple LOPs into a single MapReduce job by considering the execution location of each LOP at runtime. The execution location identifies whether a LOP can be executed in the Map phase alone, in the Reduce phase alone, or whether it requires both the Map and Reduce phases. Figure 2 lists the different LOP operations and their corresponding execution locations. For example, the group LOP has to be executed in both the Map and Reduce phases and is therefore marked as MapAndReduce.

Figure 2: Execution locations of LOPs, from [3]

We consider the example in Figure 3 to lay out the logic behind the piggybacking algorithm. The left part of the diagram shows the LOP Dag for the multiplication of a matrix W with its transpose. LOP Dags are parsed in bottom-up fashion. The algorithm starts by sorting the LOPs in topological order; the result of the sort is shown in the center of the diagram. It then works iteratively, creating a new MR job at the beginning of each iteration. Within an iteration, LOPs are assigned to the MR job in the following order: first the LOPs that require only a Map phase (indicated by the Map or Reduce location in Figure 2), then the LOPs that need both Map and Reduce phases, and finally the LOPs that require only a Reduce phase. The algorithm also ensures that another descendant with execution location MapAndReduce is not assigned to the same job.

Figure 3: Example of piggybacking

In our example, since the Data W and Transform LOPs span only a Map or Reduce operation, they are assigned to the Map phase of the first MR job. mmcj is the first LOP that spans both the Map and Reduce phases, so it is assigned to both phases of the first MR job. Since the first MR job already contains a LOP with location MapAndReduce, the Group LOP, which has the same execution location, cannot be assigned to the first MR job. Hence the iteration ends, and the next iteration starts by instantiating a second MR job. Finally, the Group and Aggregation operations are assigned to this second MR job, which completes the piggybacking algorithm for this example.
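A simplified Python sketch of this packing idea is shown below: the topologically sorted LOPs are walked once, and each LOP joins the current MR job, except that a LOP needing both Map and Reduce phases cannot join a job that already contains one, in which case the job is closed and a new one is opened. This omits the phase-by-phase assignment order and the full descendant checks of the published algorithm [3]; the LOP names and locations follow the t(W) %*% W example above.

# Greedy packing of topologically sorted LOPs into MR jobs, keyed on whether
# a LOP needs a shuffle (MapAndReduce). A job can hold at most one such LOP.
def piggyback(sorted_lops):
    """sorted_lops: list of (name, location) in topological order,
    location in {"MapOrReduce", "MapAndReduce", "Reduce"}."""
    jobs, current, has_shuffle = [], [], False
    for name, location in sorted_lops:
        needs_shuffle = location == "MapAndReduce"
        if needs_shuffle and has_shuffle:
            # A second MapAndReduce LOP cannot share this job: close it, open a new one.
            jobs.append(current)
            current, has_shuffle = [], False
        current.append(name)
        has_shuffle = has_shuffle or needs_shuffle
    if current:
        jobs.append(current)
    return jobs

lops = [("data W", "MapOrReduce"), ("transform", "MapOrReduce"),
        ("mmcj", "MapAndReduce"), ("group", "MapAndReduce"),
        ("aggregate", "Reduce")]
print(piggyback(lops))
# -> [['data W', 'transform', 'mmcj'], ['group', 'aggregate']]

Running the sketch on the example reproduces the two MR jobs described above: Data W, Transform and mmcj share the first job, while Group and Aggregation end up in the second.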
5 Conclusion

In this report we have seen the requirements for, and the importance of, research on the parallelization of ML algorithms, as well as the role that linear algebra plays in them. The difficulty of parallelizing ML algorithms was illustrated through the novel approach employed by the DSGD algorithm, an effort to parallelize SGD for large-scale data on clusters of machines. We also discussed SystemML, which offers users from different fields an easier, declarative platform for executing ML algorithms.

Even though SystemML is concise and provides a user-friendly platform for executing limited forms of ML algorithms and some linear algebra primitives such as matrix multiplication, arithmetic operations, and matrix factorization, DML does not support the more complex features of the object-oriented paradigm. It also does not support data structures such as arrays and lists, which are frequently used in many ML algorithms and are available in R, a language that provides a comprehensive set of flexible constructs for statistical and ML algorithms. On the other hand, Apache Mahout also provides a fairly complete set of Hadoop-based ML algorithms, but it still needs to be hand-tuned for different datasets and is more complex from the user's perspective.

References

[1] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[2] Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69–77. ACM, 2011.

[3] Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivakumar Vaithyanathan. SystemML: Declarative machine learning on MapReduce. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 231–242. IEEE, 2011.

[4] Thomas Hofmann, Jan Puzicha, and Michael I. Jordan. Learning from dyadic data. Advances in Neural Information Processing Systems, pages 466–472, 1999.

[5] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.

[6] Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, and Yi-Min Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In Proceedings of the 19th International Conference on World Wide Web, pages 681–690. ACM, 2010.

[7] Andrew Ng. CS229 lecture notes. CS229 Lecture Notes, 1(1):1–3, 2000.

[8] Vijay Narayanan and Milind Bhandarkar. Modeling with Hadoop. Tutorial at KDD 2011.

[9] Charles Parker. Unexpected challenges in large scale machine learning. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, pages 1–6. ACM, 2012.