Challenges in Large Scale
Machine Learning
Sudarsun Santhiappan
RISE Lab
IIT Madras
Disclaimer: The slide contents are retrieved from the free Internet
but cleverly edited, sequenced and stitched for academic lecturing purposes.
2
Why all the hype?
Max Welling, ICML 2014
Big Models Big Data
Peter Norvig, Alon Halevy. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems 2009
3
Scalability
● Strong scaling: if you throw twice as many
machines at the task, you solve it in half the time.
Usually relevant when the task is CPU bound.
● Weak scaling: if the dataset is twice as big, throw
twice as many machines at it to solve the task in
constant time.
Memory bound tasks... usually.
Most “big data” problems are I/O bound. Hard to solve the task in an acceptable
time independently of the size of the data (weak scaling).
4
Why Distributed ML ?
● The usual answer is that data are too big to be
stored in one computer !
● Some say that because “Hadoop”, “Spark” and
“MapReduce” are buzzwords
– No, we should never believe buzzwords
– For most jobs, multi-machine architectures are
probably not required at all.
● Let's argue that things are more complicated than
we thought...
5
When Distributed ML ?
● When the data or the method cannot fit in
one computer system, we consider using
multiple computers to solve the problem.
● Typically, when one or more of the following
grows dramatically, Distributed ML can be considered:
– Number of data points (Sensor networks)
– Number of attributes (Image data)
– Number of model parameters (DeepNN)
6
Some Assumptions..
 Much of the current work in large scale
learning makes the standard assumptions
about the data:
– That it is drawn IID from a stationary distribution
– That linear time algorithms are cheap, while
super-linear time algorithms are expensive.
– All the data are available and ready for use.
7
Big Data! assumptions broken..
● Data is generated in the real world, and hardly anything in the real
world follows a stationary distribution – IID broken.
● Data arrives at different speeds and at different times – Availability
broken.
● Sometimes, offline processing is allowed; so super-linear
algorithms are in fact feasible.
● Sparsity – Is that an advantage or a pain point?
– When the data dimensionality is high, the data set mostly ends up
sparse; ex: text data.
– Applying transformations makes the data dense.
– Continuous vs Categorical vs Cardinal attributes.
8
Size of Training Data
● Say you have 1K labeled and 1M unlabeled
– Labeled/unlabeled ratio: 0.1%
– Is 1K enough to train a supervised model?
● Now you have 1M labeled and 1B unlabeled
– Labeled/unlabeled ratio: 0.1%
– Is 1M enough to train a supervised model?
9
Class Imbalance
● Given a classification problem,
– What is the guarantee that the class proportions are uniform?
● What is the problem if there are imbalanced classes?
● Can't the Machine Learning algorithms handle class
imbalance in classification problems?
● What if the class imbalance is very very skewed?
– Ex: 1% vs 99%
● Why can't accuracy be a good measure here?
– Label everything with the majority label to get 99% accuracy!
10
Curse of Dimensionality
● Exponential increase in volume associated with adding extra
dimensions
● Joint problem of the data and the algorithm being applied
● The expected edge length (e) needed to cover a volume fraction (r) of a
unit hypercube in p dimensions is e_p(r) = r^(1/p).
– For p = 10: covering 1% of the volume requires 63% of the edge
– For p = 10: covering 10% of the volume requires 80% of the edge
● Hughes Effect: With a fixed number of training samples, the predictive
power reduces as the dimensionality increases
● Solve this problem using standard dimensionality reduction techniques
such as Linear Discriminant Analysis, Principal Component Analysis,
SVD, etc.
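A quick numerical check of the edge-length formula above (a minimal Python sketch; the p = 10 case reproduces the 63% and 80% figures):

```python
# Edge length needed to cover a fraction r of a unit hypercube's volume in p dimensions:
# e_p(r) = r ** (1 / p)
for p in (2, 10, 50):
    for r in (0.01, 0.10):
        e = r ** (1.0 / p)
        print(f"p={p:3d}  volume fraction={r:.0%}  edge length={e:.2f}")
# For p = 10: 1% of the volume already needs ~63% of the edge, 10% needs ~80%.
```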
11
Overfitting
● When a lot of data is available, it inherently means the possibility of a lot of
noise!
● Overfit statistical models describe ”noise” instead of the underlying
input-output relationship.
● Generally occurs when a model is excessively complex, such as
having too many parameters relative to the number of observations.
● A model which has been over-fit will generally have poor predictive
performance, as it can exaggerate minor fluctuations in the data.
● Avoid overfitting by:
– Regularization, Cross Validation, Early Stopping, Pruning.
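A minimal sketch of two of the remedies listed above, assuming scikit-learn is available; the synthetic data, the regularization strengths C, and the 5-fold split are illustrative choices, not a prescribed recipe:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: many features, few of them informative, so overfitting is easy.
X, y = make_classification(n_samples=2000, n_features=100, n_informative=10, random_state=0)

# Regularization: smaller C means a stronger L2 penalty, i.e. a simpler model.
for C in (100.0, 1.0, 0.01):
    clf = LogisticRegression(C=C, max_iter=1000)
    # Cross-validation: estimate out-of-sample accuracy instead of training accuracy.
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"C={C:>6}: mean CV accuracy = {scores.mean():.3f}")
```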
12
Method Complexities
 Network latency and disk read times may dominate
the cost of some learning algorithms
– One pass over the data is expensive
– Multiple passes may be out of the question
 Because reading the data dominates costs, we can
do intensive computation in a given locality without
significantly impacting cost
– Read the data once into memory, do several hundred
passes, read the next block, . . .
– Super-linear algorithms aren't so bad?
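A minimal sketch of the block-wise pattern described above: read one block into memory, make many cheap passes over it, then move to the next block. The CSV layout, block size, and the SGDClassifier choice are assumptions for illustration, not part of the original example:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def blocks(path, block_size=100_000):
    """Yield (X, y) blocks from a dense CSV whose last column is the label (assumed format)."""
    with open(path) as f:
        while True:
            rows = [line for _, line in zip(range(block_size), f)]
            if not rows:
                return
            data = np.atleast_2d(np.loadtxt(rows, delimiter=","))
            yield data[:, :-1], data[:, -1]

clf = SGDClassifier()
for X, y in blocks("train.csv"):
    # Reading dominates the cost, so several passes over the in-memory block are cheap.
    for _ in range(100):
        clf.partial_fit(X, y, classes=[0, 1])
```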
13
Slow Arrival Problem
● A lot of big data doesn't arrive all at once. We only get a chunk of
the data every so often
– Transactional data, Sensor data
● Streaming algorithm, incremental updates
– Good, but limits our options somewhat
– Typically have to make choices about how long it takes for data to “expire”
(e.g., learning rate)
● Lazy accumulators / Reservoir sampling
– Lazy algorithms limit options
– Reservoir sampling isn't using all data
– Implicit expiry of data is “never”
● Window-based retraining
– Completely forgets past data
– Window size is an explicit choice
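A minimal sketch of the reservoir sampling option above: keep a fixed-size uniform sample of a stream whose total length is unknown in advance (the capacity k = 1000 is arbitrary):

```python
import random

def reservoir_sample(stream, k=1000):
    """Return k items chosen uniformly at random from an arbitrarily long stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1), replacing a random slot.
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000_000), k=1000)
```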
14
Let’s Start with An Example
● Using a linear classifier, LIBLINEAR (Fan et al., 2008), to train the
rcv1 (RCV1: A New Benchmark Collection for Text Categorization
Research) document data set (Lewis et al., 2004).
● # instances: 677,399, # features: 47,236
● On a typical PC
– $ time ./train rcv1_test.binary
● Total time: 50.88 seconds
● Loading time: 43.51 seconds
● For this example
– loading time >> running time
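A sketch of how the loading-vs-training split could be measured in Python, assuming scikit-learn; the file name mirrors the example above, and LinearSVC stands in for the LIBLINEAR command rather than reproducing it exactly:

```python
import time
from sklearn.datasets import load_svmlight_file
from sklearn.svm import LinearSVC

t0 = time.time()
X, y = load_svmlight_file("rcv1_test.binary")   # loading: dominated by disk I/O
t1 = time.time()
LinearSVC().fit(X, y)                           # training: a few passes over sparse data
t2 = time.time()

print(f"loading  {t1 - t0:.1f} s")
print(f"training {t2 - t1:.1f} s")
```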
15
Loading vs Running Time
● Let’s assume the memory hierarchy contains only disk
● Assume the # of instances is N; Loading time: N × (a big constant);
Running time: N^k × (some constant), where k ≥ 1.
● Traditionally, Machine Learning and Data Mining methods
consider only running time.
– If running time dominates, then we should design algorithms to reduce
number of operations
– If loading time dominates, then we should design algorithms to reduce
number of data accesses
● Distributed environment is another layer of memory hierarchy
– So things become even more complicated
16
Adv. of Distrib. Storage
● One apparent reason for using distributed clusters is that the
data is too large for one disk (ex: petabytes of data)
● Parallel data loading:
– Reading several TB of data from disk ⇒ a few hours
– Using 100 machines,
● each with 1/100 of the data on its local disk ⇒ a few minutes
● Fault tolerance: Some data replicated across machines:
if one fails, others are still available
– how to efficiently/effectively do this is a challenge
17
Distributed Systems
● Distributed file systems
– We need it because a file is now managed at different nodes
– A file is split into chunks and each chunk is replicated
⇒ if some nodes fail, data is still available
– Example: GFS (Google file system), HDFS (Hadoop file system)
● Parallel programming frameworks
– A framework is like a language or a specification. You can then
have different implementations
18
Out of Core Computing
● Problem: Training data does not fit in RAM
● Solution: Lay out data efficiently on disk
and load it as needed into memory.
● Very fast online learning: one thread to read, one to train.
Hashing trick, online error, etc.
● Parallel matrix multiplication: the bottleneck tends to be
CPU–GPU memory transfer.
19
Map-Reduce
● MapReduce (Dean and Ghemawat,
2008). A framework now commonly
used for large-scale data processing
● In MapReduce, every element is a
(key, value) pair
– Mapper: a list of data elements is
provided; each element is
transformed into an output element
– Reducer: all values with the same key
are presented to a single reducer
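A minimal, single-process sketch of the (key, value) pattern above, using word count as the classic example; a real framework would shuffle the pairs across machines rather than through a local dictionary:

```python
from collections import defaultdict

def mapper(document):
    # Each input element is transformed into (key, value) pairs.
    for word in document.split():
        yield word, 1

def reducer(key, values):
    # All values sharing the same key are presented to a single reducer.
    return key, sum(values)

docs = ["big data big models", "big data"]
grouped = defaultdict(list)
for doc in docs:
    for key, value in mapper(doc):
        grouped[key].append(value)              # the "shuffle" step
print(dict(reducer(k, v) for k, v in grouped.items()))
# {'big': 3, 'data': 2, 'models': 1}
```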
20
Map-Reduce:
Statistical Query Model
The sum corresponds to the reduce operation;
f, the map function, is sent to every machine
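A sketch of this statistical query view: the same function f is "mapped" over every partition of the data and the partial results are "reduced" by summation, here for a least-squares gradient (the data, the model, and the four partitions are made up for illustration):

```python
import numpy as np

def local_gradient(w, X_part, y_part):
    # f applied to one machine's partition: gradient of the squared loss on local data.
    return X_part.T @ (X_part @ w - y_part)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(10_000, 20)), rng.normal(size=10_000)
w = np.zeros(20)

partitions = np.array_split(np.arange(len(y)), 4)   # pretend each part lives on a machine
partial = [local_gradient(w, X[idx], y[idx]) for idx in partitions]   # map
full_gradient = np.sum(partial, axis=0)              # reduce = sum

assert np.allclose(full_gradient, X.T @ (X @ w - y))
```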
21
Map Reduce (Salient Points)
● Resilient to failure. HDFS
disk replication.
● Can run on huge clusters.
● Makes use of data locality.
Program (query) is moved to
the data and not the
opposite.
● Map functions must be
stateless: states are lost
between map iterations.
● Computation graph is very
constrained by
independencies.
Not ideal for computation
on arbitrary graphs.
22
From Hadoop to Spark
Disk I/O coupling & stateless mapping make MapReduce ineffective for iterative algorithms
Spark (Zaharia et al, 2010) supports data caching between iterations
23
MPI vs MapReduce
● MPI: communication is explicitly specified
● MPI is like an assembly language
● MPI: sends/receives data to/from a
node’s memory
● MPI: no fault tolerance
● MapReduce: communication is
performed implicitly
● MapReduce is high-level
● MapReduce: communication
involves expensive disk I/O
● MapReduce: supports fault tolerance
MPI (Snir and Otto, 1998) is a parallel programming framework;
MPICH is a portable implementation of the MPI standard. CH stands for Chameleon, a
portability system by William Gropp.
MPICH2 is a popular and widely adopted implementation of MPI, which was used as the foundation
for IBM MPI, Intel MPI, Cray MPI, MS MPI, etc.
24
Some Distributed/Scalable
Machine Learning frameworks
● Vowpal-Wabbit – Out of Core ML framework
● Mahout on Hadoop
● MLlib on Spark
● GraphLab – Vertex Parallel Programming
● Pregel – Large Scale Graph Processing
● ParameterServer – Big Models
25
Evaluation
● Traditionally a parallel
program is evaluated
by scalability
● We expect that when the #
of machines is doubled, the
speedup is doubled
(strong scaling).
● But it does not scale
linearly. Why?
26
Data Locality – One Reason
● Transferring data across networks is slow. So, we should try to
access data from local disk
● Hadoop tries to move computation to the data.
If the data is on node A, try to use node A for the computation
● But most machine learning algorithms are not designed to achieve
good data locality.
● Traditional parallel machine learning algorithms distribute
computation to nodes
– This works well in dedicated parallel machines with fast communication
among nodes
– But in data-center environments this may not work ⇒ communication cost is
very high
27
Summary
● Going to distributed or not is sometimes a
difficult decision
● There are many considerations
– Data already in distributed file systems or not
– The availability of distributed learning algorithms
for your problems
– The effort of writing distributed code
– The selection of parallel frameworks
28
Classification Vs Clustering
● For which type of machine-learning problem is
distributed computing suitable?
Classification: You may not need to use all your training
data. Lots of training data + a so-so method
may not be better than some training data +
an advanced method. [mostly iterative]
Clustering: If you have N data instances, you need
to cluster all of them. [parallelizable]
29
A Multi-class Problem
● Problem: 55M documents, 29 classes, 3-100M features
depending on settings.
● Does this qualify for distributed computing?
● Not necessarily !
● With a 75G RAM machine, LIBLINEAR takes about 20 mins
to train the classifier with 55M docs and 3M features.
● On a single computer, there is room to try different features,
and hence an opportunity to increase accuracy.
● You go to a distributed framework only when you have
designed your experiment completely!
30
A bagging implementation
● Assume data is large, say 1TB. You have 10 machines with 100GB RAM each.
● One way to train this large data is a bagging approach
– machine 1 trains 1/10 data
– machine 2 trains 1/10 data
– ..
– Machine 10 trains 1/10 data
● Then use 10 models for prediction and combine results.
● Obvious Reason: parallel data loading and parallel computation
● But it is not that simple if MapReduce/Hadoop is used.
● HDFS is not designed for easily copying a subset of data to a node!
– Assign 10 different keys to the data, such that each key lands in a separate reducer.
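A single-machine simulation of the bagging scheme above: each "machine" trains on its own 1/10 shard and predictions are combined by majority vote. scikit-learn and LogisticRegression are assumptions for illustration; any base learner would do:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=50_000, n_features=50, random_state=0)
shards = np.array_split(np.arange(len(y)), 10)       # 1/10 of the data per "machine"

# Each "machine" trains its own model on its shard.
models = [LogisticRegression(max_iter=1000).fit(X[idx], y[idx]) for idx in shards]

def bagged_predict(models, X_new):
    votes = np.stack([m.predict(X_new) for m in models])   # one row of votes per model
    return (votes.mean(axis=0) > 0.5).astype(int)           # majority vote for 0/1 labels

print((bagged_predict(models, X[:1000]) == y[:1000]).mean())
```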
31
Before going to
Distributed Computing...
● Remember that they (ex: Hadoop) are not
designed in particular for machine learning
applications.
● We need to know when and where they are
suitable to be used.
● Also, whether your data are already in
distributed systems or not is important.
32
Distributed Algorithms..
● Let's see how well-known machine learning
algorithms for classification and clustering are
converted into distributed algorithms.
– Distributed K-Means & Parallel Spectral Clustering
Chen et al, IEEE PAMI 2011
– DC-SVM: A Divide-and-Conquer Solver for Kernel Support
Vector Machines
Hsieh, Si, Dhillon, ICML 2014.
– LDA: An Architecture for Parallel Topic Models
Smola, Narayanamurthy, VLDB 2010.
33
Clustering
K-Means, Spectral Clustering
34
Distributed K-Means
● Let's discuss the difference between MPI and
MapReduce implementations of k-means
35
K-Means: MPI
● Broadcast initial centers to all machines
● While not converged
– Each node assigns its data to the k clusters and
computes the local sum of each cluster
– An MPI AllReduce operation obtains the sum over all k clusters to find the new
centers
● Communication versus computation:
– If x ∈ R^d, then each node transfers k × d elements after k × d × N/p operations,
where N: total number of data points and p: number of nodes.
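A sketch of one iteration of the MPI k-means above, using mpi4py as the binding (an assumption; the slides do not name one). Each rank holds its own shard X_local of shape (N/p, d), and only k×d + k numbers are exchanged per iteration:

```python
import numpy as np
from mpi4py import MPI

def kmeans_step(X_local, centers):
    """One k-means iteration; X_local is this rank's shard of shape (N/p, d)."""
    k, d = centers.shape
    # Assign local points to the nearest center.
    dist = ((X_local[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    # Local per-cluster sums and counts: only k*d + k numbers leave this machine.
    local_sum = np.zeros((k, d))
    local_cnt = np.zeros(k)
    for j in range(k):
        mask = labels == j
        local_sum[j] = X_local[mask].sum(axis=0)
        local_cnt[j] = mask.sum()
    comm = MPI.COMM_WORLD
    total_sum = np.zeros_like(local_sum)
    total_cnt = np.zeros_like(local_cnt)
    comm.Allreduce(local_sum, total_sum, op=MPI.SUM)   # the AllReduce from the slide
    comm.Allreduce(local_cnt, total_cnt, op=MPI.SUM)
    return total_sum / np.maximum(total_cnt, 1)[:, None]   # new centers on every rank
```

Run under mpiexec with one process per machine, each process loading its own shard and calling kmeans_step until the centers converge.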
36
K-Means: Map Reduce
● Thomas Jungblut, http://codingwiththomas.blogspot.com/2011/05/k-means-clustering-with-mapreduce.html
● You don’t specifically assign data to nodes
– That is, the data has been stored somewhere on HDFS
● Each instance: a (key, value) pair
– key: its associated cluster center
– value: the instance
● Map: each (key, value) pair finds the closest center and updates the key
● Reduce: For instances with the same key (cluster), calculate the new
cluster center
● Given you don’t control where data points are, it’s unclear how
expensive loading and communication is!!
37
Spectral Clustering
● Input: Data points x_1, . . . , x_n; k: number of desired clusters.
● Construct the similarity matrix S ∈ R^{n×n}.
● Modify S to be a sparse matrix.
● Construct D, the degree matrix.
● Compute the symmetric Laplacian matrix L by
L = I − D^{−1/2} S D^{−1/2}
● Compute the first k eigenvectors of L and construct
V ∈ R^{n×k}, whose columns are the k eigenvectors.
● Use the k-means algorithm to cluster the n rows of V into k groups.
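A small, dense NumPy/SciPy sketch of the pipeline above (the Gaussian similarity and the parameter γ are assumptions; sparsification is skipped for brevity):

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, gamma=1.0):
    # Similarity matrix S (dense here; this O(n^2) object is the bottleneck at scale).
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = np.exp(-gamma * sq_dist)
    np.fill_diagonal(S, 0.0)
    # Degree matrix D and symmetric Laplacian L = I - D^{-1/2} S D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    L = np.eye(len(X)) - (d_inv_sqrt[:, None] * S) * d_inv_sqrt[None, :]
    # First k eigenvectors of L (smallest eigenvalues).
    _, V = eigsh(L, k=k, which="SM")
    # k-means on the n rows of V.
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)
```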
38
Challenges
● Similarity matrix
– Only done once: suitable for MapReduce
– But its size grows as O(n^2)
● First k Eigenvectors
– Implicitly restarted Arnoldi is iterative
– Iterative: not suitable for MapReduce
– MPI is used but no fault tolerance
● Parallel Similarity Matrix using MPI/MapReduce
● Parallel ARPACK using MPI
● Parallel K-Means using MPI/MapReduce
39
Sample Result
● 2,121,863 points and 1,000 classes
40
How to Scale up?
● We can see that the scalability of the eigen-decomposition is not good!
● We can see two bottlenecks
– computation: the O(n^2) similarity matrix
– communication: finding the eigenvectors
● To handle even larger sets we may need to modify the algorithm
● For example, we can use only part of the similarity matrix (e.g., Nystrom
approximation); slightly worse performance, but it may scale up better
● The decision depends on the amount of data and other considerations
41
Classification
Distributed SVM method
42
Support Vector Machines
● Given:
– Training data points x_1, · · · , x_n.
– Each x_i ∈ R^d is a feature vector.
– Consider a simple case with two
classes: y_i ∈ {+1, −1}.
● Goal: Find a hyperplane to
separate these two classes of data:
– if y_i = 1, w^T x_i ≥ 1 − ξ_i
– if y_i = −1, w^T x_i ≤ −1 + ξ_i
43
Linearly non-separable?
● Basis Expansion:
Map data x_i to a higher
dimensional (maybe infinite)
feature space φ(x_i), where they
are linearly separable.
● Kernel trick:
K(x_i, x_j) = φ(x_i)^T φ(x_j).
● Various types of kernels
– Gaussian: K(x, y) = e^{−γ ||x−y||^2}
– Polynomial: K(x, y) = (γ x^T y + c)^d
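The two kernels above written out with NumPy (a minimal sketch; γ, c, and the degree are free parameters):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=0.1):
    # K(x, y) = exp(-gamma * ||x - y||^2), computed for all pairs at once.
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

def polynomial_kernel(X, Y, gamma=1.0, c=1.0, degree=3):
    # K(x, y) = (gamma * x^T y + c)^d
    return (gamma * X @ Y.T + c) ** degree
```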
44
Support Vector Machines
● Training data {y_i, x_i},
– x_i ∈ R^d, i = 1, . . . , N; y_i = ±1
● SVM solves the following optimization problem, with
a regularization term (the standard soft-margin form):
min_{w, ξ} (1/2) w^T w + C ∑_i ξ_i subject to y_i w^T φ(x_i) ≥ 1 − ξ_i, ξ_i ≥ 0
● Decision function, where φ(x) is the basis
expansion function:
f(x) = sign(w^T φ(x))
45
Bottlenecks
● Assume a Gaussian kernel and a square
kernel matrix K(x, y) = e^{−γ ||x−y||^2}:
– space complexity is O(n^2)
– compute time complexity is O(n^2 d)
– Example: when N = 1M, the space required for the
kernel is 1M × 1M × 8 B = 8 TB
● Existing methods try not to use the whole
kernel matrix at the same time.
46
Dual Problem of SVM
● Challenge for solving kernel SVMs:
– Space: O(n^2)
– Time: O(n^3), assuming O(n) support vectors.
● n = Number of variables = Number of samples.
47
Scalability
● LIBSVM takes more than 8 hours to train on a CoverType dataset
with 0.5 million samples (with prediction accuracy 96%).
● Many inexact solvers have been developed:
AESVM (Nadan et al., 2014), Budgeted SVM (Wang et al., 2012), Fastfood (Le et al.,2013),
Cascade SVM (Graf et al., 2005), . . .
1-3 hours, with prediction accuracy 85 − 90%.
● Divide the problem into smaller subproblems – DC-SVM 11 minutes,
with prediction accuracy 96%.
48
DC-SVM with Single Level
Data Division
49
DC-SVM: Conquer step
It is shown that the cluster objective function is dependent on D(π), the between-
cluster error, given σ_n is the smallest eigenvalue of the kernel matrix.
50
Quality of α (solution from subproblems)
51
Kernel K-means clustering
● We want a partition which
– minimizes D(π) = ∑_{i,j: π(x_i) ≠ π(x_j)} |K(x_i, x_j)|, and
– has balanced cluster sizes (for efficient training).
● Kernel k-means gives this but is slow; use two-step kernel
k-means instead:
– Run kernel k-means on a subset of samples of size M << N
to find cluster centers.
– Identify the clusters for the rest of the data.
Software available at: http://www.cs.utexas.edu/~cjhsieh/dcsvm
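A sketch of the two-step idea above: cluster only a random subset of size M, then assign every point to the nearest learned center. For brevity, plain k-means in the input space stands in for kernel k-means, so this illustrates the structure rather than the DC-SVM code itself:

```python
import numpy as np
from sklearn.cluster import KMeans

def two_step_clustering(X, k, M=2000, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: cluster only a random subset of size M << N to find centers.
    subset = rng.choice(len(X), size=min(M, len(X)), replace=False)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[subset])
    # Step 2: assign every point (including the rest of the data) to the nearest center.
    return km.predict(X)
```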
53
Document Modeling
Latent Dirichlet Allocation
54
Topic Models
● Topic Models for Text
– Text Documents are composed of Topics
– Each Topic can generate a set of words
– P(w|d) = ∑_z P(w|z)P(z|d);  P(w,d) = ∑_z P(w|z)P(d|z)P(z)
● Basic idea
– Each word 'w' or document 'd' can be represented as a topic vector
– Each topic can be enumerated as a ranked list of words.
● For a query
– “ice skating”
● LDA (Blei et al., 2003) can infer from “ice” that “skating” is closer to a
topic “sports” rather than a topic “computer”
55
Latent Dirichlet Allocation
● W_ij: the j-th word of the i-th document
● p(w_ij | z_ij, Φ) and p(z_ij | Θ_i): multinomial distributions; that is, w_ij is
drawn from z_ij and Φ, and z_ij is drawn from Θ_i.
● p(Θ_i | α), p(Φ_j | β): Dirichlet
distributions.
● α, β: priors of Θ and Φ,
respectively.
56
Gibbs Sampling
● Maximizing the likelihood is not easy, so
Griffiths and Steyvers (2004) propose using
Gibbs sampling to iteratively estimate the
posterior p(z|w)
● While the model looks complicated, Θ and
Φ can be integrated out to obtain p(w, z | α, β)
● Then at each iteration, only a counting
procedure is needed.
57
LDA Algorithm
● For each iteration
– For each document i
● For each word j in document i
– Sampling and counting
● Distributed learning seems straightforward
– Divide the data across several nodes
– Each node counts its local data
– Models are summed up!
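A toy sketch of the "count locally, then sum" step above: each node builds a word–topic count matrix from its share of documents and the counts are added to form the global model (the topic assignments z are taken as given here; a real Gibbs sampler would resample them every iteration):

```python
import numpy as np

def local_counts(docs, topics, n_words, n_topics):
    """docs: list of word-id lists; topics: matching list of topic-id lists."""
    counts = np.zeros((n_words, n_topics), dtype=np.int64)
    for words, zs in zip(docs, topics):
        for w, z in zip(words, zs):
            counts[w, z] += 1
    return counts

# Pretend each entry of `partitions` lives on a different node.
partitions = [
    ([[0, 1, 2], [2, 3]], [[0, 0, 1], [1, 1]]),
    ([[1, 1, 4]],         [[0, 2, 2]]),
]
global_counts = sum(local_counts(d, t, n_words=5, n_topics=3) for d, t in partitions)
print(global_counts)
```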
58
Smola et al, 2010
● A direct MapReduce implementation may not be efficient due to
I/O at each iteration
● Smola and Narayanamurthy (2010) use quite sophisticated
techniques to get high throughputs
– They don’t partition documents across several machines; otherwise,
machines would need to wait for synchronization
– Instead, they consider several samplers and synchronize between
them
– They used memcached, so data is stored in memory rather than on disk
– They used Hadoop streaming, so C++ rather than Java is used.
59
Conclusion
● Distributed machine learning is still an active research topic
● It is related to both machine learning and systems
● Even if machine learning people don’t develop systems themselves, they
need to know how to choose systems
● An important fact is that existing distributed systems or
parallel frameworks are not particularly designed for machine
learning algorithms
● Machine learning people can
– help to affect how systems are designed
– design new algorithms for existing systems