4. Challenge: Scale
1. Massive amount of examples
► Naïve solutions take days/weeks
2. Billions of features
► Model exceeds memory limits of 1 computer
3. Variety of algorithms
► Different solutions required for scale-up
7. Architecture for Scalable ML
▪ ML Server
► Customized in-memory stores (Hashmap, Matrix)
• Lockless concurrency
• Zero garbage created
► Map/Reduce API to move computing to servers
8. 3 Examples of ML Algorithms
1. Gradient Boosted Decision Tree
► Problem: Training latency
► Solution: Hadoop streaming + MPI
2. Logistic Regression
► Problem: Model size
► Solution: Spark + ML Server
3. Ad-Query Vectors
► Problem: Model size + Training latency
► Solution: Spark + ML Server
9. Algorithm 1: Gradient Boosted Decision Tree
▪ Boosting is sequential
▪ Training takes days for 1000s of features
16. Summary
▪ Scalable machine learning at Yahoo
► critical business: search, advertisement
► daily model training w/ billions of features
▪ Hadoop/YARN plays a central role
► approximate computing
► CPU + GPU
Good afternoon.
I am Andy Feng from Yahoo.
In this talk, I will share our recent effort to enable large-scale machine learning on Hadoop clusters.
In 2013, I talked about Yahoo's adoption of Storm for low-latency processing.
Last year, I described Yahoo's effort to bring Spark onto YARN clusters.
Today, I will cover our progress on machine learning using YARN clusters.
I will cover 3 areas:
WHY Yahoo applies machine learning
WHAT challenges we try to address
HOW we address them
I will wrap up the talk with key lessons learned from our experience.
Let's start with WHY machine learning.
Search is one of the key applications for Yahoo. For a user's search phrase, we construct a result page with organic content together with ads.
To generate the result page, we rank content based on its relevance to the query terms, match ads against the query, and predict the probability of an ad click.
Several machine learning algorithms are applied in this process: decision trees, logistic regression and neural networks.
Machine learning at Yahoo faces a scalability challenge.
1st, the # of training examples. In order to produce an accurate machine learning model, Yahoo examines a massive amount of training examples. For example, we examine several months of user search activity logs.
Typically, we are looking at hundreds of billions of training examples. When naïve solutions are applied, the training process could take several weeks.
Models that represent what happened weeks ago have limited business value, since they don't represent the current state of our users and content.
2nd, the # of features. We need to pick up signals from all possible sources.
It's usual for Yahoo to consider billions of features in our models.
3rd, we use a variety of algorithms, and different solutions are required for scaling up these algorithms.
We want our machine learning algorithms to be massively scalable.
We believe that Hadoop is an ideal platform for scalable machine learning.
Yahoo has one of the largest Hadoop deployments in the world.
At the moment we store 600 PB on 43 thousand nodes.
Last year, we decided to make Hadoop the single best system for running large-scale machine learning applications.
So let's look a bit under the hood.
At Yahoo, our data scientists are applying big-data machine learning on Hadoop clusters daily.
Here is a screenshot from one Hadoop cluster.
In addition to various MapReduce jobs, we have a Spark job for machine learning, and an ML server for managing the data of ML models.
To enable approximate computing, we build machine learning on top of Hadoop, Spark and our machine learning servers.
These servers are a YARN application, specifically designed for machine learning.
All data are stored in memory with customized stores. These stores enable lockless concurrency, and can handle millions of operations per second.
Our servers are implemented in Java, but create zero garbage. This enables us to run training consistently with high throughput, without worrying about garbage collection.
Our API supports asynchronous machine learning and mini-batches. This ensures very fast training by many learners.
To minimize data movement, we enable clients to move computing logic to the servers. For example, we enable MapReduce operations on the servers.
As an example, you may want to perform statistical analysis of large models using MapReduce operations.
Our servers provide built-in support for Hadoop file systems. You can store your models after each training run, and load previously trained models from HDFS.
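To make that workflow concrete, here is a minimal sketch of how a learner could interact with such a server. The class and method names (MLServerClient, fetch, push, map_reduce) are hypothetical stand-ins rather than Yahoo's actual API, and the "server" is just a local dictionary.

```python
# Hypothetical sketch of a learner talking to an ML server that keeps
# model parameters in an in-memory hashmap; names are illustrative only.

class MLServerClient:
    """Toy stand-in for a remote parameter-server connection."""

    def __init__(self):
        self.store = {}          # in the real system this lives remotely, in memory

    def fetch(self, keys):
        # Pull current parameter values for the requested feature keys.
        return {k: self.store.get(k, 0.0) for k in keys}

    def push(self, updates):
        # Apply updates without global locks (last writer wins).
        for k, delta in updates.items():
            self.store[k] = self.store.get(k, 0.0) + delta

    def map_reduce(self, map_fn, reduce_fn, init):
        # Run analysis (e.g. model statistics) on the server side,
        # so large models never move over the network.
        acc = init
        for k, v in self.store.items():
            acc = reduce_fn(acc, map_fn(k, v))
        return acc


client = MLServerClient()
client.push({"feature:42": 0.1, "feature:7": -0.05})
# e.g. count non-zero parameters without downloading the model
nonzero = client.map_reduce(lambda k, v: 1 if v != 0 else 0, lambda a, b: a + b, 0)
print(nonzero)
```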
Let me share 3 success stories about machine learning algorithms.
Our 1st story illustrates how Hadoop and MPI could dramatically reduce training latency.
Our 2nd story shows how Spark and YARN enable training of very large machine learning models.
Our 3rd story will attack both model size and training latency.
Let's start with our 1st story.
In gradient boosted decision trees, we represent the model as a collection of decision trees. Each tree node represents a decision point using one feature.
By a top-down walk through these trees, you reach leaf nodes with numerical values. Adding those numerical values together gives our prediction for a given example.
To achieve high accuracy, we tend to have many trees. In Yahoo's use cases, we use thousands of trees.
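As a tiny illustration of that prediction step (not Yahoo's implementation), each tree is walked top-down and the leaf values are summed:

```python
# Minimal GBDT prediction sketch: each tree is a nested dict with either a
# split ("feature", "threshold", "left", "right") or a leaf ("value").

def predict_one(trees, example):
    score = 0.0
    for tree in trees:
        node = tree
        while "value" not in node:                 # walk top-down to a leaf
            if example[node["feature"]] < node["threshold"]:
                node = node["left"]
            else:
                node = node["right"]
        score += node["value"]                     # add up the leaf values
    return score

# Two tiny made-up trees; a model at Yahoo's scale uses thousands of them.
trees = [
    {"feature": "age", "threshold": 30,
     "left": {"value": 0.2}, "right": {"value": -0.1}},
    {"feature": "clicks", "threshold": 5,
     "left": {"value": -0.05}, "right": {"value": 0.3}},
]
print(predict_one(trees, {"age": 25, "clicks": 7}))   # 0.2 + 0.3 = 0.5
```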
To construct such trees, we have to build them one by one. Within each tree, we have to build it layer by layer.
For each node, we need to select the best feature and the best value for the split.
If you use a single machine, the training process could take several days.
At Yahoo, we developed a GBDT algorithm on top of Hadoop and MPI, and achieved a 30X speed-up.
More specifically, we use Hadoop Streaming to launch multiple GBDT workers.
We partition training examples by columns, instead of rows.
Each worker has a subset of features for all training examples.
During the training, each worker performs local computation to identify the best splits for its feature set.
We then apply an MPI allreduce operation to decide the global best split among all features,
and broadcast that best split to all workers.
We repeat this process for all tree nodes.
At the end, we have a collection of trees as our model.
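Here is a rough sketch of that column-partitioned split search. It uses mpi4py and fabricated data as illustrative assumptions; the split statistics are heavily simplified, and the object-level allreduce with MPI.MAX simply compares the (gain, feature, threshold) tuples as Python values. It is not Yahoo's production code.

```python
# Rough sketch of column-partitioned GBDT split search with an MPI allreduce.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=rank)
n_examples, n_local_features = 1000, 4           # each worker owns a feature subset
X_local = rng.random((n_examples, n_local_features))
residuals = rng.normal(size=n_examples)          # gradients from previous trees

def best_local_split(X, r):
    """Return (gain, feature_id, threshold) for this worker's features."""
    best = (-np.inf, -1, 0.0)
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            left, right = r[X[:, j] < t], r[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            # variance-reduction style gain (simplified)
            gain = len(left) * left.mean() ** 2 + len(right) * right.mean() ** 2
            if gain > best[0]:
                best = (gain, rank * n_local_features + j, float(t))
    return best

local_best = best_local_split(X_local, residuals)
# Allreduce with MAX picks the globally best split and gives it to every worker.
global_best = comm.allreduce(local_best, op=MPI.MAX)
if rank == 0:
    print("global best split:", global_best)
```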
In this approach, we can use tens of Hadoop mappers, and fully utilize their computation power.
We achieved a 30 times speedup for Yahoo use cases. For a training job that previously took days,
we can now produce decision trees in about 1 hour.
Our 2nd story is about logistic regression.
For a given feature vector X, logistic regression predicts the outcome by applying a logistic function to the dot product of the parameter vector beta and the feature vector.
During the training phase, we try to find the best parameters beta from the training examples.
If we have 2 parameters, logistic regression finds a line in 2-dimensional space that best fits our examples.
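In formula form, the prediction described above is

$$ p(y = 1 \mid x) \;=\; \frac{1}{1 + e^{-\beta \cdot x}} $$

and training searches for the beta that makes these predictions fit the observed outcomes as closely as possible.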
The scalability challenge is around the # of parameters. We want to produce models with over 100B parameters.
Assuming that each parameter uses 16 bytes, storing our parameters requires 1.6 TB.
We could not store the model in memory of a single computer.
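The arithmetic behind that estimate, using the numbers just mentioned:

$$ 10^{11} \text{ parameters} \times 16 \text{ bytes} = 1.6 \times 10^{12} \text{ bytes} = 1.6 \text{ TB} $$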
To enable large models, we decided to use multiple servers in a YARN cluster. Each server keeps a subset of parameters in memory.
We launch logistic regression learners as a Spark job on YARN. Each learner will cover a subset of training examples from HDFS.
For each example, we fetch the current parameter values from the servers, compute the gradient, and update the servers with the latest values.
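A minimal sketch of that learner loop, assuming the parameter servers are faked as a local dictionary and hypothetical fetch/push helpers stand in for the real remote calls:

```python
# Sketch of one learner's loop in the Spark-on-YARN logistic regression job.
import math

servers = {}                      # feature -> parameter value; sharded across servers in reality

def fetch(keys):
    return {k: servers.get(k, 0.0) for k in keys}

def push(updates):
    for k, delta in updates.items():
        servers[k] = servers.get(k, 0.0) + delta   # asynchronous, no locking

def train_partition(examples, learning_rate=0.1):
    """Each Spark task runs this over its partition of training examples."""
    for features, label in examples:               # features: dict of feature -> value
        beta = fetch(features.keys())              # pull only the parameters we touch
        score = sum(beta[k] * v for k, v in features.items())
        prediction = 1.0 / (1.0 + math.exp(-score))
        error = label - prediction
        push({k: learning_rate * error * v for k, v in features.items()})

# Toy partition: (sparse features, click / no-click label)
train_partition([({"query:weather": 1.0, "ad:umbrella": 1.0}, 1),
                 ({"query:weather": 1.0, "ad:car": 1.0}, 0)])
print(servers)
```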
This new architecture enables us to scale up learning 1000 times.
Our previous models had thousands or millions of parameters, and our new models now have billions of parameters.
All learners perform learning independently. There is no synchronization across learners at all.
Therefore, we can learn from a massive amount of training data very quickly.
As a result, our model w/ billions of parameters is significantly more precise than our previous models.
That has brought us meaningful business impact.
Our 3rd story is related to search query and ads.
In this case, we are learning numerical vectors of search queries and ads from user session logs.
From these vectors, we will be able to know that query terms âsan jose weatherâ and âweather 95113â are essentially identical.
We learn vectors from users' search sessions. Each search session will have a collection of query terms and ads.
Ad/query vectors are learned by applying n-gram techniques. Details of our algorithm are explained in a recent conference paper from Yahoo Labs.
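As a toy illustration of what these vectors buy us (the vectors below are made up, not learned), near-identical queries end up close under cosine similarity:

```python
# Queries that mean the same thing end up close in vector space.
import numpy as np

vectors = {
    "san jose weather": np.array([0.90, 0.10, 0.40]),
    "weather 95113":    np.array([0.85, 0.12, 0.42]),
    "cheap flights":    np.array([0.10, 0.90, 0.20]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["san jose weather"], vectors["weather 95113"]))  # ~1.0: near-identical
print(cosine(vectors["san jose weather"], vectors["cheap flights"]))  # much lower
```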
In this use case, we have 2 problems.
First, we have billions of query terms. If each vector has 300 dimensions, we will need 2.4 TB of memory for vector storage. That's way beyond a typical computer today.
Second, vector calculation is very expensive. For each search session, we need to perform hundreds of vector operations such as multiplications and additions. Even for a relatively small dataset, training could take weeks.
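Assuming roughly 2 billion query/ad terms stored as 4-byte floats (both are assumptions on my part; only the 300 dimensions and the 2.4 TB total are given above), the memory estimate works out as:

$$ 2 \times 10^{9} \text{ terms} \times 300 \text{ dims} \times 4 \text{ bytes} = 2.4 \times 10^{12} \text{ bytes} = 2.4 \text{ TB} $$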
For computing vectors of queries and ads, we use a set of matrix servers on a YARN cluster.
Each server has a subset of the columns of our matrix.
These servers have built-in matrix operations such as vector multiplication and addition.
We use a Spark job to launch multiple learners on a YARN cluster.
Each learner will examine a subset of training dataset.
To reduce data movement, we conduct the majority of the computation on the servers.
For each training example,
we let each server produce negative examples, and calculate gradients locally.
Then, our learner calculates a global coefficient based on each server's partial gradients.
Finally, we let each server adjust its vectors.
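A simplified sketch of that protocol, with the remote matrix servers faked as local objects and the server-side negative sampling omitted for brevity; the class and method names are illustrative assumptions, not the real system:

```python
# Each "server" holds a slice of the vector dimensions, computes its partial
# dot product locally, the learner turns the summed result into a global
# coefficient, and every server updates its own slice.
import numpy as np

DIM, SHARDS = 300, 3

class MatrixShard:
    """Holds dimensions [lo:hi) of every query/ad vector."""
    def __init__(self, lo, hi, vocab):
        rng = np.random.default_rng(lo)
        self.rows = {w: rng.normal(scale=0.1, size=hi - lo) for w in vocab}

    def partial_dot(self, q, a):
        return float(self.rows[q] @ self.rows[a])

    def update(self, q, a, coeff, lr=0.05):
        # Gradient step on this shard's slice only.
        gq = coeff * self.rows[a]
        ga = coeff * self.rows[q]
        self.rows[q] += lr * gq
        self.rows[a] += lr * ga

vocab = ["san jose weather", "ad:umbrella"]
bounds = np.linspace(0, DIM, SHARDS + 1, dtype=int)
shards = [MatrixShard(lo, hi, vocab) for lo, hi in zip(bounds[:-1], bounds[1:])]

def train_pair(q, a, label):
    score = sum(s.partial_dot(q, a) for s in shards)    # partial results, summed by the learner
    coeff = label - 1.0 / (1.0 + np.exp(-score))        # single global coefficient
    for s in shards:
        s.update(q, a, coeff)                            # each server adjusts its slice

train_pair("san jose weather", "ad:umbrella", label=1)
```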
This distributed solution enables us to train vectors within a few hours.
Remember that such a task previously took several weeks.
That is a 100X speedup using YARN.
From these use cases, we learned one important lesson. That is, big-data approximate computing can produce more accurate models.
In all use cases, we use a set of computers to learn from a dataset, and produce a mathematical model.
We want each learner to conduct its learning as fast as it can. We don't want any synchronization across learners.
We even let learners overwrite each other's updates in the shared model.
Each execution may produce slightly different result.
We are performing approximate computing on YARN.
At the end, we produce a mathematical model with a large # of parameters.
Since this model represents the signals from a massive amount of data,
our model is more accurate than previous models built with precise computing.
In summary, Yahoo has made significant progress on scalable machine learning.
We conduct daily training w/ billions of signals for our critical business such as search and advertisement.
Hadoop and YARN are playing a central role in this evolution. On the YARN cluster, we built a framework for approximate computing.
We are currently exploring both GPU and CPU in a single cluster.