MapReduce: Distributed Computing for Machine Learning
                            Dan Gillick, Arlo Faria, John DeNero
                                        December 18, 2006


                                               Abstract
         We use Hadoop, an open-source implementation of Google’s distributed file system and the
     MapReduce framework for distributed data processing, on modestly-sized compute clusters to
     evaluate its efficacy for standard machine learning tasks. We show benchmark performance
     on searching and sorting tasks to investigate the effects of various system configurations. We
     also distinguish classes of machine-learning problems that are reasonable to address within the
     MapReduce framework, and offer improvements to the Hadoop implementation. We conclude
     that MapReduce is a good choice for basic operations on large datasets, although there are
     complications to be addressed for more complex machine learning tasks.


1    Introduction
The Internet provides a resource for compiling enormous amounts of data, often beyond the capacity
of individual disks, and too large for processing with a single CPU. Google’s MapReduce [3], built
on top of the distributed Google File System [4], provides a parallelization framework which has
garnered considerable acclaim for its ease-of-use, scalability, and fault-tolerance. As evidence of its
utility, Google points out that thousands of MapReduce applications have already been written by
its employees in the short time since the system was deployed [3]. Recent conference papers have
investigated MapReduce applications for machine learning algorithms run on multicore machines
[1] and MapReduce is discussed by systems luminary Jim Gray in a recent position paper [6]. Here
at Berkeley, there is even discussion of incorporating MapReduce programming into undergraduate
Computer Science classes as an introduction to parallel computing.
    The success at Google prompted the development of the Hadoop project [2], an open-source
attempt to reproduce Google’s implementation, hosted as a sub-project of the Apache Software
Foundation’s Lucene search engine library. Hadoop is still in early stages of development (latest
version: 0.9.2) and has not been tested on clusters of more than 1000 machines (Google, by contrast,
often runs jobs on many thousands of machines). The discussion that follows addresses the practical
application of the MapReduce framework to our research interests.
    Section 2 introduces the compute environments at our disposal. Section 3 provides benchmark
performance on standard MapReduce tasks. Section 4 outlines applications to a few relevant ma-
chine learning problems, discussing the virtues and limitations of MapReduce. Section 5 discusses
our improvements to the Hadoop implementation.




2     Compute Resources
While large companies may have thousands of networked computers, most academic researchers
have access to more modest resources. We installed Hadoop on the NLP Millennium cluster in
Soda Hall and at the International Computer Science Institute (ICSI), which are representative of
typical university research environments.
   Table 1 gives the details of the two cluster configurations used for the experiments in this
paper. The NLP cluster is a rack of high-performance computing servers, whereas the ICSI cluster
comprises a much larger number of personal desktop machines. It should be noted that ICSI also
has a compute cluster of dedicated servers: over 20 machines, each with up to 8 multicore processors
and 32 GB RAM, and a 1 Gbps link to terabyte RAID arrays. Compute jobs at ICSI are typically
exported to these compute servers using pmake-customs software.1 For this paper, we demonstrate
that the desktop machines at ICSI can also be used for effective parallel programming, providing
an especially useful alternative for I/O-intensive applications which would otherwise be crippled by
the pmake-customs network bottleneck.

                                              NLP Cluster       ICSI Desktop Cluster
                     Machines                 9                 80
                     CPUs per machine         2x 3.4 GHz        1x 2.4 GHz
                     RAM per machine          4 GB              1 GB
                     Storage per machine      300 GB            30 GB
                     Network Bandwidth        1 Gbit/s          100 Mbit/s

                       Table 1: Two clusters used in this paper’s experiments.



3     Benchmarks
We evaluate the performance of the 80-node ICSI desktop cluster to establish benchmark results
for given tasks and configurations. Following the presentation in [3], we consider two computations
that typify the majority of MapReduce algorithms: a grep that searches through a large amount of
data, and a sort that transforms it from one representation to another. These two tasks represent
extreme cases for the reduce phase, in which either none or all of the input data is shuffled across
the network and written as output.
    The data in these experiments are derived from the TeraSort benchmark [5], in which 100-byte
records comprise a random 10-byte sorting key and 90 bytes of “payload” data. In the comparable
experiments in [3], a cluster of 1800 machines was used to process a terabyte of data; due to our
limited resources, we scaled the experiments down to 10 gigabytes of data – i.e., our input was 10^8
100-byte records.
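For concreteness, data of this shape can be generated with a few lines of code. The sketch below is ours (not the official TeraSort/gensort generator; function names and the constant payload are illustrative) and emits 100-byte records, each with a random 10-byte key:

```python
import random
import string

def make_record(rng):
    """One 100-byte record: a random 10-byte key plus 90 bytes of payload."""
    key = "".join(rng.choices(string.ascii_letters + string.digits, k=10))
    payload = "x" * 90  # payload content is irrelevant to the benchmark
    return key + payload

def generate(n_records, seed=0):
    """Deterministically generate n_records records under a fixed seed."""
    rng = random.Random(seed)
    return [make_record(rng) for _ in range(n_records)]

records = generate(1000)
```

Scaling `n_records` to 10^8 reproduces the 10 GB input used in our experiments.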
    We display our results in the same format as [3], showing the data transfer rates over the time
of the computation, distinguishing the map phase (reading input from local disks) from the reduce
phase (shuffling intermediate data over the network and writing the output to the distributed file
system).
  1 This was Adam de Boor’s CS 262 final project from 1989, and was updated by Andreas Stolcke for use at
ICSI and SRI.




Figure 1: Effect of varying the size of map tasks. Left: with 600 map tasks (16MB input splits),
performance suffers from repeated initialization costs. Right: with 75 map tasks (128MB input
splits), a high sustained read rate is attainable, but the coarse granularity does not allow good
load-balancing.


3.1    Distributed Grep
The distributed grep program searches for a character pattern and writes intermediate data only
when a match is encountered. To simplify this further, we can consider an empty map function,
which reads the input and never writes intermediate records. Using the Hadoop streaming protocol
[2], which reads records from STDIN and writes to STDOUT, the map function is trivial:
#!/usr/bin/perl
while(<STDIN>){
  # do nothing
}
The reduce function is not needed since there is no intermediate data. Thus, this contrived program
can be used to measure the maximal input data read rate for the map phase.
    Figure 1 shows the effect of varying the number of map tasks (equivalent to varying the size of an
input split). When the number of map tasks is high, reasonable load balancing is possible because
faster machines will have a chance to run more tasks than slower machines. Also, because the tasks
finish more quickly, the variance in running times is lower, and the effect of long-running “stragglers”
is minimized. However, each map task incurs some amount of startup overhead, evidenced by the
periodic bursts of data transfer in Figure 1 (left) caused by many tasks finishing and starting at
the same time. Thus, it could be desirable to minimize initialization overhead by having a smaller
number of map tasks, at the expense of poor load balancing, as in Figure 1 (right).
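The startup-overhead side of this tradeoff can be made concrete with a toy cost model. All constants below are illustrative rather than measured, and the model deliberately ignores the straggler and load-balancing effects that favor finer granularity:

```python
def map_phase_time(total_mb, n_tasks, n_nodes, startup_s, read_mb_s):
    """Crude model: each map task pays a fixed startup cost, then reads its
    input split; tasks run in even "waves" across the nodes."""
    split_mb = total_mb / n_tasks
    per_task = startup_s + split_mb / read_mb_s
    waves = -(-n_tasks // n_nodes)  # ceiling division: waves of tasks per node
    return waves * per_task

# Illustrative numbers loosely matching the 10 GB experiment on 80 nodes.
fine   = map_phase_time(10_000, 600, 80, startup_s=5, read_mb_s=30)
coarse = map_phase_time(10_000, 75, 80, startup_s=5, read_mb_s=30)
```

Under this model the coarse configuration wins, because startup cost is paid 600 times in the fine one; the figure shows why the real picture is less one-sided.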
    Our 80-node cluster reaches a maximum data read rate of about 400 MB/s, or about 5 MB/s
per node. This is worse than Google’s 1800-node cluster [3], which peaked at 30 GB/s, or about
15 MB/s per node. The individual desktop hard disks at ICSI were probably mostly idle at the
time, and were capable of reaching about 30-40 MB/s read rate (cold cache results), so the cluster’s
performance is far from its theoretical maximum efficiency. It is not clear whether this is due to
the distributed file system implementation or the MapReduce overhead.

3.2    Distributed Sort
The distributed sort program is a more interesting example because it involves a costly reduce
phase. Input data is read, all of it is written as intermediate data to the local disk buffer, it is

Figure 2: Effect of varying the number of reduce tasks in the distributed sort. Left: with many
reduce tasks, load is well balanced across machines. Right: with few reduce tasks, sorting the
larger reduce splits stalls the shuffle and write phases, and the coarse granularity does not provide
ideal load-balancing.


then transmitted over the network to the appropriate reduce task, which sorts by key, and finally
writes the output. The shuffle phase begins as soon as the fastest map task is done and ends only
after the slowest map task has finished. The reducer then sorts the data in memory and writes the
output with triple replication to the distributed file system.
    Using the Hadoop streaming protocol, the map function is an identity function which inserts a
tab to delimit key-value pairs:
#!/usr/bin/perl
while(<STDIN>) {
  s/(\s)/\t\1/;
  print STDOUT;
}
The reduce function is an identity function which simply removes the tab:
#!/usr/bin/perl
while(<STDIN>) {
  s/\t//;
  print STDOUT;
}
    Figure 2 demonstrates the effect of varying the number of reduce tasks (inversely, the size of the
reduce splits). When there are many reducers, there is good load balancing, as in Figure 2 (left).
Additionally, because sorting is O(n log n), having larger reduce splits will cause a greater amount
of time to be spent sorting the keys. This can be observed in Figure 2 (right): about 300 seconds
into the process, no data is being shuffled or written, because all reducers are busy sorting
the large number of keys assigned to them. In general, it would seem that having more reducers is
preferable; this of course must be weighed against the initialization overhead for each task, and the
practical inconvenience that each reducer writes to a different output file.
    Note that for this task the map phase achieves a peak read rate of about 200 MB/s, compared to
400 MB/s for the grep task. Because the map task must write its intermediate data to the local disk
buffer, it effectively halves the read rate. One potential solution to this problem is to use separate
hard disks for the local buffer and the local portion of the distributed file system; on the ICSI
desktops, for example, there is a small secondary hard disk which is intended for swap space but

could serve for the local buffer as well. Another possible solution is to implement a memory-based
buffer that allows streaming of intermediate data prior to the completion of a map task.2
    In the shuffle phase, the maximum transfer rate is about 150 MB/s; the theoretical limit, given
a 100 Mbps network of 80 machines with no other traffic, should be about 500 MB/s of aggregate
bandwidth, since each shuffled byte consumes bandwidth at both its sender and its receiver. Note
that the transfer rate for the write phase of the reduce operations is considerably less than the
read rate; this is because each block of the output files is written to the distributed file system
as three replicas sent over the network to different machines. Thus, when the shuffle phase overlaps
with the write phase, as in Figure 2 (left), the network experiences extra traffic.


4       Applications to Machine Learning
Many standard machine learning algorithms follow one of a few canonical data processing patterns,
which we outline below. A large subset of these can be phrased as MapReduce tasks, illuminating
the benefits that the MapReduce framework offers to the machine learning community [1].
   In this section, we investigate the performance trade-offs of using MapReduce from an algorithm-
centric perspective, considering in turn three classes of ML algorithms and the issues of adapting
each to a MapReduce framework. The performance that results depends intimately on the design
choices underlying the MapReduce implementation, and how well those choices support the data
processing pattern of the ML algorithm. We conclude this section with a discussion of changes
and extensions to the Hadoop MapReduce implementation that would benefit the machine learning
community.

4.1      A Taxonomy of Standard Machine Learning Algorithms
While ML algorithms can be classified on many dimensions, the one we take primary interest in here
is that of procedural character: the data processing pattern of the algorithm. Here, we consider
single-pass, iterative and query-based learning techniques, along with several example algorithms
and applications.

4.1.1        Single-pass Learning
Many ML applications make only one pass through a data set, extracting relevant statistics for later
use during inference. This relatively simple learning setting arises often in natural language pro-
cessing, from machine translation to information extraction to spam filtering. These applications
often fit perfectly into the MapReduce abstraction, encapsulating the extraction of local contribu-
tions to the map task, then combining those contributions to compute relevant statistics about the
dataset as a whole. Consider the following examples, illustrating common decompositions of these
statistics.

Estimating Language Model Multinomials: Extracting language models from a large corpus
     amounts to little more than counting n-grams, though some parameter smoothing over the
     statistics is also common. The map phase enumerates the n-grams in each training instance
     (typically a sentence or paragraph), and the reduce function counts instances of n-grams.
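This decomposition is small enough to sketch in full. The following Python stand-in for a pair of streaming scripts (ours; the in-memory driver takes the place of Hadoop's shuffle) counts bigrams:

```python
from collections import Counter

def map_ngrams(sentence, n=2):
    """Map: emit one (ngram, 1) pair per n-gram occurrence in a sentence."""
    words = sentence.split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n]), 1

def reduce_counts(pairs):
    """Reduce: sum the counts for each n-gram key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

corpus = ["the cat sat", "the cat ran"]
counts = reduce_counts(p for s in corpus for p in map_ngrams(s))
# ("the", "cat") occurs twice across the corpus
```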
    2 This   option has been investigated as part of Alex Rasmussen’s Hadoop-related CS 262 project this semester.




Feature Extraction for Naive Bayes Classifiers: Estimating parameters for a naive Bayes clas-
     sifier, or any fully observed Bayes net, again requires counting occurrences in the training data.
     In this case, however, feature extraction is often computation-intensive, perhaps involving
     small search or optimization problems for each datum. The reduce task, however, remains a
     summation of each (feature, label) environment pair.

Syntactic Translation Modeling: Generating a syntactic model for machine translation is an
    example of a research-level machine learning application that involves only a single pass
    through a preprocessed training set. Each training datum consists of a pair of sentences in
    two languages, an estimated alignment between the words in each, and an estimated syntactic
    parse tree for one sentence.3 The per-datum feature extraction encapsulated in the map phase
    for this task involves search over these coupled data structures.

4.1.2     Iterative Learning
The class of iterative ML algorithms – perhaps the most common within the machine learning
research community – can also be expressed within the framework of MapReduce by chaining
together multiple MapReduce tasks [1]. While such algorithms vary widely in the type of operation
they perform on each datum (or pair of data) in a training set, they share the common characteristic
that a set of parameters is matched to the data set via iterative improvement. The update to these
parameters across iterations must again decompose into per-datum contributions, which is the case
of the example applications below. As with the examples discussed in the previous section, the
reduce function is considerably less compute-intensive than the map tasks.
    In the examples below, the contribution to parameter updates from each datum (the map
function) depends in a meaningful way on the output of the previous iteration. For example, the
expectation computation of EM or the inference computation in an SVM or perceptron classifier
can reference a large portion or all of the parameters generated by the algorithm. Hence, these
parameters must remain available to the map tasks in a distributed environment. The information
necessary to compute the map step of each algorithm is described below; the complications that
arise because this information is vital to the computation are investigated later in the paper.

Expectation Maximization (EM): The well-known EM algorithm maximizes the likelihood of
    a training set given a generative model with latent variables. The E-step of the algorithm
    computes posterior distributions over the latent variables given current model parameters
    and the observed data. The maximization step adjusts model parameters to maximize the
    likelihood of the data assuming that latent variables take on their expected values. Projecting
    onto the MapReduce framework, the map task computes posterior distributions over the latent
    variables of a datum using current model parameters; the maximization step is performed as
    a single reduction, which sums the sufficient statistics and normalizes to produce updated
    parameters.
        We consider applications for machine translation and speech recognition. For multivariate
        Gaussian mixture models (e.g., for speaker identification), these parameters are simply the
        mean vector and a covariance matrix. For HMM-GMM models (e.g., speech recognition),
        parameters are also needed to specify the state transition probabilities; the models, efficiently
        stored in binary form, occupy tens of megabytes. For word alignment models (e.g., machine
        translation), these parameters include word-to-word translation probabilities; these can
        number in the millions, even after pruning heuristics remove the unnecessary parameters.
   3 Generating appropriate training data for this task involves several applications of iterative learning
algorithms, described in the following section.
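The E-step/M-step decomposition above can be sketched for a two-component, fixed-variance Gaussian mixture in one dimension. The Python below simulates the map stage (per-datum posteriors and sufficient statistics) and the reduce stage (summation and normalization) in-process; all names and constants are ours, chosen for illustration:

```python
import math

def e_step(x, means, var=1.0):
    """Map task: posterior over the two components for one datum, emitted
    as per-component sufficient statistics (responsibility, weighted value)."""
    likes = [math.exp(-(x - m) ** 2 / (2 * var)) for m in means]
    z = sum(likes)
    return [(k, (l / z, (l / z) * x)) for k, l in enumerate(likes)]

def m_step(stats):
    """Reduce task: sum sufficient statistics per component and normalize
    to produce updated means."""
    totals = {}
    for k, (r, rx) in stats:
        n, s = totals.get(k, (0.0, 0.0))
        totals[k] = (n + r, s + rx)
    return [s / n for k, (n, s) in sorted(totals.items())]

data = [-2.1, -1.9, 2.0, 2.2]
means = [-1.0, 1.0]
for _ in range(10):  # one MapReduce job per EM iteration
    means = m_step(stat for x in data for stat in e_step(x, means))
# means converge near the two cluster centers (about -2.0 and 2.1)
```

Note that `e_step` needs the full `means` vector for every datum: this is exactly the broadcast-state requirement discussed in the text.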
Discriminative Classification and Regression: When fitting model parameters via a percep-
     tron, boosting, or support vector machine algorithm for classification or regression, the map
     stage of training will involve computing inference over the training example given the current
     model parameters. Similar to the EM case, a subset of the parameters from the previous iter-
     ation must be available for inference. However, the reduce stage typically involves summing
     over parameter changes.
        Thus, all relevant model parameters must be broadcast to each map task. In the case of
        a typical featurized setting that often extracts hundreds or thousands of features from each
        training example, the relevant parameter space needed for inference can be quite large.
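As a minimal illustration of this pattern, the sketch below (ours; a plain in-memory simulation, not Hadoop code) runs one perceptron training iteration per "MapReduce job", with the current weight vector playing the role of broadcast state:

```python
def perceptron_map(datum, weights):
    """Map task: score one example under the broadcast weights and emit a
    per-feature update if it is misclassified."""
    x, y = datum  # feature dict, label in {-1, +1}
    score = sum(weights.get(f, 0.0) * v for f, v in x.items())
    if y * score <= 0:  # misclassified: emit one update per active feature
        return [(f, y * v) for f, v in x.items()]
    return []

def sum_updates(pairs):
    """Reduce task: sum parameter changes by feature."""
    out = {}
    for f, delta in pairs:
        out[f] = out.get(f, 0.0) + delta
    return out

weights = {}
data = [({"a": 1.0}, 1), ({"b": 1.0}, -1)]
for _ in range(3):  # one MapReduce job per training iteration
    updates = sum_updates(u for d in data for u in perceptron_map(d, weights))
    for f, delta in updates.items():
        weights[f] = weights.get(f, 0.0) + delta
```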

4.1.3     Query-based Learning with Distance Metrics
Finally, we consider distance-based ML applications that directly reference the training set dur-
ing inference, such as the nearest-neighbor classifier. In this setting, the training data are the
parameters, and a query instance must be compared to each training datum.
    While it is the case that multiple query instances can be processed simultaneously within a
MapReduce implementation of these techniques, the query set must be broadcast to all map tasks.
Again, we have a need for the distribution of state information. However, in this case, the query
information that must be distributed to all map tasks needn’t be processed concurrently – a query
set can be broken up and processed over multiple MapReduce operations. In the examples below,
each query instance tends to be of a manageable size.

K-nearest Neighbors Classifier The nearest-neighbor classifier compares each element of a query
    set to each element of a training set, and discovers examples with minimal distances from the
    queries. The map stage computes distance metrics, while the reduce stage tracks k examples
    for each label that have minimal distance to the query.
Similarity-based Search Finding the most similar instances to a given query has a similar char-
     acter, sifting through the training set to find examples that minimize a distance metric.
     Computing the distance is the map stage, while minimizing it is the reduce stage.
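Both distance-based patterns reduce to the same two-stage sketch. The code below is ours (an in-process simulation with illustrative data; squared Euclidean distance stands in for whatever metric an application would use):

```python
import heapq

def knn_map(train_datum, query):
    """Map task: distance from one training example to the broadcast query."""
    features, label = train_datum
    dist = sum((a - b) ** 2 for a, b in zip(features, query))
    return dist, label

def knn_reduce(scored, k):
    """Reduce task: keep the k training examples nearest to the query."""
    return heapq.nsmallest(k, scored)

train = [((0.0, 0.0), "A"), ((1.0, 1.0), "A"), ((5.0, 5.0), "B")]
query = (0.5, 0.5)
nearest = knn_reduce((knn_map(d, query) for d in train), k=2)
labels = [label for _, label in nearest]
# the two nearest neighbors are both labeled "A"
```

As in the iterative case, `query` must be available to every map task, which motivates the broadcasting support discussed in Section 5.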

4.2      Performance and Implementation Issues
While the algorithms discussed above can all be implemented in parallel using the MapReduce ab-
straction, our example applications from each category revealed a set of implementation challenges.
We conducted all of our experiments on top of the Hadoop platform. In the discussion below, we
will address issues related both to the Hadoop implementation of MapReduce and the MapReduce
framework itself.

4.2.1     Single-pass Learning
The single-pass learning algorithms described in the previous section are clearly amenable to the
MapReduce framework. We will focus here on the task of generating a syntactic translation model
from a set of sentence pairs, their word-level bilingual alignment, and their syntactic structure

(phrase structure tree). This task has several properties that distinguish it from benchmarking
applications like sort and grep.

Figure 3: Generating syntactic translation models with Hadoop: (left) The benefit of distributed
computation quickly outweighs the overhead of a MapReduce implementation on a 3-node cluster.
(right) Exporting the data to the distributed file system incurs cost nearly equal to that of the
computation itself.
   • The map stage processes several concurrent data sources (sentences, alignments and trees)
     that combine to form a complex data structure for each example.
   • The map stage computation requires non-trivial computation with potentially exponential
     asymptotic running time in the size of the syntax tree (limited in practice to be linear with a
      moderate coefficient). Additionally, the 4,000-line code base can consume tens of megabytes
     of memory during the processing of one example.
   • Hundreds of output keys can be generated for each training example.
    Figure 3 shows the running times for various input sizes, demonstrating the overhead of running
MapReduce relative to the reference implementation. The cost of running Hadoop cancels out some
of the benefit of parallelizing the code. Specifically, running on 3 machines gave a speed-up of 39%
over the reference implementation. The overhead of simulating a MapReduce computation on a
single machine was 51% of the compute cost of the reference implementation. Distributing the task
to a large cluster would clearly justify this overhead, but parallelizing to two machines would give
virtually no benefit for the largest data set size we tested.
    A more promising metric shows that as the size of the data scales, the distributed MapReduce
implementation maintains a low per-datum cost. We can isolate the variable-cost overhead of each
example by comparing the slopes of the curves in Figure 3(a), which are all near-linear. The reference
implementation shows a variable computation cost of 1.7 seconds per 1000 examples, while the
distributed implementation across three machines shows a cost of 0.5 seconds per 1000 examples.
So, the variable overhead of MapReduce is minimal for this task, while the static overhead of
distributing the code base and channeling the processing through Hadoop’s infrastructure is large.
We would expect that substantially increasing the size of the training set would accentuate the
utility of distributing the computation.
    Thus far, we have assumed that the training data was already written to Hadoop’s distributed file
system (DFS). The cost of distributing data is relevant in this setting, however, due to drawbacks of

Hadoop’s implementation of DFS. In the simple case of text processing, a training corpus need only
be distributed to Hadoop’s file system once, and can be processed by many different applications.
On the other hand, this application references four different input sources, including sentences,
alignments and parse trees for each example. When copying these resources independently to
the DFS, the Hadoop implementation gives no control over how those files are mapped to remote
machines. Hence, no one machine necessarily contains all of the data necessary for a given example.
    Instead of distributing the training corpus in its raw, multi-file form, we must bundle the relevant
parts of each example into a complex object and then transfer that object onto the Hadoop file
system, thereby guaranteeing that the parts of each training example remain grouped. Having to
repackage the data in this way has negative consequences; changes to the implementation of the
application require redistribution of the training data, a costly activity. Applications that share
some resources but not all will require redundant copies of their common subset. In sum, more
transfers to the DFS will be required during the typical course of research. Figure 3(b) shows the
cost breakdown of first distributing the training data to the DFS, then extracting statistics from
it. When we include the cost of distribution, the parallel implementation actually underperforms
the single-threaded reference. Because the training data must be bundled in an application-specific
format, one can expect to incur this distribution cost regularly.
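The bundling step described above amounts to joining the parallel input files by example index and serializing each example as one record. A sketch of that join (ours; JSON is chosen purely for illustration, since the actual bundle format is application-specific):

```python
import json

def bundle(sentences, alignments, trees):
    """Join parallel input sources by example index and serialize each
    training example as a single record, so that all parts of an example
    land on the same machine when the bundled file is split."""
    for sent, align, tree in zip(sentences, alignments, trees):
        yield json.dumps({"sentence": sent, "alignment": align, "tree": tree})

sentences = ["le chat ||| the cat"]
alignments = [[(0, 0), (1, 1)]]
trees = ["(NP (DT the) (NN cat))"]
records = list(bundle(sentences, alignments, trees))
```

The cost noted in the text follows directly: any change to this record layout forces the whole corpus to be re-serialized and re-copied to the DFS.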

4.2.2    Iterative Learning
Implementing two iterative learning algorithms gave rise to more challenges than the single-pass
example. As mentioned in the previous section, the map stage of each iteration depends upon
the parameters generated by the reduce phase of the previous iteration. We chose two example
applications: unsupervised word alignment for machine translation is an application of EM that
generates word-to-word alignments. The k-means algorithm is an unsupervised clustering technique
based on hard-EM.
    The word alignment task accentuates the problem of sharing state information because each pair
of word types that co-occur in a training example have their own parameter that must be referenced
during the map stage (E-step) of the algorithm. In a typical sentence pair of two 40-word sentences,
1,600 of these parameters must be available to the map function for each sentence pair.
    Bundling the relevant state information with each training example adds substantially to the
size of the training set. For instance, a pair of 40-word sentences would be accompanied by 6 KB
of parameter data on top of its 1 KB of text data. Even if parameters are bundled with small groups
of sentences, the overhead of bundling parameters is high (Figure 4(a)).
    Rather than bundling parameters with the input data, we would prefer to broadcast these
parameters to all map nodes before each iteration of EM and reference them from within the
map tasks on those machines. However, not only does Hadoop not support static reference to
shared content across map tasks, the implementation also prevented us from adding this feature. In
Hadoop, each map task runs in its own Java Virtual Machine (JVM), leaving no access to a shared
memory space across multiple map tasks running on the same node.
    Hadoop does provide some minimal support for sharing parameters in that its streaming mode4
allows the distribution of a file to all compute nodes (once per machine) that can be referenced by
a map task. In this way, data can be shared across map tasks without incurring redundant transfer
costs. However, this work-around requires that the map task be written as a separate application,
eliminating the type-safety guarantees of an integrated Java program. Furthermore, it still requires
that each map task load a distinct copy of its parameters into the heap of its JVM.
   4 The Hadoop streaming API allows arbitrary applications to compute the map and reduce functions by
reference to the operating system.

Figure 4: Benefits of Data Broadcasting: (left) The improved (smaller) ratio of relevant parameters
to training words when batch-processing data argues for parameter broadcasting over bundling.
(right) Bundling query data with training examples rather than broadcasting hinders a classifier’s
performance.
    We extended Hadoop to provide similar support for Java-internal map and reduce functions.
Common state information is serialized and distributed to all compute nodes, where it can be
loaded and referenced statically. In particular, we support the dispersion of state information via a
networked file server to which all nodes have access. Note that this state file sharing is independent
of Hadoop’s distributed file system. We were still constrained to reload the state data for each map
task.

4.2.3                                        Query-based Learning with Distance Metrics
To investigate whether Hadoop’s MapReduce implementation was appropriate for distance-based
learning algorithms, we implemented a nearest neighbor classifier to predict the genre of a movie
from reviews provided by Netflix users.5
    Like iterative learning, the query-based setting requires that we distribute common data to the
map tasks: in this case, the query example. Figure 4(b) demonstrates the benefit of statically
broadcasting the query sentence with our extension, rather than bundling the query with each
training example.
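To illustrate why broadcasting helps here, the following toy nearest-neighbor mapper (our own sketch, not the Netflix classifier itself) holds the query vector in a static field set once per JVM; each map call then touches only its own training example, and the reduce step keeps the closest label.

```java
import java.util.*;

// Illustrative query-broadcast pattern for nearest-neighbor
// classification: the query is broadcast once, not bundled with
// every training record.
public class NearestNeighborMap {
    private static double[] query;  // broadcast once, read by all map calls

    public static void setQuery(double[] q) { query = q; }

    // map: one training example in, one (distance, label) pair out
    public static Map.Entry<Double, String> map(double[] features, String label) {
        double d = 0.0;
        for (int i = 0; i < features.length; i++) {
            double diff = features[i] - query[i];
            d += diff * diff;
        }
        return new AbstractMap.SimpleEntry<>(Math.sqrt(d), label);
    }

    // reduce: keep the label of the closest training example
    public static String reduce(List<Map.Entry<Double, String>> pairs) {
        return pairs.stream().min(Map.Entry.comparingByKey()).get().getValue();
    }

    public static void main(String[] args) {
        setQuery(new double[] {1.0, 0.0});
        List<Map.Entry<Double, String>> out = new ArrayList<>();
        out.add(map(new double[] {0.9, 0.1}, "comedy"));
        out.add(map(new double[] {5.0, 5.0}, "horror"));
        System.out.println(reduce(out));  // prints comedy
    }
}
```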


5 Tailoring MapReduce for Machine Learning

5.1 Broadcasting and Parallel Files
MapReduce holds great potential for parallelizing machine learning applications. However, section
4.2 raises two major issues with the Hadoop implementation of MapReduce that we would hope to
address.
    First, we would advocate a shared space for the map tasks on a name node. Neither the MapReduce framework nor the Hadoop implementation supports efficient distribution and referencing of common data (such as model parameters or query examples). The ability to broadcast state information, whether filtered for each input or common across all map tasks, is necessary to support many machine learning algorithms.

    5 This classification approach makes the assumption that people tend to review many movies of the same genre, and express rating preferences that are genre-dependent.
    The Hadoop streaming API provides a useful answer to this gap in functionality by allowing the
user to include a file that is distributed to each compute node. This file can store the necessary state
information, and is available to each map function. Our implementation of similar functionality for
Java-internal map and reduce functions provides the means to distribute state information in an
efficient manner. However, the state information can only be shared across multiple map tasks via
reference to the file system because each runs in its own virtual machine. Ideally, a MapReduce
implementation for machine learning would expose shared memory to all map tasks on a compute
node to eliminate this overhead.
    Second, we desired the ability to tie parallel files together on the distributed file system, and
allow the input reader for MapReduce tasks to parse multiple parallel files. Co-locating mutually
referential file segments in this way would eliminate the need to bundle input data in application-
specific formats. For instance, text could be decoupled from its many possible annotations (parse
trees, part-of-speech tags, and word alignment, for example) on the DFS. With this functionality,
a data set could reside permanently on the DFS, shared across multiple applications that would
perhaps combine it with other related data on an ad hoc basis.
    Implementing support for tying together parallel files proved challenging within the Hadoop
source code. Much of the existing code assumes that input data is packaged in one file that can be
distributed across the network independently of other data.
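A minimal sketch of the parallel-file reader we have in mind — the class and method names are hypothetical, not Hadoop's — zips line-aligned files (here, sentences and their part-of-speech tag sequences) into single logical records, so the text and its annotations can live in separate DFS files.

```java
import java.io.*;
import java.util.*;

// Sketch of an input reader over parallel files: two line-aligned
// sources are read in lockstep, and each aligned pair is yielded as one
// logical record, avoiding application-specific bundled formats.
public class ParallelRecordReader {
    public static List<String[]> read(Reader text, Reader tags) throws IOException {
        BufferedReader t = new BufferedReader(text);
        BufferedReader g = new BufferedReader(tags);
        List<String[]> records = new ArrayList<>();
        String a, b;
        while ((a = t.readLine()) != null && (b = g.readLine()) != null) {
            records.add(new String[] {a, b});  // one aligned record
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        Reader text = new StringReader("the dog barks\n");
        Reader tags = new StringReader("DT NN VBZ\n");
        for (String[] rec : read(text, tags)) {
            System.out.println(rec[0] + " / " + rec[1]);
        }
    }
}
```

The hard part, as noted above, is not the zipping itself but co-locating the aligned file segments on the DFS so that both halves of each record are local to the same map task.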

5.2 Static Type Checking and Convenience Methods
Hadoop proved somewhat difficult to use for reasons that we have also addressed during the course
of this project. First, the architecture expects that a Hadoop binary will be invoked from the
command line for all MapReduce jobs, which unpacks a referenced jar containing map and reduce
functions. The command-line tools also awkwardly decouple manipulating the DFS from running
MapReduce jobs. We prefer an architecture that invokes Hadoop DFS and MapReduce tasks from within our own applications, integrating the standard sequence of serializing data and loading it onto the DFS, running jobs, and retrieving the distributed output. We have built a task manager that conveniently provides this functionality, along with translation between Hadoop's custom serialization format and the Java standard.
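The serialization translation can be sketched as a generic wrapper that carries any java.io.Serializable value through a Writable-style record interface. The Writable interface below is a local stand-in mirroring Hadoop's (which declares the same two methods); the WritableBox wrapper is our own illustration, not part of either library.

```java
import java.io.*;

// Sketch of bridging Hadoop-style record serialization and standard
// Java serialization: a Serializable value is carried as a
// length-prefixed byte payload inside a Writable-style record.
public class SerializableBridge {
    interface Writable {  // local stand-in for org.apache.hadoop.io.Writable
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    static class WritableBox<T extends Serializable> implements Writable {
        T value;

        public void write(DataOutput out) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(buf);
            oos.writeObject(value);
            oos.flush();
            byte[] bytes = buf.toByteArray();
            out.writeInt(bytes.length);  // length-prefixed payload
            out.write(bytes);
        }

        @SuppressWarnings("unchecked")
        public void readFields(DataInput in) throws IOException {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            try {
                value = (T) new ObjectInputStream(
                    new ByteArrayInputStream(bytes)).readObject();
            } catch (ClassNotFoundException e) {
                throw new IOException(e);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        WritableBox<String> out = new WritableBox<>();
        out.value = "hello";
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        out.write(new DataOutputStream(raw));

        WritableBox<String> in = new WritableBox<>();
        in.readFields(new DataInputStream(
            new ByteArrayInputStream(raw.toByteArray())));
        System.out.println(in.value);  // prints hello
    }
}
```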
    Our task manager also provides another missing feature: proper static type checking. Hadoop's heavy use of Java's reflection facilities leaves many type errors unchecked until runtime. Through our lightweight wrapper package, which manages all interaction with Hadoop, we enforce static type checking across data distribution, map and reduce functions, and data retrieval. All unchecked type casts and reflection operations are encapsulated within our package, shielding users from the risks of calling Hadoop's functions directly.
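The idea can be illustrated with a small generic interface (the names are ours, not Hadoop's or our released package's): type parameters fix the key and value types at compile time, so a mismatched mapper fails to compile rather than throwing a ClassCastException at runtime, while any unchecked operations stay hidden behind the wrapper.

```java
import java.util.*;
import java.util.function.BiConsumer;

// Illustration of restoring static type checking with generics: the
// Mapper interface pins input and output types at compile time, so
// javac rejects mismatched jobs instead of deferring errors to runtime.
public class TypedJob {
    interface Mapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, BiConsumer<K2, V2> emit);
    }

    // Runs a mapper over in-memory records; type arguments are checked
    // by the compiler, with no reflection visible to the user.
    static <K1, V1, K2, V2> List<Map.Entry<K2, V2>> run(
            Mapper<K1, V1, K2, V2> m, Map<K1, V1> input) {
        List<Map.Entry<K2, V2>> out = new ArrayList<>();
        input.forEach((k, v) ->
            m.map(k, v, (k2, v2) -> out.add(new AbstractMap.SimpleEntry<>(k2, v2))));
        return out;
    }

    public static void main(String[] args) {
        // Word-count map: (lineNo, line) -> (word, 1)
        Mapper<Integer, String, String, Integer> wc =
            (k, line, emit) -> {
                for (String w : line.split(" ")) emit.accept(w, 1);
            };
        List<Map.Entry<String, Integer>> out =
            run(wc, Map.of(1, "to be or not to be"));
        System.out.println(out.size());  // prints 6
    }
}
```

Passing `wc` to a job expecting, say, `Mapper<Integer, String, Integer, Double>` is a compile-time error here, whereas a reflection-based configuration would only fail once the job ran.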
    Finally, we provide facilities for launching MapReduce jobs remotely without requiring that the
request originator install the array of binaries that are bundled with Hadoop. Using our package,
no installation is required to originate jobs.
    These convenience improvements allowed for our rapid development of several machine learning
applications. We hope to release our package to the Berkeley community in short order.




6    Conclusion
By virtue of its simplicity and fault tolerance, MapReduce proved to be an admirable gateway to parallelizing machine learning applications. The benefits of easy development and robust computation did come at a price in performance, however. Furthermore, as the benchmarks in section 3 showed, performance varies substantially with how the system is tuned.
    Defining common settings for machine learning algorithms led us to discover the core shortcomings of Hadoop's MapReduce implementation. We were able to address the most significant: the need to broadcast data to map tasks. We also greatly improved the convenience of running MapReduce jobs from within existing Java applications. However, tying together the distribution of parallel files on the DFS remains an outstanding challenge.
    All in all, MapReduce represents a promising direction for future machine learning implementa-
tions. When continuing to develop Hadoop and tools that surround it, we must strive to minimize
the compromises between convenience and performance, providing a platform that allows for effi-
cient processing and rapid application development.


References
[1] Cheng-Tao Chu et al. Map-Reduce for machine learning on multicore. In NIPS, 2007.
[2] Doug Cutting et al. About Hadoop. http://lucene.apache.org/hadoop/about.html.
[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In ACM OSDI, 2004.
[4] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In ACM SOSP, 2003.
[5] Jim Gray. Sort benchmark home page. http://research.microsoft.com/barc/SortBenchmark.
[6] Jim Gray et al. Scientific data management in the coming decade. Technical Report MSR-TR-2005-10, Microsoft Research, 2005.





MapReduce: Distributed Computing for Machine Learning

  • 1. MapReduce: Distributed Computing for Machine Learning Dan Gillick, Arlo Faria, John DeNero December 18, 2006 Abstract We use Hadoop, an open-source implementation of Google’s distributed file system and the MapReduce framework for distributed data processing, on modestly-sized compute clusters to evaluate its efficacy for standard machine learning tasks. We show benchmark performance on searching and sorting tasks to investigate the effects of various system configurations. We also distinguish classes of machine-learning problems that are reasonable to address within the MapReduce framework, and offer improvements to the Hadoop implementation. We conclude that MapReduce is a good choice for basic operations on large datasets, although there are complications to be addressed for more complex machine learning tasks. 1 Introduction The Internet provides a resource for compiling enormous amounts of data, often beyond the capacity of individual disks, and too large for processing with a single CPU. Google’s MapReduce [3], built on top of the distributed Google File System [4] provides a parallelization framework which has garnered considerable acclaim for its ease-of-use, scalability, and fault-tolerance. As evidence of its utility, Google points out that thousands of MapReduce applications have already been written by its employees in the short time since the system was deployed [3]. Recent conference papers have investigated MapReduce applications for machine learning algorithms run on multicore machines [1] and MapReduce is discussed by systems luminary Jim Gray in a recent position paper [6]. Here at Berkeley, there is even discussion of incorporating MapReduce programming into undergraduate Computer Science classes as an introduction to parallel computing. 
The success at Google prompted the development of the Hadoop project [2], an open-source attempt to reproduce Google’s implementation, hosted as a sub-project of the Apache Software Foundation’s Lucene search engine library. Hadoop is still in early stages of development (latest version: 0.92) and has not been tested on clusters of more than 1000 machines (Google, by contrast, often runs jobs on many thousands of machines). The discussion that follows addresses the practical application of the MapReduce framework to our research interests. Section 2 introduces the compute environments at our disposal. Section 3 provides benchmark performance on standard MapReduce tasks. Section 4 outlines applications to a few relevant ma- chine learning problems, discussing the virtues and limitations of MapReduce. Section 5 discusses our improvements to the Hadoop implementation. 1
  • 2. 2 Compute Resources While large companies may have thousands of networked computers, most academic researchers have access to more modest resources. We installed Hadoop on the NLP Millennium cluster in Soda Hall and at the International Computer Science Institute (ICSI), which are representative of typical university research environments. Table 1 gives the details of the two cluster configurations used for the experiments in this paper. The NLP cluster is a rack of high-performance computing servers, whereas the ICSI cluster comprises a much larger number of personal desktop machines. It should be noted that ICSI also has a compute cluster of dedicated servers: over 20 machines, each with up to 8 multicore processors and 32 GB RAM, and a 1 Gbps link to terabyte RAID arrays. Compute jobs at ICSI are typically exported to these compute servers using pmake-customs software.1 For this paper, we demonstrate that the desktop machines at ICSI can also be used for effective parallel programming, providing an especially useful alternative for I/O-intensive applications which would otherwise be crippled by the pmake-customs network bottleneck. NLP Cluster ICSI Desktop Cluster Machines 9 80 CPUs per machine 2x 3.4 GHz 1x 2.4 GHz RAM per machine 4 GB 1 GB Storage per machine 300 GB 30 GB Network Bandwidth 1 GBit/s 100 MBit/s Table 1: Two clusters used in this paper’s experiments. 3 Benchmarks We evaluate the performance of the 80-node ICSI desktop cluster to establish benchmark results for given tasks and configurations. Following the presentation in [3], we consider two computations that typify the majority of MapReduce algorithms: a grep that searches through a large amount of data, and a sort that transforms it from one representation to another. These two tasks represent extreme cases for the reduce phase, in which either none or all of the input data is shuffled across the network and written as output. 
The data in these experiments are derived from the TeraSort benchmark [5], in which 100-byte records comprise a random 10-byte sorting key and 90 bytes of “payload” data. In the comparable experiments in [3], a cluster of 1800 machines was used to process a terabyte of data; due to our limited resources, we scaled the experiments down to 10 gigabytes of data – i.e., our input was 108 100-byte records. We display our results in the same format as [3], showing the data transfer rates over the time of the computation, distinguishing the map phase (reading input from local disks) from the reduce phase (shuffling intermediate data over the network and writing the output to the distributed file system). 1 This was Adam de Boor’s CS 262 final project from 1989, and was updated by Andreas Stolcke for use at ICSI and SRI. 2
  • 3. Figure 1: Effect of varying the size of map tasks. Left: with 600 map tasks (16MB input splits), performance suffers from repeated initialization costs. Right: with 75 map tasks (128MB input splits), a high sustained read rate is attainable, but the coarse granularity does not allow good load-balancing. 3.1 Distributed Grep The distributed grep program searches for a character pattern and writes intermediate data only when a match is encountered. To simplify this further, we can consider an empty map function, which reads the input and never writes intermediate records. Using the Hadoop streaming protocol [2], which reads records from STDIN and writes to STDOUT, the map function is trivial: #!/usr/bin/perl while(<STDIN>){ # do nothing } The reduce function is not needed since there is no intermediate data. Thus, this contrived program can be used to measure the maximal input data read rate for the map phase. Figure 1 shows the effect of varying the number of map tasks (equivalent to varying the size of an input split). When the number of map tasks is high, reasonable load balancing is possible because faster machines will have a chance to run more tasks than slower machines. Also, because the tasks finish more quickly, the variance in running times is lower, and the effect of long-running “stragglers” is minimized. However, each map task incurs some amount of startup overhead, evidenced by the periodic bursts of data transfer in Figure 1 (left) caused by many tasks finishing and starting at the same time. Thus, it could be desirable to minimize initialization overhead by having a smaller number of map tasks, at the expense of poor load balancing, as in Figure 1 (right). Our 80-node cluster reaches a maximum data read rate of about 400 MB/s, or about 5 MB/s per node. This is worse than Google’s 1800-node cluster [3], which peaked at 30GB/s, or about 15 MB/s per node. 
The individual desktop hard disks at ICSI were probably mostly idle at the time, and were capable of reaching about 30-40 MB/s read rate (cold cache results), so the cluster’s performance is far from its theoretical maximum efficiency. It is not clear whether this is due to the distributed file system implementation or the MapReduce overhead. 3.2 Distributed Sort The distributed sort program is a more interesting example because it involves a costly reduce phase. Input data is read, all of it is written as intermediate data to the local disk buffer, it is 3
  • 4. Figure 2: Effect of varying the size of map tasks. Left: 600 map tasks (16 MB input splits) suffers from repeated initialization costs. Right: 75 map tasks (128 MB input splits) allows higher sustained read rate, but the coarse granularity does not provide ideal load-balancing. then transmitted over the network to the appropriate reduce task, which sorts by key, and finally writes the output. The shuffle phase begins as soon as the fastest map task is done and ends only after the slowest map task has finished. The reducer then sorts the data in memory and writes the output with triple replication to the distributed file system. Using the Hadoop streaming protocol, the map function is an identity function which inserts a tab to delimit key-value pairs: #!/usr/bin/perl while(<STDIN>) { s/(s)/t1/; print STDOUT; } The reduce function is an identity function which simply removes the tab: #!/usr/bin/perl while(<STDIN>) { s/t//; print STDOUT; } Figure 2 demonstrates the effect of varying the number of reduce tasks (inversely, the size of the reduce splits). When there are many reducers, there is good load balancing, as in Figure 2 (left). Additionally, because sorting is O(n log n), having larger reduce splits will cause a greater amount of time to be spent sorting the keys. This can be observed in Figure 2 (right): about 300 seconds into the process, there is no data is being shuffled or written, because all reducers are busy sorting the large number of keys assigned to them. In general, it would seem that having more reducers is preferable; this of course must be weighed against the initialization overhead for each task, and the practical inconvenience that each reducer writes to a different output file. Note that for this task the map phase achieves a peak read rate of about 200 MB/s, compared to 400 MB/s for the grep task. Because the map task must write its intermediate data to the local disk buffer, it effectively halves the read rate. 
One potential solution to this problem is to use separate hard disks for the local buffer and the local portion of the distributed file system; on the ICSI desktops, for example, there is a small secondary hard disk which is intended for swap space but 4
  • 5. could serve for the local buffer as well. Another possible solution is to implement a memory-based buffer that allows streaming of intermediate data prior to the completion of a map task.2 In the shuffle phase, the maximum transfer rate is about 150 MB/s; the theoretical limit, given a 100 Mbps network of 80 machines with no other traffic, should be about 500 MB/s aggregate bandwidth. Note that the transfer rates for the write phase of the reduce operations is considerably less than the read rate; this is because each block of the output files is written to the distributed file system as three replicas sent over the network to different machines. Thus, when the shuffle phase overlaps with the write phase, as in Figure 2 (left), the network can be experiencing extra traffic. 4 Applications to Machine Learning Many standard machine learning algorithms follow one of a few canonical data processing patterns, which we outline below. A large subset of these can be phrased as MapReduce tasks, illuminating the benefits that the MapReduce framework offers to the machine learning community [1]. In this section, we investigate the performance trade-offs of using MapReduce from an algorithm- centric perspective, considering in turn three classes of ML algorithms and the issues of adapting each to a MapReduce framework. The performance that results depends intimately on the design choices underlying the MapReduce implementation, and how well those choices support the data processing pattern of the ML algorithm. We conclude this section with a discussion of changes and extensions to the Hadoop MapReduce implementation that would benefit the machine learning community. 4.1 A Taxonomy of Standard Machine Learning Algorithms While ML algorithms can be classified on many dimensions, the one we take primary interest in here is that of procedural character: the data processing pattern of the algorithm. 
Here, we consider single-pass, iterative, and query-based learning techniques, along with several example algorithms and applications.

4.1.1 Single-pass Learning

Many ML applications make only one pass through a data set, extracting relevant statistics for later use during inference. This relatively simple learning setting arises often in natural language processing, from machine translation to information extraction to spam filtering. These applications often fit perfectly into the MapReduce abstraction, encapsulating the extraction of local contributions in the map task, then combining those contributions to compute relevant statistics about the dataset as a whole. Consider the following examples, illustrating common decompositions of these statistics.

Estimating Language Model Multinomials: Extracting language models from a large corpus amounts to little more than counting n-grams, though some parameter smoothing over the statistics is also common. The map phase enumerates the n-grams in each training instance (typically a sentence or paragraph), and the reduce function counts the instances of each n-gram.

² This option has been investigated as part of Alex Rasmussen's Hadoop-related CS 262 project this semester.
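To make this single-pass decomposition concrete, the following is a minimal sketch (ours, not code from the original system) of n-gram counting phrased as map and reduce functions, with the shuffle simulated in memory; the function names and toy corpus are illustrative assumptions.

```python
from collections import defaultdict
from itertools import chain

def map_ngrams(sentence, n=2):
    """Map phase: emit an (n-gram, 1) pair for each n-gram in one sentence."""
    tokens = sentence.split()
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n]), 1

def reduce_counts(ngram, values):
    """Reduce phase: sum the counts emitted for one n-gram."""
    return ngram, sum(values)

def count_ngrams(corpus, n=2):
    """Simulate the shuffle: group map outputs by key, then reduce each group."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(map_ngrams(s, n) for s in corpus):
        grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())

corpus = ["the cat sat", "the cat ran"]
counts = count_ngrams(corpus, n=2)  # ("the", "cat") appears in both sentences
```

Any smoothing of the resulting multinomials would be applied after the reduce phase, as the text notes.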
Feature Extraction for Naive Bayes Classifiers: Estimating parameters for a naive Bayes classifier, or any fully observed Bayes net, again requires counting occurrences in the training data. In this case, however, feature extraction is often computation-intensive, perhaps involving small search or optimization problems for each datum. The reduce task, however, remains a summation over each (feature, label) pair.

Syntactic Translation Modeling: Generating a syntactic model for machine translation is an example of a research-level machine learning application that involves only a single pass through a preprocessed training set. Each training datum consists of a pair of sentences in two languages, an estimated alignment between the words in each, and an estimated syntactic parse tree for one sentence.³ The per-datum feature extraction encapsulated in the map phase for this task involves search over these coupled data structures.

4.1.2 Iterative Learning

The class of iterative ML algorithms – perhaps the most common within the machine learning research community – can also be expressed within the framework of MapReduce by chaining together multiple MapReduce tasks [1]. While such algorithms vary widely in the type of operation they perform on each datum (or pair of data) in a training set, they share the common characteristic that a set of parameters is matched to the data set via iterative improvement. The update to these parameters across iterations must again decompose into per-datum contributions, which is the case for the example applications below. As with the examples discussed in the previous section, the reduce function is considerably less compute-intensive than the map tasks. In the examples below, the contribution to parameter updates from each datum (the map function) depends in a meaningful way on the output of the previous iteration.
For example, the expectation computation of EM or the inference computation in an SVM or perceptron classifier can reference a large portion or all of the parameters generated by the algorithm. Hence, these parameters must remain available to the map tasks in a distributed environment. The information necessary to compute the map step of each algorithm is described below; the complications that arise because this information is vital to the computation are investigated later in the paper.

Expectation Maximization (EM): The well-known EM algorithm maximizes the likelihood of a training set given a generative model with latent variables. The E-step of the algorithm computes posterior distributions over the latent variables given the current model parameters and the observed data. The maximization step adjusts the model parameters to maximize the likelihood of the data, assuming that the latent variables take on their expected values. Projecting onto the MapReduce framework, the map task computes posterior distributions over the latent variables of a datum using the current model parameters; the maximization step is performed as a single reduction, which sums the sufficient statistics and normalizes to produce updated parameters. We consider applications in machine translation and speech recognition. For multivariate Gaussian mixture models (e.g., for speaker identification), these parameters are simply a mean vector and a covariance matrix. For HMM-GMM models (e.g., speech recognition), parameters are also needed to specify the state transition probabilities; the models, efficiently stored in binary form, occupy tens of megabytes. For word alignment models (e.g., machine translation), these parameters include word-to-word translation probabilities; these can number in the millions, even after pruning heuristics remove unnecessary parameters.

Discriminative Classification and Regression: When fitting model parameters via a perceptron, boosting, or support vector machine algorithm for classification or regression, the map stage of training involves computing inference over each training example given the current model parameters. As in the EM case, a subset of the parameters from the previous iteration must be available for inference, while the reduce stage typically involves summing over parameter changes. Thus, all relevant model parameters must be broadcast to each map task. In a typical featurized setting, which often extracts hundreds or thousands of features from each training example, the relevant parameter space needed for inference can be quite large.

³ Generating appropriate training data for this task involves several applications of iterative learning algorithms, described in the following section.

4.1.3 Query-based Learning with Distance Metrics

Finally, we consider distance-based ML applications that directly reference the training set during inference, such as the nearest-neighbor classifier. In this setting, the training data are the parameters, and a query instance must be compared to each training datum. While multiple query instances can be processed simultaneously within a MapReduce implementation of these techniques, the query set must be broadcast to all map tasks. Again, we have a need for the distribution of state information. However, in this case, the query information that must be distributed to all map tasks needn't be processed concurrently – a query set can be broken up and processed over multiple MapReduce operations. In the examples below, each query instance tends to be of a manageable size.
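The query-broadcast pattern just described can be sketched as follows (our toy illustration, not the paper's code): the query set is broadcast to every map task, each map task scans one shard of the training data, and the reduce stage keeps the nearest matches per query. Squared Euclidean distance and the tiny shards are illustrative assumptions.

```python
import heapq
from collections import defaultdict

def map_distances(training_shard, queries):
    """Map task: compare the broadcast query set against one shard of
    training data, emitting (query_id, (distance, label)) pairs."""
    for qid, query in enumerate(queries):
        for features, label in training_shard:
            dist = sum((a - b) ** 2 for a, b in zip(query, features))
            yield qid, (dist, label)

def reduce_nearest(qid, pairs, k=1):
    """Reduce task: keep the k training examples nearest to each query."""
    return qid, heapq.nsmallest(k, pairs)

# Two map tasks over two shards; the query set is broadcast to both.
shards = [[((0.0, 0.0), "a"), ((1.0, 1.0), "a")],
          [((5.0, 5.0), "b"), ((6.0, 5.0), "b")]]
queries = [(0.2, 0.1), (5.5, 5.0)]

grouped = defaultdict(list)  # simulated shuffle, keyed by query id
for shard in shards:
    for qid, pair in map_distances(shard, queries):
        grouped[qid].append(pair)

neighbors = dict(reduce_nearest(qid, pairs) for qid, pairs in grouped.items())
# The first query lands nearest label "a", the second nearest label "b".
```

Note that the training data never moves: only the (small) query set is shipped to the map tasks, which is what makes this pattern tolerable despite the need to distribute state.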
K-nearest Neighbors Classifier: The nearest-neighbor classifier compares each element of a query set to each element of a training set, discovering the examples with minimal distance from each query. The map stage computes distance metrics, while the reduce stage tracks, for each label, the k examples with minimal distance to the query.

Similarity-based Search: Finding the instances most similar to a given query has a similar character, sifting through the training set to find examples that minimize a distance metric. Computing the distance is the map stage, while minimizing it is the reduce stage.

4.2 Performance and Implementation Issues

While the algorithms discussed above can all be implemented in parallel using the MapReduce abstraction, our example applications from each category revealed a set of implementation challenges. We conducted all of our experiments on top of the Hadoop platform. In the discussion below, we address issues related both to the Hadoop implementation of MapReduce and to the MapReduce framework itself.

4.2.1 Single-pass Learning

The single-pass learning algorithms described in the previous section are clearly amenable to the MapReduce framework. We focus here on the task of generating a syntactic translation model from a set of sentence pairs, their word-level bilingual alignments, and their syntactic structure (phrase structure trees). This task has several properties that distinguish it from benchmarking applications like sort and grep.

• The map stage processes several concurrent data sources (sentences, alignments, and trees) that combine to form a complex data structure for each example.

• The map stage requires non-trivial computation, with potentially exponential asymptotic running time in the size of the syntax tree (limited in practice to be linear with a moderate coefficient). Additionally, the 4,000-line code base can consume tens of megabytes of memory during the processing of one example.

• Hundreds of output keys can be generated for each training example.

Figure 3: Generating syntactic translation models with Hadoop: (left) The benefit of distributed computation quickly outweighs the overhead of a MapReduce implementation on a 3-node cluster. (right) Exporting the data to the distributed file system incurs a cost nearly equal to that of the computation itself.

Figure 3 shows the running times for various input sizes, demonstrating the overhead of running MapReduce relative to the reference implementation. The cost of running Hadoop cancels out some of the benefit of parallelizing the code. Specifically, running on 3 machines gave a speed-up of 39% over the reference implementation, while the overhead of simulating a MapReduce computation on a single machine was 51% of the compute cost of the reference implementation. Distributing the task to a large cluster would clearly justify this overhead, but parallelizing to two machines would give virtually no benefit for the largest data set size we tested.
A more promising metric shows that as the size of the data scales, the distributed MapReduce implementation maintains a low per-datum cost. We can isolate the variable-cost overhead of each example by comparing the slopes of the curves in Figure 3 (left), which are all near-linear. The reference implementation shows a variable computation cost of 1.7 seconds per 1000 examples, while the distributed implementation across three machines shows a cost of 0.5 seconds per 1000 examples. So, the variable overhead of MapReduce is minimal for this task, while the static overhead of distributing the code base and channeling the processing through Hadoop's infrastructure is large. We would expect that substantially increasing the size of the training set would accentuate the utility of distributing the computation.

Thus far, we have assumed that the training data was already written to Hadoop's distributed file system (DFS). The cost of distributing data is relevant in this setting, however, due to drawbacks of Hadoop's implementation of DFS. In the simple case of text processing, a training corpus need only be distributed to Hadoop's file system once, and can then be processed by many different applications. This application, on the other hand, references four different input sources, including sentences, alignments, and parse trees for each example. When copying these resources independently to the DFS, the Hadoop implementation gives no control over how those files are mapped to remote machines. Hence, no one machine necessarily contains all of the data necessary for a given example. Instead of distributing the training corpus in its raw, multi-file form, we must bundle the relevant parts of each example into a complex object and then transfer that object onto the Hadoop file system, thereby guaranteeing that the parts of each training example remain grouped.

Having to repackage the data in this way has negative consequences: changes to the implementation of the application require redistribution of the training data, a costly activity, and applications that share some resources but not all will require redundant copies of their common subset. In sum, more transfers to the DFS will be required during the typical course of research. Figure 3 (right) shows the cost breakdown of first distributing the training data to the DFS, then extracting statistics from it. When we include the cost of distribution, the parallel implementation actually underperforms the single-threaded reference. Because the training data must be bundled in an application-specific format, one can expect to incur this distribution cost regularly.

4.2.2 Iterative Learning

Implementing two iterative learning algorithms gave rise to more challenges than the single-pass example. As mentioned in the previous section, the map stage of each iteration depends upon the parameters generated by the reduce phase of the previous iteration.
We chose two example applications: unsupervised word alignment for machine translation, an application of EM that generates word-to-word alignments, and the k-means algorithm, an unsupervised clustering technique based on hard EM.

The word alignment task accentuates the problem of sharing state information because each pair of word types that co-occur in a training example has its own parameter that must be referenced during the map stage (E-step) of the algorithm. For a typical training example of two 40-word sentences, 1,600 of these parameters must be available to the map function. Bundling the relevant state information with each training example adds substantially to the size of the training set: a pair of 40-word sentences would be accompanied by 6 KB of parameter data on top of its 1 KB of text. Even if parameters are bundled with small groups of sentences, the overhead of bundling parameters is high (Figure 4, left).

Rather than bundling parameters with the input data, we would prefer to broadcast these parameters to all map nodes before each iteration of EM and reference them from within the map tasks on those machines. However, not only does Hadoop not support static references to shared content across map tasks, its implementation also prevented us from adding this feature: in Hadoop, each map task runs in its own Java Virtual Machine (JVM), leaving no access to a shared memory space across multiple map tasks running on the same node. Hadoop does provide some minimal support for sharing parameters in that its streaming mode⁴ allows the distribution of a file to all compute nodes (once per machine) that can be referenced by a map task. In this way, data can be shared across map tasks without incurring redundant transfer costs.
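To illustrate the broadcast pattern we advocate, here is a minimal sketch (ours, not Hadoop code) of one k-means (hard-EM) iteration in which the current centroids are broadcast to every map task instead of being bundled with the input data; the one-dimensional toy data are an illustrative assumption.

```python
from collections import defaultdict

def map_assign(shard, centroids):
    """Map task (E-step): assign each point in this shard to the nearest
    broadcast centroid, emitting (cluster_id, (partial_sum, count))."""
    for x in shard:
        cid = min(range(len(centroids)), key=lambda c: (x - centroids[c]) ** 2)
        yield cid, (x, 1)

def reduce_update(cid, partials):
    """Reduce task (M-step): sum the sufficient statistics and normalize
    to produce the updated centroid for one cluster."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return cid, total / count

def kmeans_iteration(shards, centroids):
    """One MapReduce pass: broadcast the centroids, map, shuffle, reduce."""
    grouped = defaultdict(list)  # simulated shuffle, keyed by cluster id
    for shard in shards:
        for cid, partial in map_assign(shard, centroids):
            grouped[cid].append(partial)
    updated = list(centroids)  # clusters that receive no points keep their centroid
    for cid, partials in grouped.items():
        updated[cid] = reduce_update(cid, partials)[1]
    return updated

shards = [[0.0, 1.0, 2.0], [9.0, 10.0, 11.0]]
centroids = kmeans_iteration(shards, [0.0, 10.0])
```

Chaining iterations amounts to calling `kmeans_iteration` repeatedly, each call corresponding to one MapReduce job whose reduce output becomes the broadcast state of the next.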
However, this work-around requires that the map task be written as a separate application, eliminating the type-safety guarantees of an integrated Java program. Furthermore, it still requires that each map task load a distinct copy of its parameters into the heap of its JVM.

⁴ The Hadoop streaming API allows arbitrary applications to compute the map and reduce functions by reference to the operating system.

We extended Hadoop to provide similar support for Java-internal map and reduce functions. Common state information is serialized and distributed to all compute nodes, where it can be loaded and referenced statically. In particular, we support the dispersion of state information via a networked file server to which all nodes have access. Note that this state file sharing is independent of Hadoop's distributed file system. We were still constrained to reload the state data for each map task.

Figure 4: Benefits of Data Broadcasting: (left) The improved (smaller) ratio of relevant parameters to training words when batch-processing data argues for parameter broadcasting over bundling. (right) Bundling query data with training examples rather than broadcasting hinders a classifier's performance.

4.2.3 Query-based Learning with Distance Metrics

To investigate whether Hadoop's MapReduce implementation is appropriate for distance-based learning algorithms, we implemented a nearest-neighbor classifier to predict the genre of a movie from reviews provided by Netflix users.⁵ Like iterative learning, the query-based setting requires that we distribute common data to the map tasks: in this case, the query example. Figure 4 (right) demonstrates the benefit of statically broadcasting the query with our extension, rather than bundling the query with each training example.

5 Tailoring MapReduce for Machine Learning

5.1 Broadcasting and Parallel Files

MapReduce holds great potential for parallelizing machine learning applications.
However, section 4.2 raised two major issues with the Hadoop implementation of MapReduce that we hope to address.

First, we advocate a shared space for the map tasks on a compute node. Neither the MapReduce framework nor the Hadoop implementation supports efficient distribution and referencing of common data (such as model parameters or query examples). The ability to broadcast state information, whether filtered for each input or common across all map tasks, is necessary to support many machine learning algorithms. The Hadoop streaming API provides a useful answer to this gap in functionality by allowing the user to include a file that is distributed to each compute node; this file can store the necessary state information and is available to each map function. Our implementation of similar functionality for Java-internal map and reduce functions provides the means to distribute state information in an efficient manner. However, the state information can only be shared across multiple map tasks via reference to the file system, because each task runs in its own virtual machine. Ideally, a MapReduce implementation for machine learning would expose shared memory to all map tasks on a compute node to eliminate this overhead.

Second, we desire the ability to tie parallel files together on the distributed file system and to allow the input reader for MapReduce tasks to parse multiple parallel files. Co-locating mutually referential file segments in this way would eliminate the need to bundle input data in application-specific formats. For instance, text could be decoupled from its many possible annotations (parse trees, part-of-speech tags, and word alignments, for example) on the DFS. With this functionality, a data set could reside permanently on the DFS, shared across multiple applications that would perhaps combine it with other related data on an ad hoc basis. Implementing support for tying together parallel files proved challenging within the Hadoop source code: much of the existing code assumes that input data is packaged in one file that can be distributed across the network independently of other data.

⁵ This classification approach makes the assumption that people tend to review many movies of the same genre, and express rating preferences that are genre-dependent.
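The bundling workaround discussed in section 4.2.1 (the alternative to tying parallel files together) can be sketched as follows; this is a toy illustration with hypothetical field names, using JSON serialization as an assumed stand-in for an application-specific record format.

```python
import json

def bundle_examples(sentences, alignments, trees):
    """Zip the parallel per-example resources into one serialized record
    apiece, so that each example's parts stay grouped on the DFS."""
    for sent, align, tree in zip(sentences, alignments, trees):
        yield json.dumps({"sentence": sent, "alignment": align, "tree": tree})

def unbundle(record):
    """Input reader for the map task: recover one example's parts."""
    d = json.loads(record)
    return d["sentence"], d["alignment"], d["tree"]

sentences = ["le chat ||| the cat"]
alignments = [[[0, 0], [1, 1]]]  # word-to-word alignment links
trees = ["(NP (DT the) (NN cat))"]

records = list(bundle_examples(sentences, alignments, trees))
sent, align, tree = unbundle(records[0])  # the parts arrive together
```

The drawback identified above is visible even in this sketch: any change to the record schema forces the entire training set to be re-bundled and redistributed, whereas co-located parallel files could evolve independently.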
5.2 Static Type Checking and Convenience Methods

Hadoop proved somewhat difficult to use, for reasons that we have also addressed during the course of this project. First, the architecture expects that a Hadoop binary will be invoked from the command line for all MapReduce jobs, which unpacks a referenced jar containing the map and reduce functions. The command-line tools also awkwardly decouple manipulating the DFS from running MapReduce jobs. We prefer an architecture that invokes Hadoop DFS and MapReduce tasks from within our own applications, integrating the standard sequence of serializing and loading data onto the DFS, running jobs, and retrieving distributed output. We have built a task manager that conveniently provides this array of functionality, along with translation between Hadoop's proprietary serialization and the Java standard.

Our task manager also provides another missing feature: proper static type checking. Hadoop's heavy use of Java's reflection facilities leaves many type errors unchecked until runtime. Through our lightweight wrapper package for Hadoop, which manages these interactions, we enforce static type checking across data distribution, map and reduce functions, and data retrieval. All unchecked type casts and reflection operations are encapsulated within our package, shielding users from the risks of calling Hadoop's functions directly.

Finally, we provide facilities for launching MapReduce jobs remotely without requiring that the request originator install the array of binaries that are bundled with Hadoop; using our package, no installation is required to originate jobs. These convenience improvements allowed for the rapid development of several machine learning applications. We hope to release our package to the Berkeley community in short order.
6 Conclusion

By virtue of its simplicity and fault tolerance, MapReduce proved to be an admirable gateway to parallelizing machine learning applications. The benefits of easy development and robust computation did come at a price in terms of performance, however, and proper tuning of the system led to substantial variance in performance, as we saw in section 3 (Benchmarks).

Defining common settings for machine learning algorithms led us to discover the core shortcomings of Hadoop's MapReduce implementation. We were able to address the most significant: the need to broadcast data to map tasks. We also greatly improved the convenience of running MapReduce jobs from within existing Java applications. However, the ability to tie together the distribution of parallel files on the DFS remains an outstanding challenge.

All in all, MapReduce represents a promising direction for future machine learning implementations. As Hadoop and the tools that surround it continue to develop, we must strive to minimize the compromises between convenience and performance, providing a platform that allows for both efficient processing and rapid application development.

References

[1] Cheng-Tao Chu et al. Map-Reduce for machine learning on multicore. In NIPS, 2007.

[2] Doug Cutting et al. About Hadoop. http://lucene.apache.org/hadoop/about.html.

[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In ACM OSDI, 2004.

[4] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In ACM SOSP, 2003.

[5] Jim Gray. Sort benchmark homepage. http://research.microsoft.com/barc/SortBenchmark.

[6] Jim Gray et al. Scientific data management in the coming decade. Technical Report MSR-TR-2005-10, Microsoft Research, 2005.