Architectural details of the computing clusters (Hadoop) and designing MapReduce applications.
Matrix multiplication code is here: https://github.com/marufaytekin/MatrixMultiply
Details are on this post: https://lendap.wordpress.com/2015/02/16/matrix-multiplication-with-mapreduce/
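The linked repository implements matrix multiplication as a Hadoop job in Java; as a rough illustration of the same one-pass map/shuffle/reduce scheme, here is a minimal single-machine Python simulation (the function name `matmul_mapreduce` and the dense list-of-lists representation are choices made for this sketch, not taken from the linked code):

```python
from collections import defaultdict

def matmul_mapreduce(A, B):
    """Simulate one-pass MapReduce matrix multiplication C = A * B.

    A is m x n, B is n x p. Map emits <(i, k), (tag, j, value)> pairs,
    the shuffle groups them by the output cell (i, k), and Reduce
    multiplies the A- and B-values that share the inner index j.
    """
    m, n, p = len(A), len(B), len(B[0])

    # Map phase: each A[i][j] is needed for every column k of C,
    # and each B[j][k] for every row i of C.
    pairs = []
    for i in range(m):
        for j in range(n):
            for k in range(p):
                pairs.append(((i, k), ('A', j, A[i][j])))
    for j in range(n):
        for k in range(p):
            for i in range(m):
                pairs.append(((i, k), ('B', j, B[j][k])))

    # Shuffle: group values by key (i, k).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce phase: for each cell (i, k), match A- and B-values on the
    # inner index j and sum the products.
    C = [[0] * p for _ in range(m)]
    for (i, k), values in groups.items():
        a = {j: v for tag, j, v in values if tag == 'A'}
        b = {j: v for tag, j, v in values if tag == 'B'}
        C[i][k] = sum(a[j] * b[j] for j in a)
    return C
```

On a real cluster the three phases run as distributed tasks and the shuffle is performed by the framework; here they are just loops over in-memory lists.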
4. Important Examples
• The ranking of Web pages by importance, which involves an iterated matrix-vector multiplication where the dimension is many billions.
• Searches in social-networking sites, which involve graphs with hundreds of millions of nodes and many billions of edges.
• Processing large amounts of text or streams, such as news recommendation.
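The first example above, iterated matrix-vector multiplication, can be sketched as a toy power-iteration loop. This sketch assumes the matrix and vector fit in memory; at Web scale each multiplication would itself be a distributed MapReduce job, and the function name `iterate_matvec` is illustrative only:

```python
def iterate_matvec(M, v, steps):
    """Repeatedly multiply vector v by matrix M, as in PageRank-style
    ranking. M is a square matrix as a list of rows; each step computes
    v' = M v."""
    for _ in range(steps):
        v = [sum(M[i][j] * v[j] for j in range(len(v)))
             for i in range(len(M))]
    return v
```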
5. New software stack
• Not a “supercomputer” (Beowulf etc.)
• “Computing clusters” – large collections of commodity hardware, including conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches.
6. Distributed File System
• A new form of file system that features much larger units than the disk blocks in a conventional operating system.
• Files can be enormous, possibly terabytes in size.
• Files are rarely updated.
14. MapReduce
In brief, a MapReduce computation executes as follows:
• Chunks from a DFS are given to Map tasks.
• These Map tasks turn the chunks into sequences of <key, value> pairs.
• The <key, value> pairs from each Map task are collected by a master controller and sorted by key. (Combine)
• The keys are divided among all the Reduce tasks, so all <key, value> pairs with the same key wind up at the same Reduce task.
• The Reduce tasks work on one key at a time, process all the values for that key, and output the results as <key, value> pairs.
17. Word Count
For the given sample input, the first map emits:
< Hello, 1 >
< World, 1 >
< Bye, 1 >
< World, 1 >
The second map emits:
< Hello, 1 >
< Hadoop, 1 >
< Goodbye, 1 >
< Hadoop, 1 >
Combiner (after being sorted on the keys):
The output of the first map:
< Bye, 1 >
< Hello, 1 >
< World, 2 >
The output of the second map:
< Goodbye, 1 >
< Hadoop, 2 >
< Hello, 1 >
18. Word Count
Thus the output of the job is:
< Bye, 1 >
< Goodbye, 1 >
< Hadoop, 2 >
< Hello, 2 >
< World, 2 >
The Reducer implementation's reduce method simply sums the values, which are the occurrence counts for each key.
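The whole example can be replayed in a few lines of Python. The two sample input strings below are reconstructed from the pairs the maps are shown emitting, and the function names (`map_task`, `combine`, `reduce_task`) are illustrative, not Hadoop's Mapper/Combiner/Reducer API:

```python
from collections import Counter

def map_task(text):
    """Map: emit a <word, 1> pair for every word in the chunk."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Combiner: sort a single map's output by key and pre-sum
    the counts locally, before the shuffle."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

def reduce_task(map_outputs):
    """Reduce: sum the per-map counts for each key across all maps."""
    totals = Counter()
    for pairs in map_outputs:
        for word, n in pairs:
            totals[word] += n
    return sorted(totals.items())

first = combine(map_task("Hello World Bye World"))
second = combine(map_task("Hello Hadoop Goodbye Hadoop"))
job_output = reduce_task([first, second])
```

Here `first` and `second` match the per-map combiner outputs listed above, and `job_output` matches the final job output of slide 18.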