3. Recall
•As the Map operation is parallelized, the input file set is first split into several pieces called FileSplits.
•Then a new map task is created per FileSplit.
•When an individual map task starts, it will open a new output writer per configured reduce task.
•It will then proceed to read its FileSplit using the RecordReader it gets from the specified InputFormat.
•The InputFormat parses the input and generates key-value pairs.
•As key-value pairs are read from the RecordReader, they are passed to the configured Mapper.
•The user-supplied Mapper does whatever it wants with the input pair and calls OutputCollector.collect with key-value pairs of its own choosing (a minimal mapper sketch follows this list). The map output is written into a SequenceFile.
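A minimal sketch of such a mapper, using the old org.apache.hadoop.mapred API that OutputCollector.collect belongs to; the class name and tokenization are illustrative, not from the slides:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Receives one key-value pair per record produced by the RecordReader
// (here: byte offset and line of text) and emits pairs of its own choosing.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        output.collect(word, ONE);  // key-value pairs of the mapper's choosing
      }
    }
  }
}
```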
4. Partitioner
•When Mapper output is collected it is partitioned, which means that it will be written to the output specified by the Partitioner.
•Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers.
•In other words, the partitioner specifies the task to which an intermediate key-value pair must be copied.
•The simplest partitioner involves computing the hash value of the key and then taking that value modulo the number of reducers (see the sketch after this list).
•This assigns approximately the same number of keys to each reducer (dependent on the quality of the hash function).
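A minimal sketch of this hash-mod partitioner, modeled on Hadoop's built-in HashPartitioner:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns a key-value pair to a reducer by hashing only the key.
public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the modulo result is non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```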
5. Imbalance in partitioning
•The partitioner only considers the key and ignores the value.
•Therefore, a roughly even partitioning of the key space may nevertheless yield large differences in the number of key-value pairs sent to each reducer.
•This imbalance in the amount of data associated with each key is relatively common in many text processing applications due to the Zipfian distribution of word occurrences.
•Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.
•Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
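Stated as a formula (a standard formulation of the law, not taken from the slide), with f(r) the frequency of the word of rank r:

```latex
f(r) \propto \frac{1}{r}
\quad\Longrightarrow\quad
\frac{f(1)}{f(2)} \approx 2, \qquad
\frac{f(1)}{f(3)} \approx 3, \qquad \ldots
```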
6. Combiner
•Combiners are an optimization in MapReduce that allow for local aggregation before the shuffle and sort phase.
•When the map operation outputs its pairs they are already available in memory.
•For efficiency reasons, sometimes it makes sense to take advantage of this fact by supplying a combiner class to perform a reduce-type function.
•If a combiner is used, then the map key-value pairs are not immediately written to the output.
•Instead they will be collected in lists, one list per key.
•When a certain number of key-value pairs have been collected, this buffer is flushed by passing all the values of each key to the combiner's reduce method and outputting the key-value pairs of the combine operation as if they were created by the original map operation (a toy sketch of this buffering follows the list).
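A toy sketch of the buffering just described, assuming integer values and a sum combine function; this illustrates the idea only and is not Hadoop's actual implementation (the FLUSH_THRESHOLD constant stands in for the real flush condition):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Map-side buffering: collect values per key and flush them through a
// reduce-style combine function once the buffer grows large enough.
public class CombineBuffer {
  private static final int FLUSH_THRESHOLD = 10_000;  // illustrative limit
  private final Map<String, List<Integer>> buffer = new HashMap<>();
  private int pairs = 0;

  void collect(String key, int value) {
    buffer.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    if (++pairs >= FLUSH_THRESHOLD) {
      flush();
    }
  }

  void flush() {
    for (Map.Entry<String, List<Integer>> e : buffer.entrySet()) {
      // Pass all values of each key to the combine function and emit
      // the result as if the map operation had produced it directly.
      int combined = e.getValue().stream().mapToInt(Integer::intValue).sum();
      emit(e.getKey(), combined);
    }
    buffer.clear();
    pairs = 0;
  }

  void emit(String key, int value) {
    System.out.println(key + "\t" + value);  // stand-in for the real output path
  }
}
```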
8. Example
•A word count MapReduce application whose map operation outputs (word, 1) pairs as words are encountered in the input can use a combiner to speed up processing.
•A combine operation will start gathering the output in in-memory lists (instead of on disk), one list per word.
•Once a certain number of pairs is output, the combine operation will be called once per unique word with the list available as an iterator.
•The combiner then emits (word, count-in-this-part-of-the-input) pairs (see the sketch after this list).
•From the viewpoint of the Reduce operation this contains the same information as the original Map output, but there should be far fewer pairs output to disk and read from disk.
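A minimal sketch of such a combiner in the newer org.apache.hadoop.mapreduce API; the class name is illustrative. It sums the partial counts for each word and emits pairs that look exactly like map output:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the (word, 1) pairs emitted by the mapper into
// (word, count-in-this-part-of-the-input) pairs.
public class WordCountCombiner
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    result.set(sum);
    // Output has the same shape as map output, so the reducer cannot
    // tell whether a combiner ran or not.
    context.write(word, result);
  }
}
```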
9. When combiner is used
•A combiner helps to reduce the amount of data transferred from the mapper to the reducers, saving as much precious bandwidth as possible.
•In many cases it can dramatically increase the efficiency of a MapReduce job.
•Because the combiner is an optimization, Hadoop does not guarantee how many times it will be called.
•It can be executed zero, one, or many times, so a given MapReduce job should not depend on combiner executions and should always produce the same results.
•How many times the combiner is called depends on the size of a particular intermediate map output and on the value of the min.num.spills.for.combine property (see the sketch after this list).
•By default, 3 spill files produced by a mapper are needed for the combiner to run, if a combiner is specified.
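A hedged sketch of setting this property programmatically; min.num.spills.for.combine is the older property name used on the slide (newer Hadoop releases are believed to use mapreduce.map.combine.minspills for the same setting, so check your version's documentation):

```java
import org.apache.hadoop.conf.Configuration;

public class CombinerSpillConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Run the combiner only once a map task has produced
    // at least 5 spill files (the default threshold is 3).
    conf.setInt("min.num.spills.for.combine", 5);
    // Assumption: newer releases expose this as
    // "mapreduce.map.combine.minspills".
  }
}
```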
10. Commutative and associative
•A binary operation is commutative if changing the order of the operands does not change the result.
•A binary operation on a set S is called commutative if: x * y = y * x for all x, y in S
•A binary function f : A x A -> B is called commutative if: f(x,y) = f(y,x) for all x, y in A
•A binary operation on a set S is called associative if it satisfies the associative law: (x * y) * z = x * (y * z)
•Examples of functions that are both commutative and associative include: sum(), max().
•What about mean? median? (See the worked example below.)
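A quick worked check, not on the original slide, showing why mean cannot simply be reused as a combiner: the mean of partial means differs from the overall mean when the parts have different sizes.

```latex
\[
\operatorname{mean}(1,2,3) = 2,
\qquad
\operatorname{mean}\bigl(\operatorname{mean}(1,2),\ \operatorname{mean}(3)\bigr)
  = \operatorname{mean}(1.5,\ 3) = 2.25 \neq 2.
\]
```

A common workaround is to have the combiner emit (sum, count) pairs, from which the reducer can compute the exact mean; the median has no comparably simple decomposition.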
11. What can be used as combiner and when?
•If a reduce function is commutative and associative, it can also be used as a combine function, so there is no need to implement new code just for a combiner.
•In most cases, reduce functions are commutative and associative, so the combiner can be the same as the reducer,
•so incorporating it into a MapReduce job is as simple as specifying the combiner class on the Job object (JobConf in the old API), as in the sketch after this list.
•This results in a temptation to always use a combiner without ensuring that it actually brings the expected benefits.
•One should remember that combiner calls require additional serialization and deserialization of intermediate data, and this takes some time.
•Ideally, this time should be significantly shorter than the time that is saved on transferring the data locally aggregated by combiners.
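A minimal driver sketch for the word-count job in the new API; WordCountMapper and WordCountReducer are hypothetical class names standing in for the application's own mapper and reducer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);    // hypothetical mapper class
    // Reuse the reducer as the combiner: summing counts is
    // commutative and associative, so this is safe.
    job.setCombinerClass(WordCountReducer.class); // hypothetical reducer class
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  }
}
```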
12. Rule of thumb?
•According to the "grid patterns" discussed on the YDN Hadoop Blog, applications that cannot aggregate the map-output bytes by 20-30% should not be using combiners.
•The global ratio of aggregation can be calculated using two built-in counters: the combiner input records counter and the combiner output records counter.
•Hence, if you use combiners, always have a quick look at these counters to ensure that the combiner provides the right level of aggregation (a sketch of reading them follows).
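A hedged sketch of reading these counters through the new API's TaskCounter enum after the job completes:

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

// Compare combiner input/output record counts to estimate how much
// aggregation the combiner achieved for a completed job.
public class CombinerRatio {
  public static void printRatio(Job job) throws Exception {
    Counters counters = job.getCounters();
    long in  = counters.findCounter(TaskCounter.COMBINE_INPUT_RECORDS).getValue();
    long out = counters.findCounter(TaskCounter.COMBINE_OUTPUT_RECORDS).getValue();
    // If the combiner kept ~70-80% of its input records or more
    // (under 20-30% aggregation), it is probably not paying its way.
    System.out.printf("combiner kept %.1f%% of records%n",
        in == 0 ? 100.0 : 100.0 * out / in);
  }
}
```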