SlideShare a Scribd company logo
1 of 13
Download to read offline
Combiner and Partitioner
Abstract MapReduce Framework
Recall 
•As the Map operation is parallelized the input file set is firstsplit to several pieces calledFileSplits. 
•Then a new map task is created per FileSplit. 
•When an individual map task starts it will open a new outputwriter per configured reduce task. 
•It will then proceed to readits FileSplitusing theRecordReaderit gets from the specifiedInputFormat. 
•InputFormatparses the input and generateskey-value pairs. 
•As key-value pairs are read from the RecordReaderthey arepassed to the configuredMapper. 
•The user supplied Mapperdoeswhatever it wants with the input pair and callsOutputCollector.collectwith key-value pairs of its own choosing. The Map output is written into a SequenceFile.
Partitioner 
•When Mapperoutput is collected it is partitioned, which meansthat it will be written to the output specified by thePartitioner. 
•Partitionersare responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers. 
•In other words, the partitionerspecifies the task to which an intermediate key-value pair must be copied. 
•simplest partitionerinvolves computing the hash value of the key and then taking the mod of that value with the number of reducers. 
•This assigns approximately the same number of keys to each reducer (dependent on the quality of the hash function).
Imbalance in partitioning 
•The partitioneronly considers the key and ignores the value 
•Therefore, a roughly-even partitioning of the key space may nevertheless yield large differences in the number of key-values pairs sent to each reducer 
•This imbalance in the amount of data associated with each key is relatively common in many text processing applications due to the Zipfiandistribution of word occurrences. 
•Zipf'slaw states that given somecorpusofnatural languageutterances, the frequency of any word isinversely proportionalto its rank in the frequency table. 
•Thus the most frequent word will occur approximately twice as often as the second most frequentword, three times as often as thethird most frequent word, etc.
Combiner 
•Combiners are an optimization in MapReduce that allow for local aggregation before the shuffle and sort phase. 
•When the mapoperation outputs its pairs they are already available inmemory. 
•For efficiency reasons, sometimes it makes sense totake advantage of this fact by supplying a combiner class to perform a reduce-type function. 
•If a combiner is used then themap key-value pairs are not immediately written to the output. 
•Instead theywill be collected in lists, one list per each key value. 
•When acertain number of key-value pairs have been written, 
•This bufferis flushed by passing all the values of each key to thecombiner's reduce method and 
•outputting the key-value pairs of the combine operation as if they were created by the original mapoperation.
MapReduce with Partitioner and Combiner
Example 
•A word count MapReduce application whose mapoperation outputs (word, 1) pairs as words are encountered inthe input can use a combiner to speed up processing. 
•A combineoperation will start gathering the output in in-memory lists (insteadof on disk), one list per word. 
•Once a certain number ofpairs is output, the combine operation will be called once perunique word with the list available as an iterator. 
•The combinerthen emits (word, count-in-this-part-of-the-input) pairs. 
•Fromthe viewpoint of the Reduce operation this contains the same information as the original Map output, but there should be far fewerpairs output to disk and read from disk.
When combiner is used 
•Combiner helps to reduce the amount of data transferred from themapperto thereducers andsave as much priceless bandwidth as possible. 
•In many cases it can dramatically increase the efficiency of mapreducejob. 
•Because combiner is an optimization, Hadoop does not guarantee how many times a combiner will be called. 
•It can be executed zero, one or many times, so that a given mapreducejob should not depend on the combiner executions and should always produce the same results. 
•How many times combiner is called depends on the size of a particular intermediate map output and a value of min.num.spills.for.combineproperty. 
•By default 3 spill files produced by a mapperare needed for the combiner to run, if combiner is specified.
Commutative and associative 
•Abinary operationiscommutativeif changing the order of theoperandsdoes not change the result. 
•A binary operationon asetSis calledcommutativeif: x * y = y * x for all x, y in S 
•Abinary functionf : A x A -> Bis calledcommutativeif: f(x,y) = f(y,x) for all x,yin A 
•A binary operationon asetSis calledassociativeif it satisfies theassociative law: ( x * y ) * z = x * ( y * z ) 
•Example of functions both commutative and associative include: sum(), max(). 
•What about mean? median?
What can be used as combiner and when? 
•If a reduce function is commutative and associative, it can be also used as a combine function, so that there is no need to implement new code, just for a combiner. 
•In the most cases, reduce functions are commutative and associative, so that combiner can be the same as reducer 
•so that incorporating it into MapReduce job is as simple as specifying combiner class onJobobject (JobConfin old API). 
•It results in temptation to always use combiner without ensuring whether it actually brings expected benefits or not. 
•One should remember that combiner calls requires additional serialization and deserializationof intermediate data and it takes some time. 
•Ideally, this time should be significantly shorter than the time that is saved on transferring data locally aggregated by combiners.
Rule of thumb? 
•According to "grid patterns" discusses atYDN Hadoop Blog, applications that cannot aggregate the map-output bytes by 20- 30% should not be using combiners. 
•The global ratio of aggregation can be calculated using two build-in counters: combiner input records counter and combiner output records counter. 
•Hence, if you use combiners, always do have a quick look at these counters to ensure that the combiner provides a right level of aggregation.
End of session 
Day –2: Combiner and Partitioner

More Related Content

What's hot

Data cube computation
Data cube computationData cube computation
Data cube computation
Rashmi Sheikh
 

What's hot (20)

Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Apache spark
Apache sparkApache spark
Apache spark
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
 
Data cube computation
Data cube computationData cube computation
Data cube computation
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Data Mining: Outlier analysis
Data Mining: Outlier analysisData Mining: Outlier analysis
Data Mining: Outlier analysis
 
Garbage Collection
Garbage CollectionGarbage Collection
Garbage Collection
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Server system architecture
Server system architectureServer system architecture
Server system architecture
 
Data cubes
Data cubesData cubes
Data cubes
 
Greedy algorithms
Greedy algorithmsGreedy algorithms
Greedy algorithms
 
Graph colouring
Graph colouringGraph colouring
Graph colouring
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Data visualization using R
Data visualization using RData visualization using R
Data visualization using R
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & Hadoop
 

Similar to Hadoop combiner and partitioner

module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
TSANKARARAO
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pig
KhanKhaja1
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
Subhas Kumar Ghosh
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
Subhas Kumar Ghosh
 

Similar to Hadoop combiner and partitioner (20)

Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operation
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdf
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
MapReduce
MapReduceMapReduce
MapReduce
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clusters
 
IOE MODULE 6.pptx
IOE MODULE 6.pptxIOE MODULE 6.pptx
IOE MODULE 6.pptx
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pig
 
Design patterns in MapReduce
Design patterns in MapReduceDesign patterns in MapReduce
Design patterns in MapReduce
 
White paper hadoop performancetuning
White paper hadoop performancetuningWhite paper hadoop performancetuning
White paper hadoop performancetuning
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
MapReduce-Notes.pdf
MapReduce-Notes.pdfMapReduce-Notes.pdf
MapReduce-Notes.pdf
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
 

More from Subhas Kumar Ghosh

07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent
Subhas Kumar Ghosh
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
Subhas Kumar Ghosh
 
02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis
Subhas Kumar Ghosh
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
Subhas Kumar Ghosh
 

More from Subhas Kumar Ghosh (20)

07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
01 hbase
01 hbase01 hbase
01 hbase
 
06 pig etl features
06 pig etl features06 pig etl features
06 pig etl features
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)
 
04 pig data operations
04 pig data operations04 pig data operations
04 pig data operations
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis
 
Hadoop Day 3
Hadoop Day 3Hadoop Day 3
Hadoop Day 3
 
Hadoop exercise
Hadoop exerciseHadoop exercise
Hadoop exercise
 
Hadoop map reduce v2
Hadoop map reduce v2Hadoop map reduce v2
Hadoop map reduce v2
 
Hadoop job chaining
Hadoop job chainingHadoop job chaining
Hadoop job chaining
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
 
Hadoop availability
Hadoop availabilityHadoop availability
Hadoop availability
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configuration
 

Recently uploaded

Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 

Recently uploaded (20)

Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 

Hadoop combiner and partitioner

  • 3. Recall •As the Map operation is parallelized the input file set is firstsplit to several pieces calledFileSplits. •Then a new map task is created per FileSplit. •When an individual map task starts it will open a new outputwriter per configured reduce task. •It will then proceed to readits FileSplitusing theRecordReaderit gets from the specifiedInputFormat. •InputFormatparses the input and generateskey-value pairs. •As key-value pairs are read from the RecordReaderthey arepassed to the configuredMapper. •The user supplied Mapperdoeswhatever it wants with the input pair and callsOutputCollector.collectwith key-value pairs of its own choosing. The Map output is written into a SequenceFile.
  • 4. Partitioner •When Mapperoutput is collected it is partitioned, which meansthat it will be written to the output specified by thePartitioner. •Partitionersare responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers. •In other words, the partitionerspecifies the task to which an intermediate key-value pair must be copied. •simplest partitionerinvolves computing the hash value of the key and then taking the mod of that value with the number of reducers. •This assigns approximately the same number of keys to each reducer (dependent on the quality of the hash function).
  • 5. Imbalance in partitioning •The partitioneronly considers the key and ignores the value •Therefore, a roughly-even partitioning of the key space may nevertheless yield large differences in the number of key-values pairs sent to each reducer •This imbalance in the amount of data associated with each key is relatively common in many text processing applications due to the Zipfiandistribution of word occurrences. •Zipf'slaw states that given somecorpusofnatural languageutterances, the frequency of any word isinversely proportionalto its rank in the frequency table. •Thus the most frequent word will occur approximately twice as often as the second most frequentword, three times as often as thethird most frequent word, etc.
  • 6. Combiner •Combiners are an optimization in MapReduce that allow for local aggregation before the shuffle and sort phase. •When the mapoperation outputs its pairs they are already available inmemory. •For efficiency reasons, sometimes it makes sense totake advantage of this fact by supplying a combiner class to perform a reduce-type function. •If a combiner is used then themap key-value pairs are not immediately written to the output. •Instead theywill be collected in lists, one list per each key value. •When acertain number of key-value pairs have been written, •This bufferis flushed by passing all the values of each key to thecombiner's reduce method and •outputting the key-value pairs of the combine operation as if they were created by the original mapoperation.
  • 8. Example •A word count MapReduce application whose mapoperation outputs (word, 1) pairs as words are encountered inthe input can use a combiner to speed up processing. •A combineoperation will start gathering the output in in-memory lists (insteadof on disk), one list per word. •Once a certain number ofpairs is output, the combine operation will be called once perunique word with the list available as an iterator. •The combinerthen emits (word, count-in-this-part-of-the-input) pairs. •Fromthe viewpoint of the Reduce operation this contains the same information as the original Map output, but there should be far fewerpairs output to disk and read from disk.
  • 9. When combiner is used •Combiner helps to reduce the amount of data transferred from themapperto thereducers andsave as much priceless bandwidth as possible. •In many cases it can dramatically increase the efficiency of mapreducejob. •Because combiner is an optimization, Hadoop does not guarantee how many times a combiner will be called. •It can be executed zero, one or many times, so that a given mapreducejob should not depend on the combiner executions and should always produce the same results. •How many times combiner is called depends on the size of a particular intermediate map output and a value of min.num.spills.for.combineproperty. •By default 3 spill files produced by a mapperare needed for the combiner to run, if combiner is specified.
  • 10. Commutative and associative •Abinary operationiscommutativeif changing the order of theoperandsdoes not change the result. •A binary operationon asetSis calledcommutativeif: x * y = y * x for all x, y in S •Abinary functionf : A x A -> Bis calledcommutativeif: f(x,y) = f(y,x) for all x,yin A •A binary operationon asetSis calledassociativeif it satisfies theassociative law: ( x * y ) * z = x * ( y * z ) •Example of functions both commutative and associative include: sum(), max(). •What about mean? median?
  • 11. What can be used as combiner and when? •If a reduce function is commutative and associative, it can be also used as a combine function, so that there is no need to implement new code, just for a combiner. •In the most cases, reduce functions are commutative and associative, so that combiner can be the same as reducer •so that incorporating it into MapReduce job is as simple as specifying combiner class onJobobject (JobConfin old API). •It results in temptation to always use combiner without ensuring whether it actually brings expected benefits or not. •One should remember that combiner calls requires additional serialization and deserializationof intermediate data and it takes some time. •Ideally, this time should be significantly shorter than the time that is saved on transferring data locally aggregated by combiners.
  • 12. Rule of thumb? •According to "grid patterns" discusses atYDN Hadoop Blog, applications that cannot aggregate the map-output bytes by 20- 30% should not be using combiners. •The global ratio of aggregation can be calculated using two build-in counters: combiner input records counter and combiner output records counter. •Hence, if you use combiners, always do have a quick look at these counters to ensure that the combiner provides a right level of aggregation.
  • 13. End of session Day –2: Combiner and Partitioner