23. Filtering Top Ten
Mapper.setup(): initialize a sorted list
Mapper.map(key, record):
insert record into list
truncate list to 10
Mapper.cleanup():
for records in the list: emit null, record
Reducer.reduce(key, records):
same as in the mappers: keep only the top ten, then emit them
25. Structured to Hierarchical
Mappers on dataset1 send to Reducers:
Ids, Records of Type1
Mappers on dataset2 send to Reducers:
Parent Ids, Records of Type 2
30. Reduce-Side Joins
With Secondary Sort
TableAMapper.map:
Emit primary key+’A’, record+’A’
TableBMapper.map:
Emit foreign key+’B’, record+’B’
SortComparator:
Records 'A' before Records 'B'
Reducer:
emit 'A' Record + 'B' Record, null
31. Composite (Merge) Join
Data sets pre-sorted
Data sets partitioned on the same key
CompositeInputFormat in Mappers
32. Total Order Sorting
Job 1:
Data → Mappers → SequenceFile (key, value)
Job 2:
InputSampler
TotalOrderPartitioner(InputSampler)
Identity mapper, reducers
33. Input:
Site1 tag1
Site1 tag2
Site3 tag3
Output - top 10 similar sites per site, (secondary) sorted
Site1 Similar1 count-of-common-tags
Site1 Similar2 count-of-common-tags
Site2 Similar1 count-of-common-tags
Millions of sites
Some tags appear in thousands of sites
What is input/output of each mapper/reducer?
Hint – chain jobs
Editor's Notes
Not an overview of Hadoop
Algorithmic template – for Distributed Batch Processing
Flexible, bad for iterative algorithms
Google Paper 2004
Blocks are Mappers, Reducers, NOT CALLS
Where? - In Hadoop implementation Mappers, Reducers are JVMs in cluster
When? – mapreduce.job.reduce.slowstart.completedmaps, 5% def.
How many?
Buffer in RAM – spill after 80% of io.sort.mb (100 MB def.); maps block if the buffer fills during a spill
Partition, Sort & Spill to disk – can combine (if a Combiner is specified)
Pulled by Reducers - (HTTP, Netty)
How to write a MapReduce job?
Pivotal HD
IBM - BigInsight
Google Papers
Yahoo
CAP – pick two
Big Blocks – seek time; Too Big – concurrency
Replicated – Cheap commodity
Task Tracker – data locality
AppMaster in Container in NodeManager (MRAppMaster)
No slots => containers differ in RAM size/cores etc. and can run anything
Flexibility – cluster utilization
MRAppMaster
Uber task
Shuffle Service of YARN
Cleanup & Setup
Sent with status updates
context.getCounter(counterGroupName, counterName).increment(1)
Driver collects outputs when job completes:
for (Counter counter : job.getCounters().getGroup(counterGroupName)) {
System.out.println(counter.getDisplayName() + "\t" + counter.getValue());
}
Output file per mapper: part-m-00000 ('m' instead of the 'r')
Optional: Identity Reducer → one output file (hot spot, performance suffers)
Parameters for BloomFilter construction:
public static int getOptimalBloomFilterSize(int numElements, float falsePosRate) {
    // m = -n * ln(p) / (ln 2)^2  (bits in the bit vector)
    return (int) (-numElements * (float) Math.log(falsePosRate)
            / Math.pow(Math.log(2), 2));
}
public static int getOptimalK(float numElements, float vectorSize) {
    // k = round(m / n * ln 2)  (number of hash functions)
    return (int) Math.round(vectorSize * Math.log(2) / numElements);
}
NOTE: Emits from mappers only in CLEANUP
SELECT * FROM table ORDER BY col1 LIMIT 10;
Mapper.setup():
initialize top ten sorted list (e.g. TreeMap)
Mapper.map(key, record):
insert record into top ten sorted list
truncate the list to a length of 10
Mapper.cleanup():
for record in top sorted ten list:
emit null,record
Reducer.reduce(key, records):
emit top ten record (e.g. use TreeMap)
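A minimal plain-Java sketch of the per-mapper logic above; ranking by record length is a toy stand-in for the real ordering criterion, and note the usual caveat that a TreeMap keeps one record per score (ties overwrite):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Per-mapper top-ten maintenance: keep at most 10 entries in a TreeMap
// ordered by the ranking value; once it exceeds 10, evict the smallest.
public class TopTenList {
    public static TreeMap<Integer, String> topTen(Iterable<String> records) {
        TreeMap<Integer, String> top = new TreeMap<>();
        for (String record : records) {
            int score = record.length();    // toy ranking criterion
            top.put(score, record);         // NOTE: equal scores overwrite
            if (top.size() > 10) {
                top.remove(top.firstKey()); // truncate the list to 10
            }
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 1; i <= 15; i++) {
            records.add("x".repeat(i));     // records of lengths 1..15
        }
        // keeps the ten longest: lengths 6..15
        System.out.println(topTen(records).keySet());
    }
}
```

The reducer does the same with the (at most 10 × number-of-mappers) records it receives under the single null key.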
SELECT DISTINCT * FROM table
Use Combiners
Data Organization Patterns
“Join” to XML
<department><employee/><employee/></department>
MultipleInputs – assign Mappers to Directories
Many Type2 On 1 Type1 → Reducer Hot Spot
Uses:
Partition Pruning by date or by category
Sharding
Binning – Partitioning in Mappers
Use a derived class of MultipleOutputs for the exact output format
Pros:
No reducers (performance); not really MapReduce
Cons:
Number of output files = Number of Bins * Number of Mappers
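A hedged Hadoop sketch of binning in a map-only job; the named output "bins" and getCategory() are placeholders for the real setup:

```java
// Map-only binning (sketch): each mapper writes records into per-category
// files via MultipleOutputs; "bins" and getCategory() are placeholders.
public static class BinningMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
    private MultipleOutputs<Text, NullWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String category = getCategory(value);   // placeholder
        // base output path: one set of files per bin, per mapper
        mos.write("bins", value, NullWritable.get(), category + "/part");
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close();
    }
}
// Driver side:
//   MultipleOutputs.addNamedOutput(job, "bins",
//           TextOutputFormat.class, Text.class, NullWritable.class);
//   job.setNumReduceTasks(0);  // map-only, hence bins x mappers output files
```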
SELECT * FROM data ORDER BY RAND()
No hotspots
All but one table must fit in RAM (JVM heap)
The large data set is Left Table
Inner or Left Outer Join (Unmatched records from Left Table go to the output)
MultipleInputs
TableAMapper.map adds 'A' to both output key and value
TableBMapper.map adds 'B' to both output key and value
Map: output key – primary key for A, foreign for B + tag
Secondary sort puts a Record 'A' before Records 'B'
Reducer emits 'A' Records matched with 'B' Records
Only 'A' records in RAM
The right way – with secondary sort
Outer joins: emit even if only one type of Records present
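A plain-Java simulation of the reducer's merge step for one join key, assuming the secondary sort delivers all 'A'-tagged values before any 'B'-tagged value (this sketch does the inner join only; a left outer join would also emit buffered 'A' records that saw no 'B'):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulates reduce() in a reduce-side inner join with secondary sort:
// 'A' records arrive first and are buffered; 'B' records are streamed
// and matched against the buffer, so only the 'A' side sits in RAM.
public class ReduceSideJoinMerge {
    public static List<String> joinOneKey(List<String> taggedValues) {
        List<String> aRecords = new ArrayList<>(); // only 'A' records buffered
        List<String> joined = new ArrayList<>();
        for (String v : taggedValues) {
            if (v.startsWith("A")) {
                aRecords.add(v.substring(1));      // strip the 'A' tag
            } else {
                String b = v.substring(1);         // a 'B' record: stream it
                for (String a : aRecords) {
                    joined.add(a + "," + b);       // emit 'A' record + 'B' record
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // sorted value stream for one key: the 'A' record comes first
        List<String> values = Arrays.asList("Aalice", "Border1", "Border2");
        System.out.println(joinOneKey(values)); // [alice,order1, alice,order2]
    }
}
```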
Many large inputs
Map side only (no Reducers) – not really MapReduce
Data sets sorted and partitioned on the same key
All data sets have the same number of partitions
All records for a key must be in 1 partition (GZIP is OK)
CompositeInputFormat
Number of output files = number of map tasks
Performance: no file locality for splits of both tables
Performance: data preparation needs
Parallel - Multiple Reducers (otherwise trivial)
Input of the second job: the SequenceFile
The second job: job.setPartitionerClass(TotalOrderPartitioner.class);
“pivot of QuickSort”: InputSampler.writePartitionFile(job, new InputSampler.RandomSampler(.001, 10000));
job.addCacheFile(partitionFileUri); // distribute the partition file the sampler wrote, not the sampler itself
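Putting the pieces together as a hedged driver sketch for job 2; stagingDir and numReducers are placeholders, and the default Mapper and Reducer base classes act as identities:

```java
// Job 2 of total order sorting (sketch; stagingDir / numReducers are placeholders)
Job job = Job.getInstance(conf, "total order sort");
job.setInputFormatClass(SequenceFileInputFormat.class);
FileInputFormat.addInputPath(job, stagingDir);      // output of job 1
job.setNumReduceTasks(numReducers);
job.setPartitionerClass(TotalOrderPartitioner.class);

Path partitionFile = new Path(stagingDir, "_partitions");
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
// "pivot of QuickSort": sample ~0.1% of input keys (10k max) for boundaries
InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<>(0.001, 10000));
// distribute the partition file, not the sampler itself
job.addCacheFile(partitionFile.toUri());
```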