MapReduce
DesignPatterns
with
Evgeny Benediktov,
EIS Architecture
MapReduce Scalable
Flexible
No overhead
(K1,V1) -> Map -> (K2,V2)
Shuffle & Sort
(K2,List[V2]) -> Reduce -> (K3,V3)
How does MapReduce work?
Line 1: How many cookies could
Line 2: a good cook cook if a
Line 3: good cook could cook cookies?
WordCount
IN: Offset, Line1
OUT: could, 1
IN: Offset, Line3
OUT: cook, 1
OUT: could, 1
IN: Offset, Line2
OUT: cook, 1
OUT: cook, 1
OUT: if, 1
IN: could, <1, 1>
OUT: could, 2
IN: cook, <1, 1, 1>
OUT: cook, 3
IN: if, <1>
OUT: if, 1
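The WordCount flow above can be sketched in plain Java without a cluster. This is a minimal in-memory simulation, not Hadoop code: the class name `WordCountSketch` and its methods are hypothetical, and the `TreeMap` merely stands in for the Shuffle & Sort phase. The slide shows only some of the emitted pairs, so the full counts differ from it (for example, `cook` occurs four times across the three lines).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the three phases; no Hadoop classes involved.
final class WordCountSketch {
    // Map phase: emit one (word, 1) pair per token in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                out.add(Map.entry(token, 1));
            }
        }
        return out;
    }

    // Shuffle & Sort: group values by key; the TreeMap keeps keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the list of ones collected for a single word.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
            "How many cookies could",
            "a good cook cook if a",
            "good cook could cook cookies?")));
    }
}
```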
Shuffle & Sort
Buffer in RAM
Partition, Sort & Spill to disk
Pulled by Reducers
Merge
MongoDB
Spark
Hadoop
Where is MapReduce implemented?
Distributions
HDFS
MapReduce
Everything Else
What is inside
NameNode
DataNode DataNode DataNode
Append only
64-256MB Blocks
Replicated
HDFS
NameNode
TaskTracker
DataNode
TaskTracker
DataNode
TaskTracker
DataNode
JobTracker
HDFS+MapReduce1
NameNode
Container
NodeManager
DataNode
Container
NodeManager
DataNode
AppMaster
NodeManager
DataNode
ResourceManager
HDFS+MapReduce2
Mapper
Reducer
Partitioner
Combiner
InputFormat
OutputFormat
RecordReader
RecordWriter
Classes
(K2, V2)->(K2, List(V2))
setPartitionerClass
setGroupingComparatorClass
setSortComparatorClass
SecondarySort
MetaData
Client->HDFS->Local FS
DistributedCache
Summarization
Numerical
Summarizations
Inverted Index
Summarizations
Counting with Counter
Filtering
Filtering
Bloom Filtering
Top Ten
Distinct
Data
Organization
Structured to
Hierarchical
Partitioning
Binning
Total Order Sorting
Shuffling
Input and
Output
Generating Data
External Source Output
External Source Input
Partition Pruning
Metapatterns
Job Chaining
Job Merging
Joins
Reduce Side Join
Replicated Join
Composite Join
Cartesian Product
Summarizations
Summarization with Counters
No Reducer
Up to 100
Named
Filtering
map(key, record):
if (keep record) emit key,value
Identity Reducer or None
Output file per mapper
Bloom Filtering
Training: Records → BloomFilter File
Mapper.setup:
DistributedCache→BloomFilter
Mapper.map:
filter.membershipTest
Emit value, null
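A minimal Bloom filter sketch, assuming a `BitSet` and a simple double-hashing scheme. The class `BloomFilterSketch` and its hash mixing are hypothetical and are not Hadoop's own BloomFilter implementation; only the name `membershipTest` is borrowed from the slide. The sizing methods follow the usual formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.

```java
import java.util.BitSet;

// Hypothetical stand-in for the filter that the training step writes to a file.
final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    BloomFilterSketch(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Bit-vector size m for n elements and false-positive rate p.
    static int optimalSize(int numElements, double falsePosRate) {
        return (int) Math.ceil(
            -numElements * Math.log(falsePosRate) / Math.pow(Math.log(2), 2));
    }

    // Number of hash functions k for n elements and m bits.
    static int optimalHashes(int numElements, int size) {
        return Math.max(1, (int) Math.round((double) size / numElements * Math.log(2)));
    }

    // i-th bit position via double hashing; the mixing constant is an arbitrary choice.
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = ((h1 * 0x9E3779B9) >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    // Training: set k bits per record.
    void add(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(item, i));
        }
    }

    // Mapper.map: false means "definitely absent"; true means "possibly present".
    boolean membershipTest(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int m = optimalSize(1000, 0.01);
        int k = optimalHashes(1000, m);
        BloomFilterSketch filter = new BloomFilterSketch(m, k);
        filter.add("alice");
        System.out.println(m + " bits, " + k + " hashes, alice: " + filter.membershipTest("alice"));
    }
}
```

Records that fail the test are certainly not in the training set; records that pass may be false positives, so a later exact check may still be needed.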
Filtering Top Ten
Mapper.setup(): initialize a sorted list
Mapper.map(key, record):
insert record into list
truncate list to 10
Mapper.cleanup():
for records in the list: emit null, record
Reducer.reduce(key, records):
as in mappers
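The Top Ten steps can be sketched as follows. This is an in-memory model under stated assumptions: records are plain integers, each inner list plays the role of one mapper's split, and the names `TopTenSketch`, `topN` and `run` are hypothetical. The same truncation routine serves both the mappers and the single reducer, as the slide suggests.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model: one call to topN per mapper, then one more in the reducer.
final class TopTenSketch {
    // Insert each record into a descending sorted list and truncate to n entries.
    static List<Integer> topN(List<Integer> records, int n) {
        List<Integer> top = new ArrayList<>();
        for (int r : records) {
            top.add(r);
            top.sort(Comparator.reverseOrder());
            if (top.size() > n) {
                top.remove(top.size() - 1); // drop the smallest entry
            }
        }
        return top;
    }

    static List<Integer> run(List<List<Integer>> splits, int n) {
        List<Integer> mapOutput = new ArrayList<>();
        for (List<Integer> split : splits) {
            // Mapper.cleanup: emit the local top n only after the whole split is seen.
            mapOutput.addAll(topN(split, n));
        }
        // Reducer.reduce: same logic over at most n records per mapper.
        return topN(mapOutput, n);
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(List.of(5, 1, 9, 3), List.of(7, 2, 8)), 3));
    }
}
```

Each mapper sends at most n records, so the reducer sees at most n times the number of mappers, regardless of input size.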
Filtering Distinct Values
map(key, record):
emit record,null
reduce(key, records):
emit key
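A small sketch of the Distinct pattern, assuming string records. The class `DistinctSketch` is hypothetical; the `TreeMap` models the shuffle that brings equal keys together, and the values are ignored just as the reducer ignores them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of map -> shuffle -> reduce for distinct values.
final class DistinctSketch {
    static List<String> distinct(List<String> records) {
        // map: emit (record, null); the shuffle groups identical records under one key.
        Map<String, List<Object>> grouped = new TreeMap<>();
        for (String record : records) {
            grouped.computeIfAbsent(record, k -> new ArrayList<>()).add(null);
        }
        // reduce: emit each key once and ignore the list of nulls.
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        System.out.println(distinct(List.of("b", "a", "b", "c", "a")));
    }
}
```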
Structured to Hierarchical
Mappers on dataset1 send to Reducers:
Ids, Records of Type1
Mappers on dataset2 send to Reducers:
Parent Ids, Records of Type 2
Partitioning
Identity Mapper
Identity Reducer
Smart Partitioner:
public int getPartition(IntWritable key, Text value, int numPartitions) {
    return key.get() /*year*/ - minLastAccessDateYear;
}
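The partitioner above maps each year to its own partition. A JDK-only sketch of the same arithmetic follows; the class `YearPartitionerSketch` is hypothetical, and the range check is an added assumption, on the reasoning that a partition index outside 0..numPartitions-1 would be invalid.

```java
// Hypothetical stand-in for the Smart Partitioner; plain ints replace IntWritable.
final class YearPartitionerSketch {
    // Partition index = offset of the year from the earliest year in the data.
    static int getPartition(int year, int minYear, int numPartitions) {
        int partition = year - minYear;
        if (partition < 0 || partition >= numPartitions) {
            throw new IllegalArgumentException("year out of range: " + year);
        }
        return partition;
    }

    public static void main(String[] args) {
        // Four reducers covering 2008..2011, one output file per year.
        for (int year = 2008; year <= 2011; year++) {
            System.out.println(year + " -> partition " + getPartition(year, 2008, 4));
        }
    }
}
```

With this scheme the number of reducers has to equal the number of distinct years.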
Binning
setup:
mos = new MultipleOutputs(context)
map:
if (…) {
mos.write(key, value, BINNAME)
// output files are named BINNAME-m-NNNNN
} else …
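A sketch of what one map task does in the Binning pattern, without MultipleOutputs. The class `BinningSketch`, the tag-based rule in `binFor` and the `other` bin are hypothetical illustrations; the returned map stands for the per-bin files that a single mapper writes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a map-only binning task.
final class BinningSketch {
    // Assumed rule: the bin is the first tag contained in the record, else "other".
    static String binFor(String record, List<String> tags) {
        for (String tag : tags) {
            if (record.contains(tag)) {
                return tag;
            }
        }
        return "other";
    }

    // One mapper: route every record of the split to its bin; no reducer follows.
    static Map<String, List<String>> mapSplit(List<String> split, List<String> tags) {
        Map<String, List<String>> bins = new TreeMap<>();
        for (String record : split) {
            bins.computeIfAbsent(binFor(record, tags), k -> new ArrayList<>()).add(record);
        }
        return bins;
    }

    public static void main(String[] args) {
        System.out.println(mapSplit(
            List.of("hadoop intro", "pig tips", "misc"), List.of("hadoop", "pig")));
    }
}
```

Because every mapper keeps its own set of bins, the job produces up to one file per bin per mapper.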
Shuffling
Mapper.map:
Emit random, record
Reducer.reduce:
Emit record, null
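The Shuffling pattern relies on the framework's sort: attaching a random key to each record and sorting by that key randomizes the record order. A minimal sketch, assuming a seeded `java.util.Random` for reproducibility; the class `ShufflingSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical in-memory model: random keys, sort by key, drop the keys.
final class ShufflingSketch {
    static List<String> shuffle(List<String> records, long seed) {
        Random random = new Random(seed);
        // Mapper.map: emit (random, record).
        List<Map.Entry<Integer, String>> pairs = new ArrayList<>();
        for (String record : records) {
            pairs.add(Map.entry(random.nextInt(), record));
        }
        // Shuffle & Sort orders the pairs by their random key.
        pairs.sort(Map.Entry.comparingByKey());
        // Reducer.reduce: emit (record, null), i.e. keep only the record.
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> pair : pairs) {
            out.add(pair.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shuffle(List.of("a", "b", "c", "d", "e"), 42L));
    }
}
```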
Map-side Join
Mapper.setup:
DistributedCache → Map (Right Table)
Mapper.map:
Read split of Left Table, Join
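The map-side (replicated) join can be sketched with a hash map standing in for the right table loaded in `Mapper.setup`. The class `ReplicatedJoinSketch`, the two-column record layout and the comma-joined output are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model: the small right table lives in memory, the left table is streamed.
final class ReplicatedJoinSketch {
    // Each left record is {joinKey, payload}; right maps joinKey -> payload.
    static List<String> join(List<String[]> left, Map<String, String> right, boolean leftOuter) {
        List<String> out = new ArrayList<>();
        for (String[] record : left) {
            String match = right.get(record[0]);
            if (match != null) {
                out.add(record[0] + "," + record[1] + "," + match);
            } else if (leftOuter) {
                // Left outer join: unmatched left records still reach the output.
                out.add(record[0] + "," + record[1] + ",null");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> left = List.of(new String[] {"1", "alice"}, new String[] {"2", "bob"});
        Map<String, String> right = Map.of("1", "engineering");
        System.out.println(join(left, right, false));
        System.out.println(join(left, right, true));
    }
}
```

No shuffle is involved, which is why the right table has to fit in each mapper's heap.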
Reduce-Side Joins
With Secondary Sort
TableAMapper.map:
Emit primary key+'A', record+'A'
TableBMapper.map:
Emit foreign key+'B', record+'B'
SortComparator:
Records 'A' before Records 'B'
Reducer:
emits 'A' Record + 'B' Record, null
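A sketch of the reduce-side inner join with secondary sort, in plain Java. The `Tagged` class, the comma-joined output and the single sorted stream are hypothetical simplifications: sorting by join key and then by tag plays the role of the sort comparator, and detecting a key change plays the role of the grouping comparator. Only the 'A' records of the current key are buffered.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory model of the tagged map output and the reducer.
final class ReduceSideJoinSketch {
    static final class Tagged {
        final String key;     // primary key for table A, foreign key for table B
        final char tag;       // 'A' or 'B'
        final String payload;

        Tagged(String key, char tag, String payload) {
            this.key = key;
            this.tag = tag;
            this.payload = payload;
        }
    }

    static List<String> join(List<Tagged> mapOutput) {
        List<Tagged> sorted = new ArrayList<>(mapOutput);
        // Secondary sort: by join key, then by tag, so 'A' precedes 'B' within a key.
        sorted.sort(Comparator.comparing((Tagged t) -> t.key).thenComparingInt(t -> t.tag));

        List<String> out = new ArrayList<>();
        List<String> aRecords = new ArrayList<>();
        String currentKey = null;
        for (Tagged t : sorted) {
            if (!t.key.equals(currentKey)) {
                currentKey = t.key;   // a new reduce group starts
                aRecords.clear();
            }
            if (t.tag == 'A') {
                aRecords.add(t.payload);          // buffer the 'A' side only
            } else {
                for (String a : aRecords) {
                    out.add(a + "," + t.payload); // stream the 'B' side
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join(List.of(
            new Tagged("1", 'B', "order1"),
            new Tagged("1", 'A', "alice"),
            new Tagged("2", 'B', "orphan"),
            new Tagged("1", 'B', "order2"))));
    }
}
```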
Composite (Merge) Join
Data sets pre-sorted
Data sets partitioned on the same key
CompositeInputFormat in Mappers
Total Order Sorting
Job 1:
Data → Mappers -> SequenceFile (key, value)
Job 2:
InputSampler
TotalOrderPartitioner(InputSampler)
Identity mapper, reducers
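The idea behind sampling plus a total-order partitioner can be sketched as range partitioning: split points taken from a sample route each key to a reducer, every reducer sorts its own range, and the concatenated outputs are globally sorted. The class `TotalOrderSketch` and its methods are hypothetical and do not reproduce Hadoop's InputSampler or TotalOrderPartitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory model of sample -> range partition -> per-partition sort.
final class TotalOrderSketch {
    // Choose numPartitions - 1 split points from the sorted sample.
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return splits;
    }

    // Keys below splits[0] go to partition 0; keys at or above a split go right of it.
    static int getPartition(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    static List<Integer> sort(int[] keys, int[] sample, int numPartitions) {
        int[] splits = splitPoints(sample, numPartitions);
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int key : keys) {
            partitions.get(getPartition(key, splits)).add(key);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            Collections.sort(partition); // each reducer sorts only its own range
            out.addAll(partition);       // outputs concatenate in partition order
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sort(new int[] {9, 1, 7, 3, 8, 2}, new int[] {1, 3, 7, 9}, 3));
    }
}
```

A skewed sample yields uneven partitions, which is why the quality of the sample matters for reducer balance.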
Input:
Site1 tag1
Site1 tag2
Site3 tag3
Output - top 10 similar sites per site, (secondary) sorted
Site1 Similar1 count-of-common-tags
Site1 Similar2 count-of-common-tags
Site2 Similar1 count-of-common-tags
Millions of sites
Some tags appear on thousands of sites
What is input/output of each mapper/reducer?
Hint – chain jobs

Editor's Notes

  1. Not an overview of Hadoop
  2. An algorithmic template for distributed batch processing. Flexible, but a poor fit for iterative algorithms. Google paper, 2004.
  3. The blocks are Mappers and Reducers, NOT calls.
     Where? In the Hadoop implementation, Mappers and Reducers are JVMs in the cluster.
     When? slowstart.completedmaps, 5% by default.
     How many?
  4. Buffer in RAM: spill after 80% of io.sort.mb (100 MB by default); maps block during the spill.
     Partition, Sort & Spill to disk: can also group (if Combiners are specified).
     Pulled by Reducers: HTTP, Netty.
  5. How to write a MapReduce job?
  6. Pivotal HD; IBM BigInsights
  7. Google papers; Yahoo
  8. CAP: pick two. Big blocks: seek time; too big: concurrency. Replicated: cheap commodity hardware.
  9. TaskTracker: data locality
  10. The AppMaster (MRAppMaster) runs in a Container on a NodeManager.
     No slots: Containers differ in RAM size, cores, etc. and can run anything.
     Flexibility: cluster utilization.
     MRAppMaster; uber task; Shuffle Service of YARN.
  11. Cleanup * Setup
  12. Counters are sent with status updates:
     context.getCounter(counterGroupName, counterName).increment(1)
     The driver collects the outputs when the job completes:
     for (Counter counter : job.getCounters().getGroup(counterGroupName)) {
         System.out.println(counter.getDisplayName() + "\t" + counter.getValue());
     }
  13. Output file per mapper: part-m- (m instead of r).
     Optional: Identity Reducer → one output file (hot spot, performance suffers).
  14. Parameters for BloomFilter construction:
     public static int getOptimalBloomFilterSize(int numElements, float falsePosRate) {
         return (int) (-numElements * (float) Math.log(falsePosRate) / Math.pow(Math.log(2), 2));
     }
     public static int getOptimalK(float numElements, float vectorSize) {
         return (int) Math.round(vectorSize * Math.log(2) / numElements);
     }
  15. NOTE: mappers emit only in cleanup.
     SELECT * FROM table ORDER BY col1 LIMIT 10;
     Mapper.setup(): initialize the top ten sorted list (e.g. TreeMap)
     Mapper.map(key, record): insert record into the top ten sorted list; truncate the list to a length of 10
     Mapper.cleanup(): for record in the top ten sorted list: emit null, record
     Reducer.reduce(key, records): emit the top ten records (e.g. use TreeMap)
  16. SELECT DISTINCT * FROM table
     Use Combiners.
  17. Data Organization Patterns.
     "Join" to XML: <department><employee/><employee/></department>
     MultipleInputs: assign Mappers to directories.
     Many Type2 records on one Type1 record → Reducer hot spot.
  18. Uses: partition pruning by date or by category; sharding.
  19. Binning: partitioning in Mappers.
     Use a derived class of MultipleOutputs for the exact format of the output.
     Pros: no reducers (performance); not really MapReduce.
     Cons: number of output files = number of bins * number of mappers.
  20. SELECT * FROM data ORDER BY RAND()
     No hot spots.
  21. All tables but one must fit in RAM (JVM heap).
     The large data set is the Left Table.
     Inner or Left Outer Join (unmatched records from the Left Table go to the output).
  22. MultipleInputs.
     TableAMapper.map adds 'A' to both the output key and the value.
     TableBMapper.map adds 'B' to both the output key and the value.
     Map output key: primary key for A, foreign key for B, plus the tag.
     Secondary sort puts a Record 'A' before Records 'B'.
     The Reducer emits 'A' Records matched with 'B' Records.
     Only 'A' Records are held in RAM.
     The right way: with secondary sort.
     Outer joins: emit even if only one type of Record is present.
  23. Many large inputs.
     Map side only (no Reducers); not really MapReduce.
     Data sets sorted and partitioned on the same key.
     All data sets have the same number of partitions.
     All records for a key must be in one partition (GZIP is OK).
     CompositeInputFormat.
     Number of output files = number of map tasks.
     Performance: no file locality for splits of both tables.
     Performance: data preparation is needed.
  24. Parallel: multiple Reducers (otherwise trivial).
     Input of the second job: the SequenceFile.
     The second job:
     job.setPartitionerClass(TotalOrderPartitioner.class);
     "Pivot of QuickSort":
     InputSampler.writePartitionFile(job, new InputSampler.RandomSampler(.001, 10000));
     job.addCacheFile(InputSampler);