23. Filtering Top Ten
Mapper.setup(): initialize a sorted list
Mapper.map(key, record):
insert record into list
truncate list to 10
Mapper.cleanup():
for records in the list: emit null, record
Reducer.reduce(key, records):
same as in the mappers: keep only the top ten, then emit them
25. Structured to Hierarchical
Mappers on dataset1 send to Reducers:
Ids, Records of Type1
Mappers on dataset2 send to Reducers:
Parent Ids, Records of Type 2
30. Reduce-Side Joins
With Secondary Sort
TableAMapper.map:
Emit primary key+’A’, record+’A’
TableBMapper.map:
Emit foreign key+’B’, record+’B’
SortComparator:
Records 'A' before Records 'B'
Reducer:
emit 'A' Record + 'B' Record, null
31. Composite (Merge) Join
Data sets pre-sorted
Data sets partitioned on the same key
CompositeInputFormat in Mappers
32. Total Order Sorting
Job 1:
Data → Mappers → SequenceFile (key, value)
Job 2:
InputSampler
TotalOrderPartitioner(InputSampler)
Identity mapper, reducers
33. Input:
Site1 tag1
Site1 tag2
Site3 tag3
Output - top 10 similar sites per site, (secondary) sorted
Site1 Similar1 count-of-common-tags
Site1 Similar2 count-of-common-tags
Site2 Similar1 count-of-common-tags
Millions of sites
Some tags appear in thousands of sites
What is input/output of each mapper/reducer?
Hint – chain jobs
Editor's Notes
Not an overview of Hadoop
Algorithmic template – for Distributed Batch Processing
Flexible, bad for iterative algorithms
Google Paper 2004
Blocks are Mappers, Reducers, NOT CALLS
Where? - In Hadoop implementation Mappers, Reducers are JVMs in cluster
When? – mapreduce.job.reduce.slowstart.completedmaps, 5% def.
How many?
Buffer in RAM – spill after 80% of io.sort.mb (100 MB def.); maps block if the buffer fills during a spill
Partition, Sort & Spill to disk – can combine (if a Combiner is specified)
Pulled by Reducers - (HTTP, Netty)
How to write a MapReduce job?
Pivotal HD
IBM - BigInsight
Google Papers
Yahoo
CAP – pick two
Big Blocks – seek time; Too Big – concurrency
Replicated – Cheap commodity
Task Tracker – data locality
AppMaster in Container in NodeManager (MRAppMaster)
No slots => containers differ in RAM size/cores etc. and can run anything
Flexibility – cluster utilization
MRAppMaster
Uber task
Shuffle Service of YARN
Cleanup & Setup
Sent with status updates
context.getCounter(counterGroupName, counterName).increment(1)
Driver collects outputs when job completes:
for (Counter counter : job.getCounters().getGroup(counterGroupName)) {
System.out.println(counter.getDisplayName() + "\t" + counter.getValue());
}
Output file per mapper: part-m-00000 ('m' instead of the 'r')
Optional: Identity Reducer → one output file (hot spot, performance suffers)
Parameters for BloomFilter construction:
public static int getOptimalBloomFilterSize(int numElements, float falsePosRate) {
    // m = -n * ln(p) / (ln 2)^2  (bits in the bit vector)
    return (int) (-numElements * (float) Math.log(falsePosRate)
            / Math.pow(Math.log(2), 2));
}
public static int getOptimalK(float numElements, float vectorSize) {
    // k = round(m / n * ln 2)  (number of hash functions)
    return (int) Math.round(vectorSize * Math.log(2) / numElements);
}
NOTE: Emits from mappers only in CLEANUP
SELECT * FROM table ORDER BY col1 LIMIT 10;
Mapper.setup():
initialize top ten sorted list (e.g. TreeMap)
Mapper.map(key, record):
insert record into top ten sorted list
truncate the list to a length of 10
Mapper.cleanup():
for record in top sorted ten list:
emit null,record
Reducer.reduce(key, records):
emit top ten record (e.g. use TreeMap)
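A minimal plain-Java sketch of the per-mapper logic above; ranking by record length is a toy stand-in for the real ordering criterion, and note the usual caveat that a TreeMap keeps one record per score (ties overwrite):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Per-mapper top-ten maintenance: keep at most 10 entries in a TreeMap
// ordered by the ranking value; once it exceeds 10, evict the smallest.
public class TopTenList {
    public static TreeMap<Integer, String> topTen(Iterable<String> records) {
        TreeMap<Integer, String> top = new TreeMap<>();
        for (String record : records) {
            int score = record.length();    // toy ranking criterion
            top.put(score, record);         // NOTE: equal scores overwrite
            if (top.size() > 10) {
                top.remove(top.firstKey()); // truncate the list to 10
            }
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 1; i <= 15; i++) {
            records.add("x".repeat(i));     // records of lengths 1..15
        }
        // keeps the ten longest: lengths 6..15
        System.out.println(topTen(records).keySet());
    }
}
```

The reducer does the same with the (at most 10 × number-of-mappers) records it receives under the single null key.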
SELECT DISTINCT * FROM table
Use Combiners
Data Organization Patterns
“Join” to XML
<department><employee/><employee/></department>
MultipleInputs – assign Mappers to Directories
Many Type2 On 1 Type1 → Reducer Hot Spot
Uses:
Partition Pruning by date or by category
Sharding
Binning – Partitioning in Mappers
Use a derived class of MultipleOutputs for the exact output format
Pros:
No reducers (performance); not really MapReduce
Cons:
Number of output files = Number of Bins * Number of Mappers
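A hedged Hadoop sketch of binning in a map-only job; the named output "bins" and getCategory() are placeholders for the real setup:

```java
// Map-only binning (sketch): each mapper writes records into per-category
// files via MultipleOutputs; "bins" and getCategory() are placeholders.
public static class BinningMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
    private MultipleOutputs<Text, NullWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String category = getCategory(value);   // placeholder
        // base output path: one set of files per bin, per mapper
        mos.write("bins", value, NullWritable.get(), category + "/part");
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close();
    }
}
// Driver side:
//   MultipleOutputs.addNamedOutput(job, "bins",
//           TextOutputFormat.class, Text.class, NullWritable.class);
//   job.setNumReduceTasks(0);  // map-only, hence bins x mappers output files
```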
SELECT * FROM data ORDER BY RAND()
No hotspots
All but one table must fit in RAM (JVM heap)
The large data set is Left Table
Inner or Left Outer Join (Unmatched records from Left Table go to the output)
MultipleInputs
TableAMapper.map adds 'A' to both output key and value
TableBMapper.map adds 'B' to both output key and value
Map: output key – primary key for A, foreign for B + tag
Secondary sort puts a Record 'A' before Records 'B'
Reducer emits 'A' Records matched with 'B' Records
Only 'A' records in RAM
The right way – with secondary sort
Outer joins: emit even if only one type of Records present
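A plain-Java simulation of the reducer's merge step for one join key, assuming the secondary sort delivers all 'A'-tagged values before any 'B'-tagged value (this sketch does the inner join only; a left outer join would also emit buffered 'A' records that saw no 'B'):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulates reduce() in a reduce-side inner join with secondary sort:
// 'A' records arrive first and are buffered; 'B' records are streamed
// and matched against the buffer, so only the 'A' side sits in RAM.
public class ReduceSideJoinMerge {
    public static List<String> joinOneKey(List<String> taggedValues) {
        List<String> aRecords = new ArrayList<>(); // only 'A' records buffered
        List<String> joined = new ArrayList<>();
        for (String v : taggedValues) {
            if (v.startsWith("A")) {
                aRecords.add(v.substring(1));      // strip the 'A' tag
            } else {
                String b = v.substring(1);         // a 'B' record: stream it
                for (String a : aRecords) {
                    joined.add(a + "," + b);       // emit 'A' record + 'B' record
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // sorted value stream for one key: the 'A' record comes first
        List<String> values = Arrays.asList("Aalice", "Border1", "Border2");
        System.out.println(joinOneKey(values)); // [alice,order1, alice,order2]
    }
}
```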
Many large inputs
Map side only (no Reducers) – not really MapReduce
Data sets sorted and partitioned on the same key
All data sets have the same number of partitions
All records for a key must be in 1 partition (GZIP is OK)
CompositeInputFormat
Number of output files = number of map tasks
Performance: no file locality for splits of both tables
Performance: data preparation needs
Parallel - Multiple Reducers (otherwise trivial)
Input of the second job: the SequenceFile
The second job: job.setPartitionerClass(TotalOrderPartitioner.class);
“pivot of QuickSort”: InputSampler.writePartitionFile(job, new InputSampler.RandomSampler(.001, 10000));
job.addCacheFile(partitionFileUri); // distribute the partition file the sampler wrote, not the sampler itself
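Putting the pieces together as a hedged driver sketch for job 2; stagingDir and numReducers are placeholders, and the default Mapper and Reducer base classes act as identities:

```java
// Job 2 of total order sorting (sketch; stagingDir / numReducers are placeholders)
Job job = Job.getInstance(conf, "total order sort");
job.setInputFormatClass(SequenceFileInputFormat.class);
FileInputFormat.addInputPath(job, stagingDir);      // output of job 1
job.setNumReduceTasks(numReducers);
job.setPartitionerClass(TotalOrderPartitioner.class);

Path partitionFile = new Path(stagingDir, "_partitions");
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
// "pivot of QuickSort": sample ~0.1% of input keys (10k max) for boundaries
InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<>(0.001, 10000));
// distribute the partition file, not the sampler itself
job.addCacheFile(partitionFile.toUri());
```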