MapReduce Design Patterns
with
Evgeny Benediktov,
EIS Architecture
MapReduce Scalable
Flexible
No overhead
(K1,V1) –> Map –> (K2,V2)
Shuffle & Sort
(K2,List[V2]) –> Reduce –> (K3,V3)
How does MapReduce work?
Line 1: How many cookies could
Line 2: a good cook cook if a
Line 3: good cook could cook cookies?
WordCount
IN: Offset, Line1
OUT: could, 1
IN: Offset, Line3
OUT: cook, 1
OUT: could, 1
OUT: cook, 1
IN: Offset, Line2
OUT: cook, 1
OUT: cook, 1
OUT: if, 1
IN: could, <1, 1>
OUT: could, 2
IN: cook, <1, 1, 1, 1>
OUT: cook, 4
IN: if, <1>
OUT: if, 1
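The WordCount flow above can be simulated in plain Java without a cluster. This is only a sketch: the class name WordCountSketch and its methods are hypothetical, no Hadoop classes are used, and tokenization (lower-casing, splitting on non-letters) is an assumption the slide does not specify.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of WordCount (hypothetical names, no Hadoop API).
class WordCountSketch {

    // Map phase: (offset, line) -> one (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Shuffle & sort: group the values of each key; TreeMap keeps keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: (word, list of ones) -> (word, sum).
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Runs map, shuffle and reduce in sequence on the given lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                "How many cookies could",
                "a good cook cook if a",
                "good cook could cook cookies?")));
    }
}
```

In a real job the three phases run in different JVMs; here they are ordinary method calls so the data flow (K1,V1) → (K2,V2) → (K2,List[V2]) → (K3,V3) is easy to follow.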
Shuffle & Sort
Buffer in RAM
Partition, Sort & Spill to disk
Pulled by Reducers
Merge
MongoDB
Spark
Hadoop
Where is MapReduce implemented?
Distributions
HDFS
MapReduce
Everything Else
What is inside
NameNode
DataNode DataNode DataNode
Append only
64-256MB Blocks
Replicated
HDFS
NameNode
TaskTracker
DataNode
TaskTracker
DataNode
TaskTracker
DataNode
JobTracker
HDFS+MapReduce1
NameNode
Container
NodeManager
DataNode
Container
NodeManager
DataNode
AppMaster
NodeManager
DataNode
ResourceManager
HDFS+MapReduce2
Mapper
Reducer
Partitioner
Combiner
InputFormat
OutputFormat
RecordReader
RecordWriter
Classes
(K2, V2)->(K2, List(V2))
setPartitionerClass
setGroupingComparatorClass
setSortComparatorClass
SecondarySort
MetaData
Client->HDFS->Local FS
DistributedCache
Summarization
Numerical Summarizations
Inverted Index Summarizations
Counting with Counters
Filtering
Filtering
Bloom Filtering
Top Ten
Distinct
Data Organization
Structured to Hierarchical
Partitioning
Binning
Total Order Sorting
Shuffling
Input and Output
Generating Data
External Source Output
External Source Input
Partition Pruning
Metapatterns
Job Chaining
Job Merging
Joins
Reduce Side Join
Replicated Join
Composite Join
Cartesian Product
Summarizations
Summarization with Counters
No Reducer
Up to 100
Named
Filtering
map(key, record):
if (keep record) emit key,value
Identity Reducer or None
Output file per mapper
Bloom Filtering
Training: Records → BloomFilter File
Mapper.setup:
DistributedCache→BloomFilter
Mapper.map:
filter.membershipTest
Emit value, null
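To make the membership test concrete, here is a minimal Bloom filter in plain Java. It is a sketch under assumptions, not Hadoop's own BloomFilter class: the name BloomFilterSketch, the double-hashing scheme built on String.hashCode(), and rounding the size up are all illustrative choices.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch (hypothetical class, not the Hadoop implementation).
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    BloomFilterSketch(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Bit-vector size m = -n * ln(p) / (ln 2)^2, rounded up.
    static int optimalSize(int numElements, double falsePosRate) {
        double ln2 = Math.log(2);
        return (int) Math.ceil(-numElements * Math.log(falsePosRate) / (ln2 * ln2));
    }

    // Number of hash functions k = (m / n) * ln 2, at least 1.
    static int optimalHashes(int numElements, int size) {
        return Math.max(1, (int) Math.round((double) size / numElements * Math.log(2)));
    }

    // i-th bit position via double hashing (an assumption made for brevity).
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // second hash, forced odd
        return Math.floorMod(h1 + i * h2, size);
    }

    // Training step: set every bit that the value hashes to.
    void add(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    // Never returns false for an added value; may return true for others.
    boolean membershipTest(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int m = optimalSize(1000, 0.01);
        BloomFilterSketch filter = new BloomFilterSketch(m, optimalHashes(1000, m));
        filter.add("alice");
        System.out.println(filter.membershipTest("alice"));
    }
}
```

Because false positives are possible, a mapper using such a filter passes through a small fraction of records that were never in the training set; false negatives cannot occur.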
Filtering Top Ten
Mapper.setup(): initialize a sorted list
Mapper.map(key, record):
insert record into list
truncate list to 10
Mapper.cleanup():
for records in the list: emit null, record
Reducer.reduce(key, records):
as in mappers
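The two-level structure (each mapper keeps a local top N, one reducer merges them) can be sketched in plain Java. Everything here is illustrative: TopTenSketch is a hypothetical class, records are reduced to integer scores, and N is a parameter rather than a fixed 10.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Top-N sketch: local top N per input split, then a final top N over all of them.
class TopTenSketch {

    // Insert each record into a sorted list and truncate the list to n entries.
    static List<Integer> topN(List<Integer> records, int n) {
        List<Integer> top = new ArrayList<>();
        for (int record : records) {
            top.add(record);
            top.sort(Comparator.reverseOrder());
            if (top.size() > n) {
                top.remove(top.size() - 1); // drop the smallest entry
            }
        }
        return top;
    }

    // Each split plays the role of one mapper; the final call is the single reducer.
    static List<Integer> run(List<List<Integer>> splits, int n) {
        List<Integer> emitted = new ArrayList<>();
        for (List<Integer> split : splits) {
            emitted.addAll(topN(split, n)); // what Mapper.cleanup() would emit
        }
        return topN(emitted, n);
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                Arrays.asList(5, 1, 9, 3),
                Arrays.asList(8, 2, 7),
                Arrays.asList(6, 4, 10)), 3));
    }
}
```

The reducer sees at most N records per mapper, which is why a single reducer is acceptable for this pattern.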
Filtering Distinct Values
map(key, record):
emit record,null
reduce(key, records):
emit key
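A small plain-Java sketch of why this works: the shuffle groups identical records under one key, so the reducer only needs to emit each key once. DistinctSketch is a hypothetical name, and a TreeMap stands in for the framework's shuffle and sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Distinct-values sketch (no Hadoop API; TreeMap simulates shuffle & sort).
class DistinctSketch {

    static List<String> distinct(List<String> records) {
        // map: emit (record, null); the shuffle groups equal records together
        Map<String, List<Object>> grouped = new TreeMap<>();
        for (String record : records) {
            grouped.computeIfAbsent(record, k -> new ArrayList<>()).add(null);
        }
        // reduce: emit each key once and ignore the list of nulls
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        System.out.println(distinct(Arrays.asList("b", "a", "b", "c", "a")));
    }
}
```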
Structured to Hierarchical
Mappers on dataset1 send to Reducers:
Ids, Records of Type1
Mappers on dataset2 send to Reducers:
Parent Ids, Records of Type 2
Partitioning
Identity Mapper
Identity Reducer
Smart Partitioner:
public int getPartition(IntWritable key, Text value, int numPartitions) {
    return key.get() /* year */ - minLastAccessDateYear;
}
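The partitioner above only works if the job is configured with one reducer per year, since the returned index must fall in [0, numPartitions). The sketch below shows the same arithmetic in plain Java with an explicit range check; YearPartitionerSketch and the check itself are assumptions added for illustration, not part of the slide's code.

```java
// Year-based partitioning sketch with a range guard (hypothetical class).
class YearPartitionerSketch {

    // Maps a year to a partition index: minYear -> 0, minYear + 1 -> 1, and so on.
    static int getPartition(int year, int minYear, int numPartitions) {
        int partition = year - minYear;
        if (partition < 0 || partition >= numPartitions) {
            throw new IllegalArgumentException(
                    "year " + year + " is outside the " + numPartitions
                            + " partitions starting at " + minYear);
        }
        return partition;
    }

    public static void main(String[] args) {
        // Four reducers cover the years 2008..2011 in this example.
        System.out.println(getPartition(2011, 2008, 4));
    }
}
```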
Binning
setup:
mos = new MultipleOutputs
map:
if (…) {
mos.write(key, value, BINNAME)
//BINNAME-mNNNNN
} else..
Shuffling
Mapper.map:
Emit random, record
Reducer.reduce:
Emit record, null
Map-side Join
Mapper.setup:
DistributedCache → Map (Right Table)
Mapper.map:
Read split of Left Table, Join
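The map-side join amounts to a hash join: the small table is held in memory and each record of the large table is looked up as it streams by. A minimal plain-Java sketch follows; ReplicatedJoinSketch, the two-column row layout and the comma-separated output are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Replicated (map-side) join sketch: in-memory right table, streamed left table.
class ReplicatedJoinSketch {

    // right: join key -> right record, as Mapper.setup would load it from the cache.
    // left: rows of {joinKey, leftRecord}, as Mapper.map would read them from its split.
    static List<String> join(List<String[]> left, Map<String, String> right, boolean leftOuter) {
        List<String> out = new ArrayList<>();
        for (String[] row : left) {
            String match = right.get(row[0]);
            if (match != null) {
                out.add(row[0] + "," + row[1] + "," + match);
            } else if (leftOuter) {
                out.add(row[0] + "," + row[1] + ",null"); // unmatched left record
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> right = new HashMap<>();
        right.put("1", "Sales");
        List<String[]> left = Arrays.asList(
                new String[] {"1", "ann"},
                new String[] {"3", "bob"});
        System.out.println(join(left, right, true));
    }
}
```

No shuffle is involved, so the join needs no reducers; the cost is that every mapper must hold the whole right table in its heap.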
Reduce-Side Joins
With Secondary Sort
TableAMapper.map:
Emit primary key+'A', record+'A'
TableBMapper.map:
Emit foreign key+'B', record+'B'
SortComparator:
Records 'A' before Records 'B'
Reducer:
emits 'A' Record + 'B' Record, null
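The effect of the secondary sort can be sketched in plain Java: records are tagged, sorted by (join key, tag), and the reducer buffers only the 'A' records of the current key while the 'B' records stream past. The class ReduceSideJoinSketch, the Tagged helper and the "|" output separator are hypothetical; a single sorted list stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Reduce-side inner join sketch with secondary sort (no Hadoop API).
class ReduceSideJoinSketch {

    // Composite of join key, table tag ('A' or 'B') and the record itself.
    static final class Tagged {
        final String joinKey;
        final char tag;
        final String record;

        Tagged(String joinKey, char tag, String record) {
            this.joinKey = joinKey;
            this.tag = tag;
            this.record = record;
        }
    }

    // tableA rows: {primaryKey, record}; tableB rows: {foreignKey, record}.
    static List<String> join(List<String[]> tableA, List<String[]> tableB) {
        List<Tagged> shuffled = new ArrayList<>();
        for (String[] a : tableA) {
            shuffled.add(new Tagged(a[0], 'A', a[1])); // TableAMapper
        }
        for (String[] b : tableB) {
            shuffled.add(new Tagged(b[0], 'B', b[1])); // TableBMapper
        }
        // Sort comparator: by join key, then by tag, so 'A' records come first.
        shuffled.sort(Comparator.comparing((Tagged t) -> t.joinKey)
                .thenComparingInt(t -> t.tag));

        List<String> out = new ArrayList<>();
        String currentKey = null;
        List<String> bufferedA = new ArrayList<>(); // only 'A' records are kept in memory
        for (Tagged t : shuffled) {
            if (!t.joinKey.equals(currentKey)) { // group boundary: a new reduce() call
                currentKey = t.joinKey;
                bufferedA.clear();
            }
            if (t.tag == 'A') {
                bufferedA.add(t.record);
            } else {
                for (String a : bufferedA) {
                    out.add(a + "|" + t.record);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join(
                Arrays.asList(new String[] {"1", "Ann"}, new String[] {"2", "Bob"}),
                Arrays.asList(new String[] {"1", "order-7"}, new String[] {"3", "order-5"})));
    }
}
```

Without the secondary sort the reducer would have to buffer both sides of a key before it could emit anything, which is what makes the sorted variant preferable when one key has many 'B' records.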
Composite (Merge) Join
Data sets pre-sorted
Data sets partitioned on the same key
CompositeInputFormat in Mappers
Total Order Sorting
Job 1:
Data → Mappers → SequenceFile (key, value)
Job 2:
InputSampler
TotalOrderPartitioner(InputSampler)
Identity mapper, reducers
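The idea behind the sampler and the total-order partitioner can be shown with a plain-Java range partitioner: split points taken from a sample send each key range to its own partition, so sorted partitions concatenate into one globally sorted output. This is a sketch only; TotalOrderSketch and its methods are hypothetical, keys are integers, and the sample is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Total-order sorting sketch: sample -> split points -> range partitions -> local sorts.
class TotalOrderSketch {

    // Picks numPartitions - 1 split points from the sorted sample (the "partition file").
    static int[] splitPoints(List<Integer> sample, int numPartitions) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted.get(i * sorted.size() / numPartitions);
        }
        return points;
    }

    // Range partitioner: counts how many split points are less than or equal to the key.
    static int getPartition(int key, int[] points) {
        int partition = 0;
        while (partition < points.length && key >= points[partition]) {
            partition++;
        }
        return partition;
    }

    // Each partition is sorted on its own (one reducer each), then concatenated in order.
    static List<Integer> sort(List<Integer> keys, List<Integer> sample, int numPartitions) {
        int[] points = splitPoints(sample, numPartitions);
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int key : keys) {
            partitions.get(getPartition(key, points)).add(key);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            Collections.sort(partition);
            out.addAll(partition);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sort(
                Arrays.asList(9, 3, 7, 1, 8, 2, 6, 4, 5),
                Arrays.asList(2, 5, 8), 3));
    }
}
```

A poor sample still yields a correct sort, but the partitions become unbalanced and one reducer ends up doing most of the work.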
Input:
Site1 tag1
Site1 tag2
Site3 tag3
Output - top 10 similar sites per site, (secondary) sorted
Site1 Similar1 count-of-common-tags
Site1 Similar2 count-of-common-tags
Site2 Similar1 count-of-common-tags
Millions of sites
Some tags appear on thousands of sites
What is input/output of each mapper/reducer?
Hint – chain jobs

Editor's notes

  1. Not an overview of Hadoop
  2. Algorithmic template for distributed batch processing. Flexible, but a poor fit for iterative algorithms. Described in the Google paper of 2004.
  3. The blocks are Mappers and Reducers, NOT calls. Where? In the Hadoop implementation, Mappers and Reducers are JVMs in the cluster. When? Controlled by slowstart.completedmaps (default 5%). How many?
  4. Buffer in RAM: spill starts at 80% of io.sort.mb (default 100 MB); maps block during a spill. Partition, Sort & Spill to disk: can also group (if Combiners are specified). Pulled by Reducers (HTTP, Netty).
  5. How to write a MapReduce job?
  6. Pivotal HD, IBM BigInsights
  7. Google papers, Yahoo
  8. CAP: pick two. Big blocks: seek time; too big: concurrency. Replicated: cheap commodity hardware.
  9. TaskTracker: data locality
  10. AppMaster runs in a Container on a NodeManager (MRAppMaster). No slots => Containers differ in RAM size, cores, etc. and can run anything. Flexibility: cluster utilization. MRAppMaster. Uber task. Shuffle Service of YARN.
  11. Cleanup * Setup
  12. Sent with status updates:
       context.getCounter(counterGroupName, counterName).increment(1);
     The driver collects the outputs when the job completes:
       for (Counter counter : job.getCounters().getGroup(counterGroupName)) {
           System.out.println(counter.getDisplayName() + "\t" + counter.getValue());
       }
  13. Output file per mapper: part-m- (m instead of the r). Optional: Identity Reducer → one output file (hot spot, performance suffers).
  14. Parameters for BloomFilter construction:
       public static int getOptimalBloomFilterSize(int numElements, float falsePosRate) {
           return (int) (-numElements * (float) Math.log(falsePosRate) / Math.pow(Math.log(2), 2));
       }
       public static int getOptimalK(float numElements, float vectorSize) {
           return (int) Math.round(vectorSize * Math.log(2) / numElements);
       }
  15. NOTE: mappers emit only in CLEANUP.
       SELECT * FROM table ORDER BY col1 LIMIT 10;
     Mapper.setup(): initialize top ten sorted list (e.g. TreeMap)
     Mapper.map(key, record): insert record into top ten sorted list; truncate the list to a length of 10
     Mapper.cleanup(): for record in top ten sorted list: emit null, record
     Reducer.reduce(key, records): emit top ten records (e.g. use TreeMap)
  16. SELECT DISTINCT * FROM table. Use Combiners.
  17. Data Organization Patterns. “Join” to XML: <department><employee/><employee/></department>. MultipleInputs: assign Mappers to directories. Many Type2 records on one Type1 record → Reducer hot spot.
  18. Uses: partition pruning by date or by category; sharding.
  19. Binning: partitioning in Mappers. Use a derived class of MultipleOutputs for the exact format of the output. Pros: no reducers (performance), not really MapReduce. Cons: number of output files = number of bins * number of mappers.
  20. SELECT * FROM data ORDER BY RAND(). No hot spots.
  21. All tables but one must fit in RAM (JVM heap). The large data set is the Left Table. Inner or Left Outer Join (unmatched records from the Left Table go to the output).
  22. MultipleInputs. TableAMapper.map adds 'A' to both output key and value. TableBMapper.map adds 'B' to both output key and value. Map output key: primary key for A, foreign key for B, plus the tag. Secondary sort puts a Record 'A' before Records 'B'. Reducer emits 'A' Records matched with 'B' Records. Only 'A' Records are held in RAM. The right way: with secondary sort. Outer joins: emit even if only one type of Records is present.
  23. Many large inputs. Map side only (no Reducers), not really MapReduce. Data sets sorted and partitioned on the same key. All data sets have the same number of partitions. All records for a key must be in one partition (GZIP is OK). CompositeInputFormat. Number of output files = number of map tasks. Performance: no file locality for splits of both tables. Performance: data preparation is needed.
  24. Parallel: multiple Reducers (otherwise trivial). Input of the second job: the SequenceFile. The second job:
       job.setPartitionerClass(TotalOrderPartitioner.class);
     “Pivot of QuickSort”:
       InputSampler.writePartitionFile(job, new InputSampler.RandomSampler(.001, 10000));
       job.addCacheFile(InputSampler);