MapReduce Design Patterns

Anastasiia Kornilova,
SoftServe Data Science Group
MapReduce Components
❖ record reader
❖ map
❖ combiner
❖ partitioner
❖ shuffle and sort
❖ reduce
❖ output format

[Diagram: Reader → Mapper → Combiner → Partitioner → Shuffle and sort → Reducer → Output]
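
The stages listed above can be sketched as a single-process word count. This is a plain-Python simulation for illustration, not Hadoop code; the function names (`record_reader`, `mapper`, `combiner`, `partitioner`, `reducer`, `run_job`), the toy hash, and the two-reducer setup are all assumptions.

```python
from collections import defaultdict

def record_reader(split):
    """Turn a raw input split into (offset, line) records."""
    for offset, line in enumerate(split.splitlines()):
        yield offset, line

def mapper(_key, line):
    """Emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def combiner(pairs):
    """Pre-aggregate one mapper's output locally to shrink the shuffle."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def partitioner(key, num_reducers):
    """Pick the reducer for a key (deterministic toy hash, an assumption)."""
    return sum(key.encode("utf-8")) % num_reducers

def reducer(key, values):
    """Sum all partial counts that arrived for one key."""
    return key, sum(values)

def run_job(splits, num_reducers=2):
    # One bucket of grouped values per reducer: {key: [values]}.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        mapped = [kv for k, line in record_reader(split) for kv in mapper(k, line)]
        for key, value in combiner(mapped):
            buckets[partitioner(key, num_reducers)][key].append(value)
    output = {}
    for bucket in buckets:
        # Shuffle and sort: each reducer sees its keys in sorted order.
        for key in sorted(bucket):
            k, v = reducer(key, bucket[key])
            output[k] = v  # "output format": here simply a dict
    return output

counts = run_job(["to be or not", "to be"])
```

Each string passed to `run_job` stands in for one input split; in a real cluster the splits, mappers, and reducers would run on different machines.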
MapReduce Patterns
❖ Filtering Patterns
❖ Summarization Patterns
❖ Join Patterns
❖ Data Organization Patterns
❖ Metapatterns
❖ Input and Output Patterns
Filtering patterns
❖ Filtering
❖ Bloom filtering
❖ Top-N
❖ Distinct
Filtering
❖ Closer view of data
❖ Tracking a thread of events
❖ Distributed grep
❖ Data cleansing
❖ Simple random sampling
❖ Removing low scoring data
[Diagram: Input split → Filter Mapper → Output file, shown for three input splits in parallel]
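
Two of the uses listed for this pattern, distributed grep and simple random sampling, can be sketched as filter mappers. This is a minimal plain-Python illustration; `grep_mapper`, `sampling_mapper`, and the sample log lines are hypothetical names and data, not part of the slides.

```python
import random
import re

def grep_mapper(lines, pattern):
    """Distributed grep: keep only the lines that match the regular expression."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line

def sampling_mapper(lines, fraction, seed=None):
    """Simple random sampling: keep each record with probability `fraction`."""
    rng = random.Random(seed)
    for line in lines:
        if rng.random() < fraction:
            yield line

# One input split, processed independently of all other splits.
split = ["ERROR disk full", "INFO started", "ERROR timeout"]
errors = list(grep_mapper(split, r"^ERROR"))
```

Because each record is judged on its own, the diagram needs no reducer: every mapper writes its surviving records straight to its output file.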
Bloom filtering
❖ Removing most of the non-watched values
❖ Prefiltering a data set for an expensive set membership check

• Probabilistic data structure
• Compares hash function values
• Answer: "probably yes" or "no"
Step 1 - Filter Training
[Diagram: Input split → Bloom Filter Training → Output file]
Step 2 - Bloom Filtering via MapReduce
[Diagram: Input split → Bloom Filter Mapper (loads the filter from the distributed cache) → Bloom Filter Test; "Maybe" → Output file, "No" → Discarded. Shown for two input splits in parallel.]
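
The two steps can be sketched in plain Python. This is an illustrative toy, not a Hadoop implementation: the `BloomFilter` class, its default sizes, the SHA-256-based hashing, and the sample user names are assumptions, and the trained filter is passed in directly instead of being loaded from the distributed cache.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "maybe", False means "definitely not".
        return all(self.bits[pos] for pos in self._positions(item))

# Step 1: train the filter on the watched values.
watched = BloomFilter()
for user in ["alice", "bob"]:
    watched.add(user)

# Step 2: the mapper keeps "maybe" records and discards "no" records.
def bloom_filter_mapper(records, bloom):
    for user, payload in records:
        if bloom.might_contain(user):
            yield user, payload

records = [("alice", "c1"), ("carol", "c2"), ("bob", "c3")]
kept = list(bloom_filter_mapper(records, watched))
```

Records for watched users always pass. An unwatched record can occasionally slip through as a false positive, which is why the pattern suits prefiltering ahead of an exact membership check.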
Top N
❖ Outlier analysis
❖ Select interesting data
❖ Catchy dashboards
[Diagram: four input splits → one Top Ten Mapper per split (local top 10) → a single Top Ten Reducer (final top 10) → Top 10 Output]
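
The local-then-final flow in the diagram can be sketched with Python's `heapq.nlargest`. The function names, the `(id, score)` record layout, and the use of n=2 in the example are assumptions chosen to keep the sample small.

```python
import heapq

def top_n_mapper(records, n=10):
    """Return this split's local top n records, ranked by score."""
    return heapq.nlargest(n, records, key=lambda rec: rec[1])

def top_n_reducer(local_tops, n=10):
    """Merge the local tops from all mappers into the final top n."""
    merged = [rec for local in local_tops for rec in local]
    return heapq.nlargest(n, merged, key=lambda rec: rec[1])

# Two input splits of (id, score) records.
splits = [
    [("a", 5), ("b", 9), ("c", 1)],
    [("d", 7), ("e", 3)],
]
final_top = top_n_reducer([top_n_mapper(s, n=2) for s in splits], n=2)
```

Each mapper sends at most n records to the single reducer, so the reducer handles roughly n times the number of mappers regardless of input size.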
Distinct
❖ Deduplicate data
❖ Getting distinct values
❖ Protecting from inner join explosions
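
One common way to express this pattern, sketched here in plain Python, is to emit each record as a key with an empty value and let the grouping step collapse duplicates. The names `distinct_mapper`, `distinct_reducer`, and `run_distinct` are hypothetical.

```python
def distinct_mapper(records):
    """Emit each record as a key with an empty value."""
    for record in records:
        yield record, None

def distinct_reducer(key, _values):
    """All duplicates share one key, so emit that key exactly once."""
    return key

def run_distinct(splits):
    # Grouping by key stands in for the shuffle-and-sort phase.
    groups = {}
    for split in splits:
        for key, value in distinct_mapper(split):
            groups.setdefault(key, []).append(value)
    return [distinct_reducer(key, groups[key]) for key in sorted(groups)]

unique = run_distinct([["u1", "u2", "u1"], ["u2", "u3"]])
```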
Summarization patterns
❖ Numerical summarization
❖ Inverted index
❖ Counting with counters
Numerical summarization
❖ Word count
❖ Record count
❖ Min/Max/Count
❖ Average/Median/Standard deviation
[Diagram: three Mappers each emit (key, summary field) pairs → one Partitioner per mapper → two Reducers, each emitting (group, summary) pairs such as (group B, summary) and (group D, summary)]
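
A Min/Max/Count summarization following this flow can be sketched in plain Python. The function names and sample values are assumptions; the group labels B and D are borrowed from the diagram.

```python
from collections import defaultdict

def summary_mapper(records):
    """Emit (key, (min, max, count)) for each (key, value) record."""
    for key, value in records:
        yield key, (value, value, 1)

def summary_reducer(key, summaries):
    """Fold partial summaries for one key into a single (min, max, count)."""
    lo = min(s[0] for s in summaries)
    hi = max(s[1] for s in summaries)
    count = sum(s[2] for s in summaries)
    return key, (lo, hi, count)

def run_summary(splits):
    # Grouping by key stands in for the partitioner and shuffle.
    grouped = defaultdict(list)
    for split in splits:
        for key, summary in summary_mapper(split):
            grouped[key].append(summary)
    return dict(summary_reducer(k, v) for k, v in grouped.items())

result = run_summary([[("B", 4), ("D", 10)], [("B", 7), ("D", 2), ("B", 1)]])
```

Because min, max, and sum are associative, the same reducer logic could also serve as a combiner. An average needs (sum, count) pairs carried through instead, since averaging partial averages gives the wrong answer when groups differ in size.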
Inverted index
[Diagram: three Mappers each emit (keyword, unique ID) pairs → one Partitioner per mapper → two Reducers, each emitting (keyword, list of IDs) pairs such as (keyword A, list of IDs) and (keyword D, list of IDs)]
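
The same flow builds an inverted index: mappers emit (keyword, unique ID) pairs and reducers collect the IDs per keyword. A minimal plain-Python sketch follows, with hypothetical function names and whitespace tokenization as a simplifying assumption.

```python
from collections import defaultdict

def index_mapper(doc_id, text):
    """Emit (keyword, doc_id) once per distinct keyword in the document."""
    for keyword in set(text.lower().split()):
        yield keyword, doc_id

def index_reducer(keyword, doc_ids):
    """Collect the sorted list of IDs in which the keyword occurs."""
    return keyword, sorted(set(doc_ids))

def build_index(docs):
    # Grouping by keyword stands in for the partitioner and shuffle.
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for keyword, found_in in index_mapper(doc_id, text):
            grouped[keyword].append(found_in)
    return dict(index_reducer(k, v) for k, v in grouped.items())

index = build_index({1: "hadoop pig", 2: "pig latin", 3: "hadoop hadoop"})
```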
Data Organization Patterns
❖ Structured to Hierarchical
❖ Partitioning
❖ Binning
❖ Total Order Sorting
❖ Shuffling
Join patterns
❖ Reduce Side Join
❖ Replicated Join
❖ Composite Join
❖ Cartesian Product
[Diagram: Data Set A: three input splits → Join Mappers → (key, values A). Data Set B: two input splits → Join Mappers → (key, values B). Both streams → Shuffle and sort → three Join Reducers → Output parts]
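
The diagram appears to correspond to the Reduce Side Join pattern: mappers tag each record with its data set, and reducers pair the A and B records that share a key. Below is a plain-Python sketch of an inner join under that reading; the tags "A" and "B", the function names, and the sample users and comments are assumptions.

```python
from collections import defaultdict

def join_mapper(records, key_index, tag):
    """Emit (join key, (tag, record)) so the reducer can tell the data sets apart."""
    for record in records:
        yield record[key_index], (tag, record)

def join_reducer(_key, tagged):
    """Inner join: pair every A record with every B record for this key."""
    a_side = [rec for tag, rec in tagged if tag == "A"]
    b_side = [rec for tag, rec in tagged if tag == "B"]
    for a in a_side:
        for b in b_side:
            yield a + b

def reduce_side_join(data_a, data_b, key_a=0, key_b=0):
    # Grouping by join key stands in for shuffle and sort.
    grouped = defaultdict(list)
    for key, value in join_mapper(data_a, key_a, "A"):
        grouped[key].append(value)
    for key, value in join_mapper(data_b, key_b, "B"):
        grouped[key].append(value)
    joined = []
    for key in sorted(grouped):
        joined.extend(join_reducer(key, grouped[key]))
    return joined

users = [(1, "ann"), (2, "bo")]
comments = [(1, "first!"), (1, "nice"), (3, "orphan")]
joined = reduce_side_join(users, comments)
```

Keys present on only one side produce no output here. An outer join would instead emit those records padded with empty fields.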
Node table
id, title, tagnames, authorized, body, node type, parent id, abs parent id, added at, score, state string, last edited id, last activity id, last activity at, activity revision, extra, extra def, extra count

User table
user id, reputation, gold, silver, bronze
Pig examples
-- Inner Join:
A = JOIN comments BY userID, users BY userID;

-- Outer Join (LEFT, RIGHT, or FULL):
A = JOIN comments BY userID LEFT OUTER, users BY userID;

-- Binning:
SPLIT data INTO
    eights IF col1 == 8,
    bigs IF col1 > 8,
    smalls IF (col1 < 8 AND col1 > 0);

-- Top Ten:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;

-- Filtering:
b = FILTER a BY value < 3;
