2. Big Data Explosion
• 90% of today's data was created in the last 2 years
• Data volume doubles roughly every 18 months (an analogue of Moore's law)
• YouTube: 13 million hours of video and 700 billion views in 2010
• Facebook: 20TB/day (compressed)
• CERN/LHC: 40TB/day (15PB/year)
• Many more examples
4. Challenges!
• How to assign units of work to the workers?
• What if there are more units of work than workers?
• What if the workers need to share intermediate, incomplete data?
• How do we aggregate such intermediate data?
• How do we know when all workers have completed their assignments?
• What if some workers failed?
5. History
• 2000: Apache Lucene: batch index updates and sort/merge with an on-disk index
• 2002: Apache Nutch: distributed, scalable open-source web crawler
• 2004: Google publishes the GFS and MapReduce papers
• 2006: Apache Hadoop: open-source Java implementation of GFS and MapReduce to solve Nutch's scaling problem; later becomes a standalone project
6. What is MapReduce?
• A programming model for distributing a task across multiple nodes
• Used to develop solutions that process large amounts of data in parallel on clusters of computing nodes
• Original MapReduce paper by Google
• Features of MapReduce:
• Fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
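To make the "clean abstraction" concrete, here is a minimal single-process sketch of the model in Python (Hadoop itself is Java; this is an illustration, not the framework's API). The user supplies only `map_fn` and `reduce_fn`; the "framework" here is just two loops and an in-memory shuffle, whereas a real cluster distributes these calls across many nodes and handles fault tolerance.

```python
# Minimal sketch of the MapReduce programming model (single process).
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each (key, value) record yields zero or more (k2, v2) pairs.
    grouped = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            grouped[k2].append(v2)  # shuffle: group values by key
    # Reduce phase: aggregate all values observed for each key.
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}

# Example: word count expressed in the model.
lines = [(0, "a b a"), (1, "b c")]
counts = run_mapreduce(
    lines,
    map_fn=lambda _, line: [(w, 1) for w in line.split()],
    reduce_fn=lambda _, ones: sum(ones),
)
# counts == {"a": 2, "b": 2, "c": 1}
```

Note that the framework, not the programmer, decides where each map and reduce call runs; that separation is what makes the abstraction clean.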
10. HDFS Basics
• HDFS is a filesystem written in Java
• Sits on top of a native filesystem
• Provides redundant storage for massive amounts of data
• Uses commodity hardware
11. HDFS Data
• Data is split into blocks and stored on multiple nodes in the cluster
• Each block is usually 64 MB or 128 MB
• Each block is replicated multiple times
• Replicas stored on different data nodes
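A quick back-of-the-envelope sketch of what these numbers mean for storage, using the 64 MB block size and a replication factor of 3 from the slides (these are illustrative defaults, not values read from a real cluster):

```python
# Sketch: how many HDFS blocks a file occupies, and how much raw disk
# space its replicas consume. Block size and replication factor are the
# defaults mentioned on the slides.
import math

BLOCK_SIZE_MB = 64
REPLICATION = 3

def hdfs_footprint(file_size_mb):
    # Files are split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every byte is stored REPLICATION times across different DataNodes.
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, raw_storage_mb

# A 200 MB file occupies 4 blocks (64 + 64 + 64 + 8 MB)
# and consumes 600 MB of raw storage across the cluster.
```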
13. Master Node
• NameNode
• only 1 per cluster
• metadata server and database
• SecondaryNameNode helps with some housekeeping
• JobTracker
• only 1 per cluster
• job scheduler
14. Slave Nodes
• DataNodes
• 1-4000 per cluster
• block data storage
• TaskTrackers
• 1-4000 per cluster
• task execution
15. NameNode
• A single NameNode stores all metadata and manages block replication and read/write access to files
• Filenames, block locations on DataNodes, owner, group, etc.
• All metadata is kept in RAM for fast lookup
16. Secondary NameNode
• Performs memory-intensive housekeeping on behalf of the NameNode; it is not a standby NameNode
• Should run on a separate machine
17. Data Node
• DataNodes store file contents
• Different blocks of the same file are stored on different DataNodes
• The same block is stored on three (or more) DataNodes for redundancy
18. Word Count Example
• Input
• Text files
• Output
• Single file containing (Word <TAB> Count)
• Map Phase
• Generates (Word, Count) pairs
• [{a,1}, {b,1}, {a,1}], [{a,2}, {b,3}, {c,5}], [{a,3}, {b,1}, {c,1}]
• Reduce Phase
• For each word, calculates aggregate
• [{a,7}, {b,5}, {c,6}]
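The shuffle and reduce steps above can be sketched directly from the slide's numbers (again in Python for illustration; a real Hadoop job would implement a Java `Reducer`):

```python
# Sketch of the shuffle + reduce steps for the word count example,
# using the three mapper outputs shown on the slide.
from collections import defaultdict

mapper_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 2), ("b", 3), ("c", 5)],
    [("a", 3), ("b", 1), ("c", 1)],
]

# Shuffle: gather all counts for each word across every mapper's output.
grouped = defaultdict(list)
for output in mapper_outputs:
    for word, count in output:
        grouped[word].append(count)

# Reduce: sum the counts per word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
# word_counts == {"a": 7, "b": 5, "c": 6}, matching the slide.
```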
19. Typical Cluster
• 3-4000 commodity servers
• Each server
• 2x quad-core CPUs
• 16-24 GB RAM
• 4-12 TB disk space
• 20-30 servers per rack
20. When Should I Use It?
Good choice for work that can be broken into parallel jobs:
• Indexing/analysis of log files
• Sorting of large data sets
• Image processing/machine learning
Bad choice for serial or low-latency jobs:
• Real-time processing
• Processing-intensive tasks with little data
• Replacing MySQL