SlideShare ist ein Scribd-Unternehmen logo
1 von 21
Map Reduce and
Hadoop
- S A L IL NAVG IR E
Big Data Explosion
• 90% of today's data was created in the last 2 years

• Moore's law: Data volume doubles every 18
months
• YouTube: 13 million hours and 700 billion views in
2010
• Facebook: 20TB/day (compressed)
• CERN/LHC: 40TB/day (15PB/year)

• Many more examples
Solution: Scalability
How?
Divide and Conquer
Challenges!
• How to assign units of work to the workers?

• What if there are more units of work than workers?
• What if the workers need to share intermediate
incomplete data?

• How do we aggregate such intermediate data?
• How do we know when all workers have completed
their assignments?

• What if some workers failed?
History
• 2000: Apache Lucene: batch index updates and
sort/merge with on disk index
• 2002: Apache Nutch: distributed, scalable open
source web crawler
• 2004: Google publishes GFS and MapReduce
papers
• 2006: Apache Hadoop: open source Java
implementation of GFS and MapReduce to solve
Nutch’ problem; later becomes standalone project
What is Map Reduce?
• A programming model to distribute a task on
multiple nodes
• Used to develop solutions that will process large
amounts of data in a parallelized fashion in clusters
of computing nodes

• Original MapReduce paper by Google
• Features of MapReduce:
• Fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
MapReduce Execution Overview
User
Program
fork
assign
map
Input Data
Split 0
read
Split 1
Split 2

fork
Master

fork

assign
reduce

Worker

Worker

Worker

local
write

Worker
Worker

remote
read,
sort

write

Output
File 0
Output
File 1
Hadoop Components

Storage

Processing

HDFS

MapReduce

Self-healing
high-bandwidth
clustered storage

Fault-tolerant
distributed
processing
HDFS Architecture
HDFS Basics
• HDFS is a filesystem written in Java
• Sits on top of a native filesystem
• Provides redundant storage for massive
amounts of data

• Use Commodity devices
HDFS Data
• Data is split into blocks and stored on
multiple nodes in the cluster
• Each block is usually 64 MB or 128 MB
• Each block is replicated multiple times
• Replicas stored on different data nodes
2 Types of Nodes
Slave Nodes
Master Nodes
Master Node
• NameNode
• only 1 per cluster
• metadata server and database
• SecondaryNameNode helps with some housekeeping

• JobTracker
• only 1 per cluster
• job scheduler
Slave Nodes
• DataNodes
• 1-4000 per cluster
• block data storage

• TaskTrackers
• 1-4000 per cluster
• task execution
NameNode
• A single NameNode stores all
metadata, replication of blocks and
read/write access to files
• Filenames, locations on DataNodes of each
block, owner, group, etc.
• All information maintained in RAM for fast
lookup
Secondary NameNode
• Does memory-intensive administrative
functions for the NameNode
• Should run on a separate machine
Data Node
• DataNodes store file contents
• Different blocks of the same file will be
stored on different DataNodes
• Same block is stored on three (or more)
DataNodes for redundancy
Word Count Example
• Input
• Text files

• Output
• Single file containing (Word <TAB> Count)

• Map Phase
• Generates (Word, Count) pairs
• [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}]

• Reduce Phase
• For each word, calculates aggregate
• [{a,7}, {b,5}, {c,6}]
Typical Cluster
• 3-4000 commodity servers

• Each server
• 2x quad-core
• 16-24 GB ram

• 4-12 TB disk space

• 20-30 servers per rack
When Should I use it?
Good choice for jobs that can be broken into parallelized jobs:

• Indexing/Analysis of log files
• Sorting of large data sets
• Image Processing/Machine Learning

Bad choice for serial or low latency jobs:
• For real-time processing
• For processing intensive task with little data
• Replacing MySQL
Who uses Hadoop?

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRavi Teja
 
Key Challenges in Cloud Computing and How Yahoo! is Approaching Them
Key Challenges in Cloud Computing and How Yahoo! is Approaching ThemKey Challenges in Cloud Computing and How Yahoo! is Approaching Them
Key Challenges in Cloud Computing and How Yahoo! is Approaching ThemYahoo Developer Network
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudShubham Tagra
 
Wikimedia Content API (Strangeloop)
Wikimedia Content API (Strangeloop)Wikimedia Content API (Strangeloop)
Wikimedia Content API (Strangeloop)Eric Evans
 
Gfs and map redusing
Gfs and map redusingGfs and map redusing
Gfs and map redusingilashanawaz
 
Mongo db cluster administration and Shredded Databases
Mongo db cluster administration and Shredded DatabasesMongo db cluster administration and Shredded Databases
Mongo db cluster administration and Shredded DatabasesAbhinav Jha
 
Using MongoDB For BigData in 20 Minutes
Using MongoDB For BigData in 20 MinutesUsing MongoDB For BigData in 20 Minutes
Using MongoDB For BigData in 20 MinutesAndrás Fehér
 
Mongo presentation conf
Mongo presentation confMongo presentation conf
Mongo presentation confShridhar Joshi
 
DSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De BoerDSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De BoerDeltares
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
SANSA ISWC 2017 Talk
SANSA ISWC 2017 TalkSANSA ISWC 2017 Talk
SANSA ISWC 2017 TalkJens Lehmann
 
Big Data Overview Part 1
Big Data Overview Part 1Big Data Overview Part 1
Big Data Overview Part 1William Simms
 
shark attack on sql-on-hadoop Talk at BerlinBuzzwords 2014
shark attack on sql-on-hadoop Talk at BerlinBuzzwords 2014shark attack on sql-on-hadoop Talk at BerlinBuzzwords 2014
shark attack on sql-on-hadoop Talk at BerlinBuzzwords 2014Gerd König
 

Was ist angesagt? (20)

Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Key Challenges in Cloud Computing and How Yahoo! is Approaching Them
Key Challenges in Cloud Computing and How Yahoo! is Approaching ThemKey Challenges in Cloud Computing and How Yahoo! is Approaching Them
Key Challenges in Cloud Computing and How Yahoo! is Approaching Them
 
Dataspace presentatie
Dataspace presentatieDataspace presentatie
Dataspace presentatie
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 
Wikimedia Content API (Strangeloop)
Wikimedia Content API (Strangeloop)Wikimedia Content API (Strangeloop)
Wikimedia Content API (Strangeloop)
 
Gfs and map redusing
Gfs and map redusingGfs and map redusing
Gfs and map redusing
 
Mongo db cluster administration and Shredded Databases
Mongo db cluster administration and Shredded DatabasesMongo db cluster administration and Shredded Databases
Mongo db cluster administration and Shredded Databases
 
RubiX
RubiXRubiX
RubiX
 
Using MongoDB For BigData in 20 Minutes
Using MongoDB For BigData in 20 MinutesUsing MongoDB For BigData in 20 Minutes
Using MongoDB For BigData in 20 Minutes
 
Mongo presentation conf
Mongo presentation confMongo presentation conf
Mongo presentation conf
 
DSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De BoerDSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De Boer
 
Why Spark for large scale data analysis
Why Spark for large scale data analysisWhy Spark for large scale data analysis
Why Spark for large scale data analysis
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
SANSA ISWC 2017 Talk
SANSA ISWC 2017 TalkSANSA ISWC 2017 Talk
SANSA ISWC 2017 Talk
 
Big Data Overview Part 1
Big Data Overview Part 1Big Data Overview Part 1
Big Data Overview Part 1
 
Apache cassandra
Apache cassandraApache cassandra
Apache cassandra
 
shark attack on sql-on-hadoop Talk at BerlinBuzzwords 2014
shark attack on sql-on-hadoop Talk at BerlinBuzzwords 2014shark attack on sql-on-hadoop Talk at BerlinBuzzwords 2014
shark attack on sql-on-hadoop Talk at BerlinBuzzwords 2014
 
Barcamp MySQL
Barcamp MySQLBarcamp MySQL
Barcamp MySQL
 
Mashing the data
Mashing the dataMashing the data
Mashing the data
 
Why geoserver
Why geoserverWhy geoserver
Why geoserver
 

Ähnlich wie MapReduce and Hadoop

Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
August 2013 HUG: Removing the NameNode's memory limitation
August 2013 HUG: Removing the NameNode's memory limitation August 2013 HUG: Removing the NameNode's memory limitation
August 2013 HUG: Removing the NameNode's memory limitation Yahoo Developer Network
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014Hassan Islamov
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop IntroductionJayant Mukherjee
 
MongoDB Administration 20110922
MongoDB Administration 20110922MongoDB Administration 20110922
MongoDB Administration 20110922radiocats
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - HadoopTalentica Software
 
Bigdata and Hadoop
 Bigdata and Hadoop Bigdata and Hadoop
Bigdata and HadoopGirish L
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 

Ähnlich wie MapReduce and Hadoop (20)

Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
August 2013 HUG: Removing the NameNode's memory limitation
August 2013 HUG: Removing the NameNode's memory limitation August 2013 HUG: Removing the NameNode's memory limitation
August 2013 HUG: Removing the NameNode's memory limitation
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Big data and hadoop anupama
Big data and hadoop anupamaBig data and hadoop anupama
Big data and hadoop anupama
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
MongoDB Administration 20110922
MongoDB Administration 20110922MongoDB Administration 20110922
MongoDB Administration 20110922
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - Hadoop
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Bigdata and Hadoop
 Bigdata and Hadoop Bigdata and Hadoop
Bigdata and Hadoop
 
Big data
Big dataBig data
Big data
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 

Kürzlich hochgeladen

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 

Kürzlich hochgeladen (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

MapReduce and Hadoop

  • 1. Map Reduce and Hadoop - S A L IL NAVG IR E
  • 2. Big Data Explosion • 90% of today's data was created in the last 2 years • Moore's law: Data volume doubles every 18 months • YouTube: 13 million hours and 700 billion views in 2010 • Facebook: 20TB/day (compressed) • CERN/LHC: 40TB/day (15PB/year) • Many more examples
  • 4. Challenges! • How to assign units of work to the workers? • What if there are more units of work than workers? • What if the workers need to share intermediate incomplete data? • How do we aggregate such intermediate data? • How do we know when all workers have completed their assignments? • What if some workers failed?
  • 5. History • 2000: Apache Lucene: batch index updates and sort/merge with on disk index • 2002: Apache Nutch: distributed, scalable open source web crawler • 2004: Google publishes GFS and MapReduce papers • 2006: Apache Hadoop: open source Java implementation of GFS and MapReduce to solve Nutch’ problem; later becomes standalone project
  • 6. What is Map Reduce? • A programming model to distribute a task on multiple nodes • Used to develop solutions that will process large amounts of data in a parallelized fashion in clusters of computing nodes • Original MapReduce paper by Google • Features of MapReduce: • Fault-tolerance • Status and monitoring tools • A clean abstraction for programmers
  • 7. MapReduce Execution Overview User Program fork assign map Input Data Split 0 read Split 1 Split 2 fork Master fork assign reduce Worker Worker Worker local write Worker Worker remote read, sort write Output File 0 Output File 1
  • 10. HDFS Basics • HDFS is a filesystem written in Java • Sits on top of a native filesystem • Provides redundant storage for massive amounts of data • Use Commodity devices
  • 11. HDFS Data • Data is split into blocks and stored on multiple nodes in the cluster • Each block is usually 64 MB or 128 MB • Each block is replicated multiple times • Replicas stored on different data nodes
  • 12. 2 Types of Nodes Slave Nodes Master Nodes
  • 13. Master Node • NameNode • only 1 per cluster • metadata server and database • SecondaryNameNode helps with some housekeeping • JobTracker • only 1 per cluster • job scheduler
  • 14. Slave Nodes • DataNodes • 1-4000 per cluster • block data storage • TaskTrackers • 1-4000 per cluster • task execution
  • 15. NameNode • A single NameNode stores all metadata, replication of blocks and read/write access to files • Filenames, locations on DataNodes of each block, owner, group, etc. • All information maintained in RAM for fast lookup
  • 16. Secondary NameNode • Does memory-intensive administrative functions for the NameNode • Should run on a separate machine
  • 17. Data Node • DataNodes store file contents • Different blocks of the same file will be stored on different DataNodes • Same block is stored on three (or more) DataNodes for redundancy
  • 18. Word Count Example • Input • Text files • Output • Single file containing (Word <TAB> Count) • Map Phase • Generates (Word, Count) pairs • [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}] • Reduce Phase • For each word, calculates aggregate • [{a,7}, {b,5}, {c,6}]
  • 19. Typical Cluster • 3-4000 commodity servers • Each server • 2x quad-core • 16-24 GB ram • 4-12 TB disk space • 20-30 servers per rack
  • 20. When Should I use it? Good choice for jobs that can be broken into parallelized jobs: • Indexing/Analysis of log files • Sorting of large data sets • Image Processing/Machine Learning Bad choice for serial or low latency jobs: • For real-time processing • For processing intensive task with little data • Replacing MySQL