SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Downloaden Sie, um offline zu lesen
MapReduce and HDFS

This presentation includes course content © University of Washington
 Redistributed under the Creative Commons Attribution 3.0 license.
                         All other contents:
                       © 2009 Cloudera, Inc.
Overview
• Why MapReduce?
• What is MapReduce?
• The Hadoop Distributed File System




                © 2009 Cloudera, Inc.
How MapReduce is Structured
• Functional programming meets distributed
  computing
• A batch data processing system
• Factors out many reliability concerns from
  application logic




                 © 2009 Cloudera, Inc.
MapReduce Provides:
•   Automatic parallelization & distribution
•   Fault-tolerance
•   Status and monitoring tools
•   A clean abstraction for programmers




                    © 2009 Cloudera, Inc.
Programming Model
• Borrows from functional programming
• Users implement interface of two
  functions:
  – map (in_key, in_value) ->
     (out_key, intermediate_value) list

  – reduce (out_key, intermediate_value list) ->
     out_value list




                   © 2009 Cloudera, Inc.
map
• Records from the data source (lines out of
  files, rows of a database, etc) are fed into
  the map function as key*value pairs: e.g.,
  (filename, line).
• map() produces one or more intermediate
  values along with an output key from the
  input.


                  © 2009 Cloudera, Inc.
map
map (in_key, in_value) ->
  (out_key, intermediate_value) list




              © 2009 Cloudera, Inc.
Example: Upper-case Mapper
let map(k, v) =
  emit(k.toUpper(), v.toUpper())

(“foo”, “bar”)      (“FOO”, “BAR”)
(“Foo”, “other”)      (“FOO”, “OTHER”)
(“key2”, “data”)      (“KEY2”, “DATA”)




                   © 2009 Cloudera, Inc.
Example: Explode Mapper
let map(k, v) =
  foreach char c in v:
    emit(k, c)

(“A”, “cats”)    (“A”, “c”), (“A”, “a”),
                 (“A”, “t”), (“A”, “s”)

(“B”, “hi”)     (“B”, “h”), (“B”, “i”)




                  © 2009 Cloudera, Inc.
Example: Filter Mapper
let map(k, v) =
  if (isPrime(v)) then emit(k, v)

(“foo”, 7)     (“foo”, 7)
(“test”, 10)     (nothing)




                  © 2009 Cloudera, Inc.
Example: Changing Keyspaces
let map(k, v) = emit(v.length(), v)

(“hi”, “test”)    (4, “test”)
(“x”, “quux”)    (4, “quux”)
(“y”, “abracadabra”)    (10, “abracadabra”)




                 © 2009 Cloudera, Inc.
reduce
• After the map phase is over, all the
  intermediate values for a given output key
  are combined together into a list
• reduce() combines those intermediate
  values into one or more final values for
  that same output key
• (in practice, usually only one final value
  per key)

                 © 2009 Cloudera, Inc.
reduce
reduce (out_key, intermediate_value list) ->
         out_value list




                  © 2009 Cloudera, Inc.
Example: Sum Reducer
let reduce(k, vals) =
  sum = 0
  foreach int v in vals:
    sum += v
  emit(k, sum)

(“A”, [42, 100, 312])    (“A”, 454)
(“B”, [12, 6, -2])    (“B”, 16)




                 © 2009 Cloudera, Inc.
Example: Identity Reducer
let reduce(k, vals) =
  foreach v in vals:
    emit(k, v)

(“A”, [42, 100, 312])    (“A”, 42),
                     (“A”, 100), (“A”, 312)

(“B”, [12, 6, -2])        (“B”, 12), (“B”, 6),
                         (“B”, -2)


                 © 2009 Cloudera, Inc.
© 2009 Cloudera, Inc.
Parallelism
• map() functions run in parallel, creating
  different intermediate values from different
  input data sets
• reduce() functions also run in parallel,
  each working on a different output key
• All values are processed independently
• Bottleneck: reduce phase can’t start until
  map phase is completely finished.


                  © 2009 Cloudera, Inc.
Example: Count word occurrences
map(String input_key, String input_value):
 // input_key: document name
 // input_value: document contents
 for each word w in input_value:
   emit(w, 1);

reduce(String output_key, Iterator<int>
  intermediate_values):
 // output_key: a word
 // output_values: a list of counts
 int result = 0;
 for each v in intermediate_values:
   result += v;
 emit(output_key, result);
                   © 2009 Cloudera, Inc.
Combining Phase
• Run on mapper nodes after map phase
• “Mini-reduce,” only on local map output
• Used to save bandwidth before sending
  data to full reducer
• Reducer can be combiner if commutative
  & associative
  – e.g., SumReducer


                 © 2009 Cloudera, Inc.
Combiner, graphically




       © 2009 Cloudera, Inc.
WordCount Redux
map(String input_key, String input_value):
 // input_key: document name
 // input_value: document contents
 for each word w in input_value:
   emit(w, 1);

reduce(String output_key, Iterator<int>
  intermediate_values):
 // output_key: a word
 // output_values: a list of counts
 int result = 0;
 for each v in intermediate_values:
   result += v;
 emit(output_key, result);
                   © 2009 Cloudera, Inc.
MapReduce Conclusions
• MapReduce has proven to be a useful
  abstraction in many areas
• Greatly simplifies large-scale computations
• Functional programming paradigm can be
  applied to large-scale applications
• You focus on the “real” problem, library deals
  with messy details




                    © 2009 Cloudera, Inc.
HDFS



Some slides designed by Alex Moschuk, University of Washington
Redistributed under the Creative Commons Attribution 3.0 license
HDFS: Motivation

• Based on Google’s GFS
• Redundant storage of massive amounts of
  data on cheap and unreliable computers
• Why not use an existing file system?
  – Different workload and design priorities
  – Handles much bigger dataset sizes than other
    filesystems


                  © 2009 Cloudera, Inc.
Assumptions
• High component failure rates
   – Inexpensive commodity components fail all the
     time
• “Modest” number of HUGE files
   – Just a few million
   – Each is 100MB or larger; multi-GB files typical
• Files are write-once, mostly appended to
   – Perhaps concurrently
• Large streaming reads
• High sustained throughput favored over low latency
                    © 2009 Cloudera, Inc.
HDFS Design Decisions

• Files stored as blocks
   – Much larger size than most filesystems (default is 64MB)
• Reliability through replication
   – Each block replicated across 3+ DataNodes
• Single master (NameNode) coordinates access, metadata
   – Simple centralized management
• No data caching
   – Little benefit due to large data sets, streaming reads
• Familiar interface, but customize the API
   – Simplify the problem; focus on distributed apps

                        © 2009 Cloudera, Inc.
HDFS Client Block Diagram
          $   "


                                          "




    ###                 !    !




                  © 2009 Cloudera, Inc.
Based on GFS Architecture




            Figure from “The Google File System,”
            Ghemawat et. al., SOSP 2003

         © 2009 Cloudera, Inc.
Metadata

• Single NameNode stores all metadata
  – Filenames, locations on DataNodes of each file
• Maintained entirely in RAM for fast lookup

• DataNodes store opaque file contents in
  “block” objects on underlying local filesystem




                    © 2009 Cloudera, Inc.
HDFS Conclusions

• HDFS supports large-scale processing
  workloads on commodity hardware
  – designed to tolerate frequent component failures
  – optimized for huge files that are mostly appended and
    read
  – filesystem interface is customized for the job, but still
    retains familiarity for developers
  – simple solutions can work (e.g., single master)


• Reliably stores several TB in individual clusters
                       © 2009 Cloudera, Inc.
2 Map Reduce And Hdfs

Weitere ähnliche Inhalte

Was ist angesagt?

Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part IMarin Dimitrov
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreKelly Technologies
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Emilio Coppa
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hiveSubhas Kumar Ghosh
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveWill Du
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advancedChirag Ahuja
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 

Was ist angesagt? (20)

Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
01 hbase
01 hbase01 hbase
01 hbase
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part I
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
Hive
HiveHive
Hive
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Map/Reduce intro
Map/Reduce introMap/Reduce intro
Map/Reduce intro
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 

Andere mochten auch

Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Adam Kawa
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)James Serra
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Kevin Weil
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 

Andere mochten auch (6)

Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 

Ähnlich wie 2 Map Reduce And Hdfs

Cascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Escalando Aplicaciones Web
Escalando Aplicaciones WebEscalando Aplicaciones Web
Escalando Aplicaciones WebSantiago Coffey
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseLukas Vlcek
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopSvetlin Nakov
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingCollin Bennett
 
Clogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewClogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewMadhur Nawandar
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudQubole
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKSkills Matter
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisFelicia Haggarty
 
Brust hadoopecosystem
Brust hadoopecosystemBrust hadoopecosystem
Brust hadoopecosystemAndrew Brust
 
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechApache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechCloudera Japan
 

Ähnlich wie 2 Map Reduce And Hdfs (20)

Cascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop World
 
Spark etl
Spark etlSpark etl
Spark etl
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Escalando Aplicaciones Web
Escalando Aplicaciones WebEscalando Aplicaciones Web
Escalando Aplicaciones Web
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache Hadoop
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive Computing
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Clogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewClogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overview
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
 
Brust hadoopecosystem
Brust hadoopecosystemBrust hadoopecosystem
Brust hadoopecosystem
 
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechApache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentech
 

Kürzlich hochgeladen

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Kürzlich hochgeladen (20)

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

2 Map Reduce And Hdfs

  • 1. MapReduce and HDFS This presentation includes course content © University of Washington Redistributed under the Creative Commons Attribution 3.0 license. All other contents: © 2009 Cloudera, Inc.
  • 2. Overview • Why MapReduce? • What is MapReduce? • The Hadoop Distributed File System © 2009 Cloudera, Inc.
  • 3. How MapReduce is Structured • Functional programming meets distributed computing • A batch data processing system • Factors out many reliability concerns from application logic © 2009 Cloudera, Inc.
  • 4. MapReduce Provides: • Automatic parallelization & distribution • Fault-tolerance • Status and monitoring tools • A clean abstraction for programmers © 2009 Cloudera, Inc.
  • 5. Programming Model • Borrows from functional programming • Users implement interface of two functions: – map (in_key, in_value) -> (out_key, intermediate_value) list – reduce (out_key, intermediate_value list) -> out_value list © 2009 Cloudera, Inc.
  • 6. map • Records from the data source (lines out of files, rows of a database, etc) are fed into the map function as key*value pairs: e.g., (filename, line). • map() produces one or more intermediate values along with an output key from the input. © 2009 Cloudera, Inc.
  • 7. map map (in_key, in_value) -> (out_key, intermediate_value) list © 2009 Cloudera, Inc.
  • 8. Example: Upper-case Mapper let map(k, v) = emit(k.toUpper(), v.toUpper()) (“foo”, “bar”) (“FOO”, “BAR”) (“Foo”, “other”) (“FOO”, “OTHER”) (“key2”, “data”) (“KEY2”, “DATA”) © 2009 Cloudera, Inc.
  • 9. Example: Explode Mapper let map(k, v) = foreach char c in v: emit(k, c) (“A”, “cats”) (“A”, “c”), (“A”, “a”), (“A”, “t”), (“A”, “s”) (“B”, “hi”) (“B”, “h”), (“B”, “i”) © 2009 Cloudera, Inc.
  • 10. Example: Filter Mapper let map(k, v) = if (isPrime(v)) then emit(k, v) (“foo”, 7) (“foo”, 7) (“test”, 10) (nothing) © 2009 Cloudera, Inc.
  • 11. Example: Changing Keyspaces let map(k, v) = emit(v.length(), v) (“hi”, “test”) (4, “test”) (“x”, “quux”) (4, “quux”) (“y”, “abracadabra”) (10, “abracadabra”) © 2009 Cloudera, Inc.
  • 12. reduce • After the map phase is over, all the intermediate values for a given output key are combined together into a list • reduce() combines those intermediate values into one or more final values for that same output key • (in practice, usually only one final value per key) © 2009 Cloudera, Inc.
  • 13. reduce reduce (out_key, intermediate_value list) -> out_value list © 2009 Cloudera, Inc.
  • 14. Example: Sum Reducer let reduce(k, vals) = sum = 0 foreach int v in vals: sum += v emit(k, sum) (“A”, [42, 100, 312]) (“A”, 454) (“B”, [12, 6, -2]) (“B”, 16) © 2009 Cloudera, Inc.
  • 15. Example: Identity Reducer let reduce(k, vals) = foreach v in vals: emit(k, v) (“A”, [42, 100, 312]) (“A”, 42), (“A”, 100), (“A”, 312) (“B”, [12, 6, -2]) (“B”, 12), (“B”, 6), (“B”, -2) © 2009 Cloudera, Inc.
  • 17. Parallelism • map() functions run in parallel, creating different intermediate values from different input data sets • reduce() functions also run in parallel, each working on a different output key • All values are processed independently • Bottleneck: reduce phase can’t start until map phase is completely finished. © 2009 Cloudera, Inc.
  • 18. Example: Count word occurrences map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: emit(w, 1); reduce(String output_key, Iterator<int> intermediate_values): // output_key: a word // output_values: a list of counts int result = 0; for each v in intermediate_values: result += v; emit(output_key, result); © 2009 Cloudera, Inc.
  • 19. Combining Phase • Run on mapper nodes after map phase • “Mini-reduce,” only on local map output • Used to save bandwidth before sending data to full reducer • Reducer can be combiner if commutative & associative – e.g., SumReducer © 2009 Cloudera, Inc.
  • 20. Combiner, graphically © 2009 Cloudera, Inc.
  • 21. WordCount Redux map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: emit(w, 1); reduce(String output_key, Iterator<int> intermediate_values): // output_key: a word // output_values: a list of counts int result = 0; for each v in intermediate_values: result += v; emit(output_key, result); © 2009 Cloudera, Inc.
  • 22. MapReduce Conclusions • MapReduce has proven to be a useful abstraction in many areas • Greatly simplifies large-scale computations • Functional programming paradigm can be applied to large-scale applications • You focus on the “real” problem, library deals with messy details © 2009 Cloudera, Inc.
  • 23. HDFS Some slides designed by Alex Moschuk, University of Washington Redistributed under the Creative Commons Attribution 3.0 license
  • 24. HDFS: Motivation • Based on Google’s GFS • Redundant storage of massive amounts of data on cheap and unreliable computers • Why not use an existing file system? – Different workload and design priorities – Handles much bigger dataset sizes than other filesystems © 2009 Cloudera, Inc.
  • 25. Assumptions • High component failure rates – Inexpensive commodity components fail all the time • “Modest” number of HUGE files – Just a few million – Each is 100MB or larger; multi-GB files typical • Files are write-once, mostly appended to – Perhaps concurrently • Large streaming reads • High sustained throughput favored over low latency © 2009 Cloudera, Inc.
  • 26. HDFS Design Decisions • Files stored as blocks – Much larger size than most filesystems (default is 64MB) • Reliability through replication – Each block replicated across 3+ DataNodes • Single master (NameNode) coordinates access, metadata – Simple centralized management • No data caching – Little benefit due to large data sets, streaming reads • Familiar interface, but customize the API – Simplify the problem; focus on distributed apps © 2009 Cloudera, Inc.
  • 27. HDFS Client Block Diagram $ " " ### ! ! © 2009 Cloudera, Inc.
  • 28. Based on GFS Architecture Figure from “The Google File System,” Ghemawat et. al., SOSP 2003 © 2009 Cloudera, Inc.
  • 29. Metadata • Single NameNode stores all metadata – Filenames, locations on DataNodes of each file • Maintained entirely in RAM for fast lookup • DataNodes store opaque file contents in “block” objects on underlying local filesystem © 2009 Cloudera, Inc.
  • 30. HDFS Conclusions • HDFS supports large-scale processing workloads on commodity hardware – designed to tolerate frequent component failures – optimized for huge files that are mostly appended and read – filesystem interface is customized for the job, but still retains familiarity for developers – simple solutions can work (e.g., single master) • Reliably stores several TB in individual clusters © 2009 Cloudera, Inc.