An Introduction to MapReduce
             Presented by Frane Bandov
    at the Operating Complex IT-Systems seminar
                  Berlin, 1/26/2010
Outline
•  Introduction
•  Google MapReduce
    –  Idea
    –  Overview
    –  Fault Tolerance
    –  GFS: Google File System
    –  Job Example
•  Alternative Implementations
•  Reception and Criticism
•  Trends and Future Development
•  Conclusion
2/16/10             An Introduction to MapReduce   2
Introduction – Problem
Sometimes we have to deal with huge amounts
                 of data
[Bar chart: data volume in TBytes (axis from 0 to 250) for You, Facebook, Yahoo! Groups, and the German Climate Computing Centre]
Introduction – Problem
    The data needs to be processed, but how?


     Can't process all of this data on one machine
     → Distribute the processing to many machines




Introduction – Approach
           Distributed computing is the solution
           “Let’s write our own distributed computing
              software as a solution to our problem”
         Checklist
     –  design protocols
     –  design data structures
     –  write the code
     –  ensure fault tolerance

     → Development takes a long time
     → Expensive: cost-benefit ratio?



   Build complex software for simple computations?

Google MapReduce – Idea
      A framework for distributed computing

  Don't care about protocols, fault tolerance, etc.

           Just write your simple computation




Google MapReduce – Idea
              MapReduce Paradigm
Map: Apply a function to all elements of a list

    square x = x * x;
    map square [1, 2, 3, 4, 5];
    → [1, 4, 9, 16, 25]

Reduce: Combine all elements of a list

    reduce (+) [1, 2, 3, 4, 5];
    → 15




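The map and reduce operations from the paradigm slide can be tried out in Python. This is a minimal sketch: Python's built-in `map` and `functools.reduce` stand in for the functional notation used on the slide, and the variable names are illustrative.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


numbers = [1, 2, 3, 4, 5]

# Map: apply a function to every element of the list.
squares = list(map(square, numbers))

# Reduce: combine all elements of the list into one value, here with "+".
total = reduce(add, numbers)

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 15
```

Everything runs sequentially here. The map step treats each element independently, which is the property that allows it to be spread over many machines.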
Google MapReduce – Idea
               Basic functioning



      Input → Map → Reduce → Output




Google MapReduce – Overview
[Diagram: MapReduce-based user program with one master and five workers. The input file in GFS is divided into Split 1 to Split 5 → map phase: three workers write Intermediate Files 1 to 3 → reduce phase: two workers write output Files 1 and 2 to GFS]
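The data flow of the overview can be imitated inside a single process. The following is a minimal sketch under stated assumptions, not Google's implementation: `run_mapreduce`, `partition`, `map_words` and `sum_counts` are hypothetical names, word counting is an assumed example job, and plain dictionaries stand in for the intermediate and output files.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Stable hash, so the same key always ends up at the same reduce worker.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_mapreduce(splits, map_fn, reduce_fn, num_reducers=2):
    # Map phase: every input split is handled by one (simulated) map worker.
    # Simplification: the pairs of all map workers are collected directly
    # in one intermediate bucket per reduce worker.
    intermediate = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in map_fn(split):
            intermediate[partition(key, num_reducers)][key].append(value)

    # Reduce phase: every reduce worker combines the values of its keys
    # and produces one output "file" (here: one dictionary).
    return [
        {key: reduce_fn(key, values) for key, values in sorted(bucket.items())}
        for bucket in intermediate
    ]


def map_words(split):
    # Assumed example job: emit (word, 1) for every word in the split.
    for word in split.split():
        yield word, 1


def sum_counts(key, values):
    return sum(values)


splits = ["the quick brown fox", "the lazy dog", "the fox"]
output_files = run_mapreduce(splits, map_words, sum_counts, num_reducers=2)
for number, content in enumerate(output_files, start=1):
    print(f"File {number}: {content}")
```

The user only supplies `map_words` and `sum_counts`. Splitting, partitioning and collecting the output stay inside the framework function, which is the division of labour the overview describes.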
MapReduce – Fault Tolerance
•  Workers are periodically pinged by the master
•  No answer within a certain time → worker is marked as failed

Mapper fails:
     –  Reset its map job to idle
     –  Even if the job was already completed, because its intermediate
        files are now inaccessible
     –  Notify reducers where to get the new intermediate file
Reducer fails:
     –  Reset its job to idle
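The bookkeeping behind these rules can be sketched as follows. This is an illustration under assumptions, not the real master: the `Task` and `Master` classes, the state names and the explicit `now` parameter are hypothetical, and notifying the reducers is left out. Leaving completed reduce tasks untouched follows the original MapReduce paper, where reduce output is already stored in the global file system.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = IDLE
    worker: Optional[str] = None  # worker the task is assigned to


class Master:
    def __init__(self, timeout):
        self.timeout = timeout    # seconds without a reply before a worker counts as failed
        self.last_reply = {}      # worker name -> time of its last ping reply
        self.tasks = []

    def record_reply(self, worker, now):
        self.last_reply[worker] = now

    def check_workers(self, now):
        failed = [w for w, seen in self.last_reply.items()
                  if now - seen > self.timeout]
        for worker in failed:
            del self.last_reply[worker]
            for task in self.tasks:
                if task.worker != worker:
                    continue
                # Map output lives on the failed machine, so map tasks are
                # reset even when completed. Reduce tasks are reset only
                # while they are still in progress.
                if task.kind == "map" or task.state == IN_PROGRESS:
                    task.state, task.worker = IDLE, None
        return failed


master = Master(timeout=10)
master.tasks = [Task("map", COMPLETED, "w1"), Task("reduce", IN_PROGRESS, "w2")]
master.record_reply("w1", now=0)
master.record_reply("w2", now=0)
master.record_reply("w2", now=8)        # only w2 keeps answering
failed = master.check_workers(now=15)   # w1 has been silent for 15 seconds
print(failed, master.tasks)
```

Passing `now` explicitly instead of reading the clock keeps the sketch deterministic; a real master would use the time of each ping reply.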
MapReduce – Fault Tolerance
Master fails:
     –  Master periodically writes checkpoints
     –  In case of failure, the MapReduce operation is aborted
     –  Operation can be restarted from the last checkpoint
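Checkpointing and restarting can be sketched in a few lines. The JSON file format, the function names `save_checkpoint` and `restore_checkpoint`, and the rule that tasks running at checkpoint time are rescheduled are all assumptions made for this illustration; the slides do not say what the master's checkpoint contains.

```python
import json
import os


def save_checkpoint(task_states, path):
    # Write to a temporary file first and rename it afterwards, so a crash
    # during writing never leaves a half-written checkpoint behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(task_states, f)
    os.replace(tmp_path, path)


def restore_checkpoint(path):
    with open(path, encoding="utf-8") as f:
        task_states = json.load(f)
    # Assumption: work that was running when the checkpoint was taken
    # cannot be trusted after a restart and is scheduled again.
    return {
        task: ("idle" if state == "in_progress" else state)
        for task, state in task_states.items()
    }
```

A restarted master would call `restore_checkpoint` once and continue scheduling from the returned task states.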




Google MapReduce – GFS
               Google File System
•  In-house distributed file system at Google
•  Stores all input and output files
•  Stores files…
     – divided into 64 MB blocks
     – on at least 3 different machines
•  Machines running GFS also
   run MapReduce
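The two storage figures above translate into simple arithmetic. A minimal sketch, assuming exactly three copies per block (the slide states "at least 3") and sizes given in whole megabytes; the function names are illustrative.

```python
import math

BLOCK_SIZE_MB = 64   # block size stated on the slide
REPLICAS = 3         # minimum number of copies stated on the slide


def block_count(file_size_mb):
    # The last block of a file may be only partially filled.
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


def block_copies(file_size_mb):
    # Every block is stored on REPLICAS different machines.
    return block_count(file_size_mb) * REPLICAS


print(block_count(1024))   # 16 blocks for a 1 GB file
print(block_copies(1024))  # 48 block copies spread over the cluster
```

Each block can serve as one input split, so a 1 GB file stored this way gives the map phase 16 independent pieces of work.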
Google MapReduce – Job Example




Alternative Implementations
Apache Hadoop

•    Open-source implementation in Java
•    Jobs can be written in C++, Java, Python, etc.
•    Used by Yahoo!, Facebook, Amazon and others
•    Most commonly used implementation
•    HDFS as an open-source implementation of GFS
•    Can also use Amazon S3, HTTP(S) or FTP for file access
•    Extensions: Hive, Pig, HBase
Alternative Implementations
Mars
     MapReduce implementation for NVIDIA GPUs using the CUDA framework

MapReduce-Cell
     Implementation for the Cell multi-core processor

Qizmt
     MySpace's implementation of MapReduce in C#

Alternative Implementations


     There are many other open- and closed-
     source implementations of MapReduce!




Reception and Criticism
•  Yahoo!: Hadoop on a 10,000-server cluster
•  Facebook analyses its daily log (25 TB) on
   a 1,000-server cluster
•  Amazon Elastic MapReduce: Hadoop
   clusters for rent on EC2 and S3
•  IBM and Google: Support university
   courses in distributed programming
•  UC Berkeley announced that it will teach
   freshmen MapReduce programming
Reception and Criticism
•  Criticism mainly by RDBMS experts
   DeWitt and Stonebraker
•  MapReduce
     – is a step backwards in database access
     – is a poor implementation
     – is not novel
     – is missing features that are routinely provided
       by modern DBMSs
     – is incompatible with the DBMS tools
Reception and Criticism
               Response to criticism

              MapReduce is not an RDBMS

   It is well suited for processing and structuring huge
               amounts of unstructured data

      MapReduce's big innovation is that it enables
     distributing data processing across a network of
         cheap and possibly unreliable computers
Trends and Future Development
   Trend of using MapReduce/Hadoop as a
                 parallel database

•  Hive: Query language for Hadoop
•  HBase: Column-oriented distributed database
   (modeled after Google’s BigTable)
•  Map-Reduce-Merge: Adding merge to the
   paradigm allows implementing features of
   relational algebra
Trends and Future Development
   Trend of using the MapReduce paradigm to
       make better use of multi-core CPUs

•  Qt Concurrent
     –  Simplified C++ version of MapReduce for distributing
        tasks between multiple processor cores
•  Mars
•  MapReduce-Cell


Conclusion
                        MapReduce

     provides an easy solution for the processing of
                  large amounts of data

          brings a paradigm shift in programming

                      changed the world,
          i.e. made data processing more efficient and
          cheaper, and is the foundation of many other
                   approaches and solutions
Questions?




Thank You!




2/16/10    An Introduction to MapReduce   34

Weitere ähnliche Inhalte

Was ist angesagt?

YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemSandip Darwade
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reducePaladion Networks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)Subhas Kumar Ghosh
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Hadoop et son Êcosystème
Hadoop et son ÊcosystèmeHadoop et son Êcosystème
Hadoop et son ÊcosystèmeKhanh Maudoux
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNDataWorks Summit
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesDenodo
 
Büyük veri teknolojilerine giriş v1l
Büyük veri teknolojilerine giriş v1lBüyük veri teknolojilerine giriş v1l
Büyük veri teknolojilerine giriş v1lHakan Ilter
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map ReduceApache Apex
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterEdureka!
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 

Was ist angesagt? (20)

YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Hadoop et son Êcosystème
Hadoop et son ÊcosystèmeHadoop et son Êcosystème
Hadoop et son Êcosystème
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARN
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data Lakes
 
Büyük veri teknolojilerine giriş v1l
Büyük veri teknolojilerine giriş v1lBüyük veri teknolojilerine giriş v1l
Büyük veri teknolojilerine giriş v1l
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node Cluster
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 

Ähnlich wie An Introduction to MapReduce

Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtechJakir Hossain
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot ConfigurationsMap Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurationsdbpublications
 
Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Ankit Gupta
 
MapReduce Programming Model
MapReduce Programming ModelMapReduce Programming Model
MapReduce Programming ModelAdarshaDhakal
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportBhushan Kulkarni
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
E031201032036
E031201032036E031201032036
E031201032036ijceronline
 
An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)Yu Liu
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseLukas Vlcek
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415SANTOSH WAYAL
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringBADR
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniquesijsrd.com
 
Big Data Technology
Big Data TechnologyBig Data Technology
Big Data TechnologyJuan J. Mostazo
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programsjani shaik
 
Hybrid Map Task Scheduling for GPU-based Heterogeneous Clusters
Hybrid Map Task Scheduling for GPU-based Heterogeneous ClustersHybrid Map Task Scheduling for GPU-based Heterogeneous Clusters
Hybrid Map Task Scheduling for GPU-based Heterogeneous ClustersKoichi Shirahata
 

Ähnlich wie An Introduction to MapReduce (20)

Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Mapreduce Hadop.pptx
Mapreduce Hadop.pptxMapreduce Hadop.pptx
Mapreduce Hadop.pptx
 
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot ConfigurationsMap Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
 
Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)
 
MapReduce Programming Model
MapReduce Programming ModelMapReduce Programming Model
MapReduce Programming Model
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
E031201032036
E031201032036E031201032036
E031201032036
 
An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
Big Data Technology
Big Data TechnologyBig Data Technology
Big Data Technology
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programs
 
Hybrid Map Task Scheduling for GPU-based Heterogeneous Clusters
Hybrid Map Task Scheduling for GPU-based Heterogeneous ClustersHybrid Map Task Scheduling for GPU-based Heterogeneous Clusters
Hybrid Map Task Scheduling for GPU-based Heterogeneous Clusters
 

KĂźrzlich hochgeladen

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

KĂźrzlich hochgeladen (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

An Introduction to MapReduce

  • 1. An Introduction to MapReduce Presented by Frane Bandov at the Operating Complex IT-Systems seminar Berlin, 1/26/2010
  • 2. Outline •  Introduction •  Google MapReduce –  Idea –  Overview –  Fault Tolerance –  GFS: Google File System –  Job Example •  Alternative Implementations •  Reception and Criticism •  Trends and Future Development •  Conclusion 2/16/10 An Introduction to MapReduce 2
  • 4. Introduction – Problem Sometimes we have to deal with huge amounts of data [Bar chart: data volume in TBytes, axis from 0 to 250, for You, Facebook, Yahoo! Groups and the German Climate Computing Centre]
  • 5. Introduction – Problem The data needs to be processed, but how? Can't process all of this data on one machine → Distribute the processing to many machines
  • 6. Introduction – Approach Distributed computing is the solution “Let’s write our own distributed computing software as a solution to our problem” Checklist: •  design protocols •  design data structures •  write the code •  assure failure tolerance → Development takes a long time → Expensive: cost-benefit ratio? Build complex software for simple computations?
  • 8. Google MapReduce – Idea A framework for distributed computing Don't care about protocols, failure tolerance, etc. Just write your simple computation
  • 9. Google MapReduce – Idea MapReduce Paradigm Map: Apply function to all elements of a list square x = x * x; map square [1, 2, 3, 4, 5]; → [1, 4, 9, 16, 25] Reduce: Combine all elements of a list reduce (+) [1, 2, 3, 4, 5]; → 15
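The two examples on this slide can be reproduced in Python as a minimal sketch. The built-in `map` and `functools.reduce` stand in for the slide's functional notation; the variable names `numbers`, `squares` and `total` are chosen here for illustration only.

```python
from functools import reduce
import operator


def square(x):
    """The function applied to every element in the map step."""
    return x * x


numbers = [1, 2, 3, 4, 5]

# Map: apply a function to all elements of a list.
squares = list(map(square, numbers))

# Reduce: combine all elements of a list with a binary operator, here (+).
total = reduce(operator.add, numbers)

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 15
```

Every call to `square` is independent of the others, which is the property that lets a framework spread the map step over many machines.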
  • 10. Google MapReduce – Idea Basic functioning: Input → Map → Reduce → Output
  • 11. Google MapReduce – Overview [Diagram: a MapReduce-based user program and a master; an input file in GFS divided into Splits 1 to 5; workers in the map phase writing Intermediate Files 1 to 3; workers in the reduce phase writing output Files 1 and 2 to GFS]
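The data flow of the overview (input splits, map phase, intermediate files, reduce phase, output files) can be imitated in a single process. The following is a sketch under stated assumptions, not Google's implementation: word counting is picked as the example job, and `map_fn`, `reducer_index`, `reduce_fn` and `run_job` are hypothetical names. In-memory dictionaries stand in for the intermediate and output files.

```python
from collections import defaultdict


def map_fn(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def reducer_index(key, num_reducers):
    """Pick the reduce worker for a key (assumed partitioning rule).

    A sum of code points is used instead of hash() so the result is
    the same on every run.
    """
    return sum(ord(ch) for ch in key) % num_reducers


def reduce_fn(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    # Map phase: each split is processed independently; the output is
    # grouped into one intermediate bucket per reduce worker.
    intermediate = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in map_fn(split):
            intermediate[reducer_index(key, num_reducers)][key].append(value)

    # Reduce phase: each reduce worker produces one output "file".
    output_files = []
    for bucket in intermediate:
        output_files.append(
            dict(reduce_fn(k, v) for k, v in sorted(bucket.items()))
        )
    return output_files


splits = ["the quick brown fox", "the lazy dog", "the fox"]
output_files = run_job(splits)
merged = {k: v for part in output_files for k, v in part.items()}
print(merged)
```

Because every key is routed to exactly one reduce worker, the output files never overlap and can simply be read side by side.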
  • 12. MapReduce – Fault Tolerance •  Workers are periodically pinged by master •  No answer over certain time → worker failed Mapper fails: –  Reset map job as idle –  Even if job was completed → intermediate files are inaccessible –  Notify reducers where to get the new intermediate file Reducer fails: –  Reset its job as idle
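The worker-failure rules can be sketched as a small bookkeeping routine on the master. This is an illustration only: `Task`, `handle_failures`, the state names and the timestamps are assumptions, and real pings are replaced by a table of last reply times. One detail goes beyond the slide and follows the original MapReduce paper: a completed reduce task is not redone, because its output already lies in the global file system.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = IDLE
    worker: Optional[str] = None  # worker the task is assigned to


def handle_failures(tasks, last_reply, now, timeout):
    """Mark silent workers as failed and reset their tasks to idle."""
    # No answer over a certain time -> worker failed.
    failed = {w for w, t in last_reply.items() if now - t > timeout}
    rescheduled = []
    for task in tasks:
        if task.worker not in failed:
            continue
        # Completed map tasks are redone too: their intermediate files
        # lived on the failed machine and are no longer accessible.
        lost_map_output = task.kind == "map" and task.state == COMPLETED
        if task.state == IN_PROGRESS or lost_map_output:
            task.state, task.worker = IDLE, None
            rescheduled.append(task)
    return failed, rescheduled


tasks = [
    Task("map", COMPLETED, "w1"),
    Task("map", IN_PROGRESS, "w2"),
    Task("reduce", IN_PROGRESS, "w1"),
    Task("reduce", COMPLETED, "w3"),
]
last_reply = {"w1": 0.0, "w2": 9.0, "w3": 1.0}  # seconds, hypothetical
failed, rescheduled = handle_failures(tasks, last_reply, now=10.0, timeout=5.0)
print(sorted(failed), len(rescheduled))
```

After a map task is rerun on another worker, the master would still have to tell the reducers where the new intermediate file is; that notification step is left out of the sketch.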
  • 13. MapReduce – Fault Tolerance Master fails: –  Master periodically sets checkpoints –  In case of failure the MapReduce operation is aborted –  Operation can be restarted from the last checkpoint
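Checkpoint and restart can be illustrated with a small sketch. The JSON layout, the file name `master.ckpt` and the helpers `save_checkpoint` and `load_checkpoint` are assumptions made for this example; the slides do not say how Google's master stores its checkpoints.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the master's task table, replacing the old checkpoint atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # a crash mid-write leaves the old file intact


def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "master.ckpt")
    state = {
        "map_tasks": {"0": "completed", "1": "in_progress"},
        "reduce_tasks": {"0": "idle"},
    }
    save_checkpoint(path, state)

    # The master fails here; a restarted master reloads the last checkpoint.
    restored = load_checkpoint(path)

    # Work that was in progress at checkpoint time is scheduled again.
    for group in restored.values():
        for task_id, status in group.items():
            if status == "in_progress":
                group[task_id] = "idle"

print(restored)
```

Resetting in-progress tasks to idle after a restore is a design choice of this sketch: the checkpoint cannot tell whether those tasks finished after it was written.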
  • 14. Google MapReduce – GFS Google File System •  In-house distributed file system at Google •  Stores all input and output files •  Stores files… – divided into 64 MB blocks – on at least 3 different machines •  Machines running GFS also run MapReduce
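The two storage figures (64 MB blocks, at least 3 machines per block) translate into simple arithmetic. The sketch below assumes exactly 3 replicas and a round-robin placement; `storage_footprint`, `place_replicas` and the machine names are hypothetical, and the actual GFS placement policy is not described in the slides.

```python
import math

BLOCK_SIZE_MB = 64  # block size from the slide
REPLICAS = 3        # the slide says "at least 3"; exactly 3 is assumed here


def storage_footprint(file_size_mb):
    """Return (number of blocks, number of stored block copies)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICAS


def place_replicas(num_blocks, machines, replicas=REPLICAS):
    """Assign each block to `replicas` different machines (round-robin)."""
    if len(machines) < replicas:
        raise ValueError("need at least as many machines as replicas")
    return {
        block: [machines[(block + r) % len(machines)] for r in range(replicas)]
        for block in range(num_blocks)
    }


# A 1000 MB file: ceil(1000 / 64) = 16 blocks, stored as 48 block copies.
blocks, copies = storage_footprint(1000)
placement = place_replicas(blocks, ["m0", "m1", "m2", "m3", "m4"])
print(blocks, copies, placement[0])
```

Since the machines that store the blocks also run MapReduce, a map task can often be scheduled on a machine that already holds a copy of its input split.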
  • 15. Google MapReduce – Job Example
  • 16. Google MapReduce – Job Example
  • 17. Google MapReduce – Job Example
  • 18. Google MapReduce – Job Example
  • 20. Alternative Implementations Apache Hadoop •  Open-source implementation in Java •  Jobs can be written in C++, Java, Python, etc. •  Used by Yahoo!, Facebook, Amazon and others •  Most commonly used implementation •  HDFS as open-source implementation of GFS •  Can also use Amazon S3, HTTP(S) or FTP •  Extensions: Hive, Pig, HBase
  • 21. Alternative Implementations •  Mars: MapReduce implementation for NVIDIA GPUs using the CUDA framework •  MapReduce-Cell: Implementation for the Cell multi-core processor •  Qizmt: MySpace’s implementation of MapReduce in C#
  • 22. Alternative Implementations There are many other open- and closed-source implementations of MapReduce!
  • 24. Reception and Criticism •  Yahoo!: Hadoop on a 10,000 server cluster •  Facebook analyses the daily log (25 TB) on a 1,000 server cluster •  Amazon Elastic MapReduce: Hadoop clusters for rent on EC2 and S3 •  IBM and Google: Support university courses in distributed programming •  UC Berkeley announced that it will teach freshmen MapReduce programming
  • 25. Reception and Criticism
  • 26. Reception and Criticism •  Criticism mainly by RDBMS experts DeWitt and Stonebraker •  MapReduce – is a step backwards in database access – is a poor implementation – is not novel – is missing features that are routinely provided by modern DBMSs – is incompatible with the DBMS tools
  • 27. Reception and Criticism Response to criticism •  MapReduce is no RDBMS •  It is well suited to processing and structuring huge amounts of unstructured data •  MapReduce's big innovation is that it enables distributing data processing across a network of cheap and possibly unreliable computers
  • 29. Trends and Future Development Trend of utilizing MapReduce/Hadoop as a parallel database •  Hive: Query language for Hadoop •  HBase: Column-oriented distributed database (modeled after Google’s BigTable) •  Map-Reduce-Merge: Adding merge to the paradigm allows implementing features of relational algebra
  • 30. Trends and Future Development Trend to use the MapReduce paradigm to better utilize multi-core CPUs •  Qt Concurrent –  Simplified C++ version of MapReduce for distributing tasks between multiple processor cores •  Mars •  MapReduce-Cell
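The same pattern on a single machine can be sketched with Python's standard library: a pool of workers runs the map step and `functools.reduce` combines the partial results. This is not Qt Concurrent's API; `count_words` and `chunks` are made up for the example. A thread pool keeps the sketch self-contained, but because of CPython's global interpreter lock, CPU-bound work would normally use `ProcessPoolExecutor` from the same module to occupy several cores.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def count_words(chunk):
    """Map step: compute a partial result for one chunk of the input."""
    return len(chunk.split())


chunks = ["a b c", "d e", "f g h i"]

# Map: the pool hands the chunks to its workers; results keep input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# Reduce: combine the partial results into one value.
total_words = reduce(operator.add, partial_counts, 0)

print(partial_counts, total_words)
```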
  • 32. Conclusion MapReduce •  provides an easy solution for the processing of large amounts of data •  brings a paradigm shift in programming •  changed the world, i.e. it made data processing more efficient and cheaper and is the foundation of many other approaches and solutions
  • 33. Questions?
  • 34. Thank You!