Hadoop & MapReduce
          Dr. Ioannis Konstantinou
      http://www.cslab.ntua.gr/~ikons


           AWS Usergroup Greece
               18/07/2012


        Computing Systems Laboratory
 School of Electrical and Computer Engineering
    National Technical University of Athens
Big Data
90% of today's data was created in the last 2 years
Data volume doubles roughly every 18 months (a Moore's-law-like growth rate)
YouTube: 13 million hours of video and 700 billion views in 2010
Facebook: 20 TB/day (compressed)
CERN/LHC: 40 TB/day (15 PB/year)

Many more examples
Web logs, presentation files, medical records, etc.
Problem: Data explosion

   1 EB (exabyte = 10^18 bytes) = 1000 PB (petabyte = 10^15 bytes)
   Data traffic of mobile telephony in the USA in 2010

   1.2 ZB (zettabyte = 10^21 bytes) = 1200 EB
   Total of digital data in 2010

   35 ZB
   Estimated volume of total digital data in 2020
Solution: scalability

           How?
(Image source: Wikipedia, IBM Roadrunner)
Divide and Conquer

Figure: the "Problem" is partitioned into parts handled by workers w1, w2 and w3;
their partial results r1, r2 and r3 are then combined into the final "Result".
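A minimal sketch of this pattern in plain Python, using a process pool as the workers; the word-counting workload and the three-way split are illustrative assumptions, not something from the slides.

# Divide and conquer sketch: partition -> workers -> combine.
# The word-counting workload is an illustrative assumption.
from multiprocessing import Pool
from collections import Counter

def worker(part):
    # each worker (w1, w2, w3, ...) processes its own partition independently
    return Counter(word for line in part for word in line.split())

def divide_and_conquer(lines, n_workers=3):
    # partition: split the "problem" into non-overlapping chunks
    chunks = [lines[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partial_results = pool.map(worker, chunks)      # r1, r2, r3, ...
    # combine: merge the partial results into the final "result"
    return sum(partial_results, Counter())

if __name__ == "__main__":
    data = ["the quick brown fox", "the lazy dog", "the fox"]
    print(divide_and_conquer(data))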
Parallelization challenges
 How to assign units of work to the workers?
 What if there are more units of work than workers?
 What if the workers need to share intermediate incomplete
  data?
 How do we aggregate such intermediate data?
 How do we know when all workers have completed their
  assignments?
 What if some workers fail?
What is MapReduce?
A programming model
A programming framework
Used to develop solutions that will
    Process large amounts of data in a parallelized fashion

    In clusters of computing nodes

Originally a closed-source implementation at Google
    Scientific papers of ’03 & ’04 describe the framework

Hadoop: open-source implementation of the algorithms described in
  those papers
    http://hadoop.apache.org/
What is Hadoop?
 2 large subsystems, 1 for data management & 1 for computation:
     HDFS (Hadoop Distributed File System)
     MapReduce: the computation framework, which runs on top of HDFS
     HDFS is essentially the I/O layer of Hadoop
 Written in Java: a set of Java processes running on multiple nodes

 Who uses it:
     Yahoo!

     Amazon

     Facebook

     Twitter

     Plus many more...
HDFS – distributed file system

 A scalable distributed file system for applications dealing with
  large data sets.
    Distributed: runs in a cluster

    Scalable: 10K nodes, 100K files, 10 PB of storage

 Storage appears as a single, seamless space across the whole cluster
 Files broken into blocks
 Typical block size: 128 MB.
 Replication: Each block copied to multiple data nodes.
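Back-of-the-envelope arithmetic for what blocking and replication mean in practice; the 1 TB file size and the replication factor of 3 are assumptions for illustration (3 is the common HDFS default).

# How a large file maps onto HDFS blocks and replicas (illustrative numbers).
import math

BLOCK_SIZE = 128 * 1024**2      # 128 MB blocks, as on the slide
REPLICATION = 3                 # assumed replication factor (common default)

def hdfs_footprint(file_size_bytes):
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    raw = blocks * BLOCK_SIZE * REPLICATION     # upper bound; the last block may be partial
    return blocks, raw

blocks, raw = hdfs_footprint(1 * 1024**4)       # a 1 TB file
print(f"{blocks} blocks, ~{raw / 1024**4:.1f} TB of raw cluster storage")
# -> 8192 blocks, ~3.0 TB of raw cluster storage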
Architecture of HDFS/MapReduce
 Master/Slave scheme
    HDFS: A central NameNode administers multiple DataNodes
        NameNode: holds the metadata about which DataNode holds which file blocks
        DataNodes: "dummy" servers that hold the raw file chunks
    MapReduce: A central JobTracker administers multiple TaskTrackers

 NameNode and JobTracker run on the master
 DataNode and TaskTracker run on the slaves
MapReduce
The problem is broken down into 2 phases (a minimal sketch follows below).
   ●
       Map: Non-overlapping sets of input data (<key,value> records) are
       assigned to different processes (mappers) that produce a set of
       intermediate <key,value> results
   ●
       Reduce: The output of the Map phase is fed to a typically smaller
       number of processes (reducers) that aggregate it into a smaller number
       of <key,value> records.
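The sketch below simulates the two phases in plain Python on a single machine; it only illustrates the data flow (map emits intermediate pairs, the framework groups them by key, reduce aggregates each group) and is not the actual Hadoop API.

# Single-machine simulation of the Map and Reduce phases.
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in records:                        # Map phase
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)   # group ("shuffle") by key
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}   # Reduce phase

def map_fn(doc_name, text):                           # emits an intermediate pair per word
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):                          # aggregates all values of one key
    return sum(counts)

records = [("d1", "w1 w2 w4"), ("d2", "w1 w2 w3 w4")]
print(run_mapreduce(records, map_fn, reduce_fn))      # {'w1': 2, 'w2': 2, 'w4': 2, 'w3': 1}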
How does it work?
Initialization phase
The input is uploaded to HDFS and split into pieces of fixed size
Each TaskTracker node that participates in the computation executes a copy
 of the MapReduce program
One of the nodes plays the JobTracker (master) role; it assigns tasks to the
 rest (the workers). Tasks can be either of type map or of type reduce.
JobTracker (Master)
The JobTracker holds data about:
  the status of tasks
  the location of input, output and intermediate data (it runs together with
    the NameNode, the HDFS master)
The master is responsible for scheduling the execution of work tasks.
TaskTracker (Slave)
The TaskTracker runs the tasks assigned by the master.
It runs on the same node as the DataNode (the HDFS slave)
A task can be either of type Map or of type Reduce
Typically the maximum number of concurrent tasks a node can run is equal to
 the number of CPU cores it has (achieving optimal CPU utilization)
Map task
 A worker (TaskTracker) that has been assigned a map task:
    ●
        Reads the relevant input data (the input split) from HDFS, parses it into
        <key, value> pairs and passes them as input to the map function.
    ●
        The map function processes the pairs and produces intermediate pairs that
        are buffered in memory.
    ●
        Periodically a partition function is executed which stores the intermediate
        key-value pairs on the local node's storage, grouping them into R sets.
        This function is user defined.
    ●
        When the partition function has stored the key-value pairs, it informs the
        master that the task is complete and where the data are stored.
    ●
        The master forwards this information to the workers that run the reduce
        tasks (a sketch of the partition step follows below).
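A sketch of the map-side partition step described above: buffered intermediate pairs are split into R sets, one per reducer, and spilled to local files. The hash function and file layout here are assumptions, not Hadoop's actual internals.

# Map-side partition step: split buffered <key, value> pairs into R sets,
# one per reducer, and write them to local files. Hash and layout are assumptions.
import os, zlib

def default_partition(key, num_reducers):
    # deterministic hash, so a given key always lands on the same reducer
    return zlib.crc32(key.encode()) % num_reducers

def spill_to_local_disk(buffered_pairs, num_reducers, spill_dir="map_spill"):
    os.makedirs(spill_dir, exist_ok=True)
    partitions = [[] for _ in range(num_reducers)]
    for key, value in buffered_pairs:
        partitions[default_partition(key, num_reducers)].append((key, value))
    for r, pairs in enumerate(partitions):
        with open(os.path.join(spill_dir, f"part-{r}"), "w") as f:
            for key, value in sorted(pairs):
                f.write(f"{key}\t{value}\n")
    # a real TaskTracker would now tell the master where these files live

spill_to_local_disk([("w1", 1), ("w2", 1), ("w1", 1)], num_reducers=2)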
Reduce task
 A worker that has been assigned a reduce task:

    Reads, from every map task that has been executed, the pairs that correspond
      to it, based on the locations provided by the master.
    When all intermediate pairs have been retrieved, they are sorted by key;
      entries with the same key are grouped together.
    The reduce function is executed with the <key, group_of_values> pairs
      produced by the previous step as its input.
    The reduce task processes the input data and produces the final pairs.
    The output pairs are appended to a file in the local file system; when the
      reduce task completes, the file is made available in the distributed file
      system (a sketch of these steps follows below).
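A sketch of the reduce side just described: the pairs fetched from the mappers are sorted by key, equal keys are grouped, and the reduce function is applied to each group; the data and output file name are illustrative.

# Reduce side: sort fetched pairs by key, group equal keys, apply reduce, write output.
from itertools import groupby
from operator import itemgetter

def run_reducer(fetched_pairs, reduce_fn, output_path="part-r-00000"):
    fetched_pairs.sort(key=itemgetter(0))                # sort by key
    with open(output_path, "w") as out:
        for key, group in groupby(fetched_pairs, key=itemgetter(0)):
            values = [v for _, v in group]               # <key, group_of_values>
            out.write(f"{key}\t{reduce_fn(key, values)}\n")

run_reducer([("w2", 3), ("w1", 2), ("w1", 3)], lambda k, vs: sum(vs))
# part-r-00000 now contains: w1 -> 5, w2 -> 3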
Task Completion
When a worker has completed its task, it informs the master.
When all workers have informed the master, the master returns control to the
 user's original program.
Example

Figure: the Input is split into Part 1, Part 2 and Part 3; the Master assigns
each part to a Map worker, the intermediate results are shuffled to the Reduce
workers, and the reducers produce the final Output.
Example: Word count 1/3

 Objective: measure the frequency of appearance of words in a large set
  of documents
 Potential use case: discovery of popular URLs in a set of webserver
  logfiles
 Implementation plan:
    “Upload” documents on MapReduce

    Author a map function

    Author a reduce function

    Run a MapReduce task

    Retrieve results
Example: Word count 2/3
map(key, value):
// key: document name; value: text of document
     for each word w in value:
         emit(w, 1)

reduce(key, values):
// key: a word; values: an iterator over counts
   result = 0
   for each count v in values:
      result += v
   emit(key, result)
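The same logic as runnable Python in the Hadoop Streaming style (the mapper and reducer read lines from stdin and emit tab-separated pairs on stdout); this is a sketch, and the file names mapper.py / reducer.py are just conventions, not part of the slides.

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word of every input line
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python
# reducer.py: the framework delivers the input sorted by key, so equal words
# arrive consecutively and can be summed with a running counter
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Both scripts can be tested locally, without a cluster, with a pipeline such as:
cat docs.txt | python mapper.py | sort | python reducer.py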
Example: Word count 3/3

Figure: ten documents (d1, ‘w1 w2 w4’) through (d10, ‘w2 w1 w4 w3’) are split
across M=3 mappers. Each mapper emits per-word counts for its own documents,
e.g. the first mapper emits (w1,2), (w2,3), (w3,2), (w4,3). The shuffle routes
all counts for w1 and w2 to the first of R=2 reducers and those for w3 and w4
to the second, which emit the final totals (w1,7), (w2,15), (w3,8), (w4,7).
Extra functions
Locality

Move computation near the data: the master tries to have a task executed on a
 worker that is as "near" as possible to the input data, thus reducing
 bandwidth usage
  How does the master know? It asks the NameNode for the locations of the
  blocks that make up each input split.
Task distribution
The number of tasks is usually higher than the number of available workers
One worker can execute more than one task
This improves load balancing: if a single worker fails, its tasks can be
 recovered and redistributed to the other nodes faster.
Redundant task executions
Some tasks can be delayed (stragglers), delaying the overall job execution
The solution is to create copies of such tasks that are executed in parallel
 by 2 or more different workers (speculative execution)
A task is considered complete when the master is informed of its completion
 by at least one of these workers.
Partitioning
A user can specify a custom function that partitions the intermediate keys
 during the shuffle (a sketch follows below).
The type of the input and output data can be defined by the user and is not
 restricted to any particular form.
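A conceptual sketch of such a user-defined partition function: it decides which of the R reducers receives a given intermediate key. Partitioning URLs by their domain is an assumption chosen for illustration.

# Custom partition function sketch: send every key from the same web domain
# to the same reducer. The URL-shaped keys are an illustrative assumption.
from urllib.parse import urlparse
import zlib

def domain_partitioner(key, num_reducers):
    domain = urlparse(key).netloc or key
    return zlib.crc32(domain.encode()) % num_reducers

print(domain_partitioner("http://example.com/a", 4))
print(domain_partitioner("http://example.com/b", 4))   # same reducer as /a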
The input of a reducer is always sorted
Tasks can also be executed locally, in a serial manner (useful for debugging)
The master provides web interfaces for
  monitoring task progress
  browsing HDFS
When should I use it?
Good choice for jobs that can be broken into parallelizable tasks:
     Indexing/analysis of log files
     Sorting of large data sets
     Image processing

•
    Bad choice for serial or low-latency jobs:
    –
        Computing the number π to a precision of 1,000,000 digits
    –
        Computing the Fibonacci sequence
    –
        Replacing MySQL
Use cases 1/3
          Large-scale image conversions
          100 Amazon EC2 instances, 4 TB of raw TIFF data
          11 million PDFs in 24 hours for $240
        •
          Internal log processing
        •
          Reporting, analytics and machine learning
        •
          Cluster of 1110 machines, 8800 cores and 12 PB of raw storage
        •
          Open source contributors (Hive)
        •
          Store and process tweets, logs, etc.
        •
          Open source contributors (hadoop-lzo)
        •
          Large-scale machine learning
Use cases 2/3
        100,000 CPUs in 25,000 computers
        Content/ads optimization, search index
        Machine learning (e.g. spam filtering)
        Open source contributors (Pig)

       •
          Natural language search (through Powerset)
       •
          400 nodes in EC2, storage in S3
       •
          Open source contributors (!) to HBase
       •
          ElasticMapReduce service
       •
          On-demand elastic Hadoop clusters for the Cloud
Use cases 3/3
           ETL processing, statistics generation
           Advanced algorithms for behavioral analysis and targeting
       •
          Used for discovering People You May Know, and for other apps
       •
          3x30-node clusters, 16 GB RAM and 8 TB storage
       •
          Leading Chinese-language search engine
       •
          Search log analysis, data mining
       •
          300 TB per week
       •
          10- to 500-node clusters
Amazon ElasticMapReduce (EMR)
  A hosted Hadoop-as-a-service solution provided by AWS
 No need to manage or tune Hadoop clusters yourself
     ●
         upload your input data to S3, and store your output data on S3
     ●
         procure as many EC2 instances as you need and pay only for the
         time you use them
 Hive and Pig support makes it easy to write data-analysis scripts
 Java, Perl, Python, PHP, C++ for more sophisticated algorithms
 Integrates with DynamoDB (process combined data sets in S3 & DynamoDB)
 Support for HBase (NoSQL)
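A hedged sketch of launching such a job programmatically with boto3, the current AWS SDK for Python (which post-dates this talk); the bucket names, S3 paths, instance types and release label below are placeholders, not values from the slides.

# Launch an on-demand EMR cluster that runs a Hadoop Streaming word count.
# Bucket names, key paths, instance types and the release label are placeholders.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="wordcount-demo",
    ReleaseLabel="emr-6.15.0",                  # pick a currently supported release
    LogUri="s3://my-bucket/emr-logs/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate when the step finishes
    },
    Steps=[{
        "Name": "streaming-wordcount",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://my-bucket/code/mapper.py,s3://my-bucket/code/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", "s3://my-bucket/input/",
                "-output", "s3://my-bucket/output/",
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])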
Questions

  • 35. Amazon ElasticMapReduce (EMR) A hosted Hadoop-as-a-service solution provided by AWS  No need for management or tuning of Hadoop clusters ● upload your input data, store your output data on S3 ● procure as many EC2 instances as you need and only pay for the time you use them  Hive and Pig support makes it easy to write data analytical scripts  Java, Perl, Python, PHP, C++ for more sophisticated algorithms  Integrates to dynamoDB (process combined datasets in S3 & dynamoDB)  Support for HBase (NoSQL)