MapReduce in Cloud Computing

               Mohammad Mustaqeem
                  M.Tech 2nd Year
           Computer Science and Engineering
                Reg. No: 2011CS17




Department of Computer Science and Engineering
 Motilal Nehru National Institute of Technology
                  Allahabad
Contents

1 Introduction
  1.1 Map and Reduce in Functional Programming
  1.2 Structure of MapReduce Framework

2 Motivations

3 Description of First Paper
  3.1 Issues
  3.2 Approach used to Tackle the Issue
      3.2.1 Hadoop Distributed File System
      3.2.2 MapReduce Programming Model
  3.3 An Example: Word Count

4 Description of Second Paper
  4.1 Issues
  4.2 Approach used to Tackle the Issue
      4.2.1 System Model
      4.2.2 Architecture
      4.2.3 System Mechanism
  4.3 Example

5 Integration of both Papers

6 Conclusion
List of Figures

  1 HDFS Architecture
  2 Execution phase of a generic MapReduce application
  3 Word Count Execution
  4 System model described through the UML Class Diagram
  5 Behaviour of a generic node described by a UML State Diagram
  6 General Architecture of P2P-MapReduce
1     Introduction
Cloud computing is designed to provide on-demand resources or services over the Internet,
usually at the scale and with the reliability level of a data center. MapReduce is a software
framework that allows developers to write programs that process massive amounts of
unstructured data in parallel across a distributed cluster of processors or stand-alone
computers. It was developed at Google for indexing Web pages.
   The model is inspired by the map and reduce functions commonly used in functional
programming (in languages like LISP, Scheme, and Racket) [3], although their purpose in the
MapReduce framework is not the same as in their original forms.


1.1    Map and Reduce in Functional Programming

    • Map: The structure of the map function in Racket is -
      (map f list1) → list2    [4]
      where f is a function, and list1 and list2 are lists.
      It applies the function f to the elements of list1 and gives a list list2 containing the
      results of f in order.
      e.g. (map (lambda (x) (* x x)) ’(1 2 3 4 5)) → ’(1 4 9 16 25)

    • Reduce: There are two variations of the Reduce function in Racket. Their structures are -
      (foldl f init list1) → any
                and
      (foldr f init list1) → any      [4]

Like map, foldl applies a function to the elements of one or more lists. Whereas map combines
the return values into a list, foldl combines the return values in an arbitrary way that is
determined by f. In foldl, list1 is traversed from left to right, while in foldr, list1 is traversed
from right to left.
    e.g. (foldl - 0 ’(1 2 3 4 5 6)) → 3
    (foldr - 0 ’(1 2 3 4 5 6)) → -3


1.2    Structure of MapReduce Framework

The framework is divided into two parts:

    • Map: It distributes work to the different nodes in the distributed cluster.

• Reduce: It collects the work and resolves the results into a single value.

    The MapReduce Framework is fault-tolerant because each node in the cluster is expected
to report back periodically with completed work and status updates. If a node remains silent
for longer than the expected interval, a master node makes note and re-assigns the work to
other nodes.
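
    A minimal Python sketch of this failure-detection idea (my own illustration with
hypothetical names and an arbitrary timeout, not Hadoop's actual mechanism) is:

      import time

      HEARTBEAT_TIMEOUT = 10.0  # seconds a node may stay silent (illustrative value)

      class Master:
          def __init__(self):
              self.last_report = {}   # node id -> time of the node's last report
              self.assignments = {}   # node id -> list of assigned task ids

          def heartbeat(self, node_id):
              # Called whenever a worker reports back with completed work or status.
              self.last_report[node_id] = time.time()

          def reassign_silent_nodes(self, idle_nodes):
              # Re-assign the work of any node silent past the expected interval.
              now = time.time()
              for node_id, last in list(self.last_report.items()):
                  if now - last > HEARTBEAT_TIMEOUT and self.assignments.get(node_id):
                      if not idle_nodes:
                          break
                      tasks = self.assignments.pop(node_id)
                      target = idle_nodes.pop(0)
                      self.assignments.setdefault(target, []).extend(tasks)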


2     Motivations
Consider computations that process large amounts of raw data, such as crawled documents
or web request logs, to compute various kinds of derived data: inverted indices, various
representations of the graph structure of web documents, summaries of the number of
pages crawled per host, the set of most frequent queries in a given day, and so on. Most
such computations are conceptually straightforward. However, the input data is usually
large, and the computations have to be distributed across hundreds or thousands of
machines (a cluster) in order to finish in a reasonable amount of time. Moreover, some
machines may fail during the computation. So a solution is required that copes well with
these issues.
    The MapReduce framework is able to handle these issues, namely how to parallelize the
computation, distribute the data, and handle failures of nodes during the computation. Besides
these features, writing MapReduce programs is very easy: programmers have to define just
two functions, map and reduce, and the rest of the work is done by the MapReduce framework.


3     Description of First Paper
Gaizhen Yang, ”The Application of MapReduce in the Cloud Computing”


3.1    Issues

In cloud computing, commodity hardware needs to process enormous amounts of data
that cannot be handled by a single machine. Real-life examples of such processing are the
Reverse Web-Link Graph, web access analysis, Term-Vector per Host, the inverted index,
Count of URL Access Frequency, Distributed Sort, etc. [3]. Because of the size of these
data, we need to process them in parallel, in a distributed manner, on large clusters of
machines so that the processing can be done in a reasonable amount of time.




3.2     Approach used to Tackle the Issue

Hadoop is an open-source Java framework for processing and querying vast amounts of
data on large clusters of commodity hardware (a cloud), and it has been adopted by many
sites such as Amazon, Facebook, and Yahoo [1]. It takes advantage of a distributed system
infrastructure and processes enormous amounts of data in almost real time. It can also
tolerate node failures because it keeps multiple copies of the data.
   Hadoop has two main components - MapReduce and the Hadoop Distributed File System
(HDFS) [1].


3.2.1   Hadoop Distributed File System

HDFS provides the underlying support for distributed storage. As in a traditional file system,
we can create, delete, and rename files and directories. But these files and directories are
stored in a distributed fashion among the nodes. In HDFS, there are two types of nodes -
the Name Node and the Data Nodes [1]. The Name Node provides the file system services,
while the Data Nodes provide the actual storage. A Hadoop cluster contains only one Name
Node and multiple Data Nodes. In HDFS, files are divided into blocks, which are copied to
multiple Data Nodes to provide a reliable file system. The HDFS architecture is shown below -




                              Figure 1: HDFS Architecture


   • Name Node - The Name Node is a process that runs on a separate machine. It provides
     all the file system services, that is, file system management and maintenance of the
     file system tree. In reality, the Name Node stores only the meta-data of the files and
     directories. While programming, the programmer does not need the actual locations
     of the files; the files can be accessed through the Name Node, which does all the
     underlying work for the users.

   • Data Node - A Data Node is a process that runs on an individual machine of the cluster.
     The file blocks are stored in the local file system of these nodes. These nodes periodically
     send the meta-data of the stored blocks to the Name Node. Clients can write blocks
     directly to the Data Nodes. After writing, deleting, or copying blocks, the Data Nodes
     inform the Name Node.

The sequence of operations to write a file in HDFS is as follows (a small sketch of this write
path is given after the list) -

  1. The Client sends a request to write a file to the Name Node.

  2. According to the file size and the file block configuration, the Name Node returns the
     block placement information of its management section to the Client.

  3. The Client divides the file into multiple blocks. According to the Data Node address
     information, the Client writes the blocks to the Data Nodes.
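
The following Python sketch mimics this write path with in-memory stand-ins; the
NameNode and DataNode classes, their methods, and the demo file are hypothetical
illustrations, not the real HDFS API (only the 64 MB block size is a typical HDFS value).

      BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, a typical HDFS block size

      class NameNode:
          """Keeps only meta-data: which Data Node should hold each block."""
          def __init__(self, node_ids):
              self.node_ids = node_ids
              self.metadata = {}  # filename -> list of (block index, node id)

          def allocate(self, filename, size):
              n_blocks = (size + BLOCK_SIZE - 1) // BLOCK_SIZE
              plan = [(i, self.node_ids[i % len(self.node_ids)])
                      for i in range(n_blocks)]
              self.metadata[filename] = plan
              return plan

      class DataNode:
          """Holds the actual block contents in its local storage."""
          def __init__(self):
              self.blocks = {}

      def write_file(name_node, data_nodes, filename, data):
          # Steps 1-2: ask the Name Node where each block should go.
          plan = name_node.allocate(filename, len(data))
          # Step 3: split the file into blocks and write each to its Data Node.
          for i, node_id in plan:
              chunk = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
              data_nodes[node_id].blocks[(filename, i)] = chunk

      data_nodes = {"dn1": DataNode(), "dn2": DataNode(), "dn3": DataNode()}
      nn = NameNode(list(data_nodes))
      write_file(nn, data_nodes, "example.txt", b"x" * (130 * 1024 * 1024))  # 3 blocks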


3.2.2   MapReduce Programming Model

MapReduce is the key concept behind Hadoop. It is widely recognized as the most
important programming model for Cloud computing. MapReduce is a technique for dividing
work across a distributed system.
    In the MapReduce programming model, users have to define only two functions - a map
and a reduce function.
   The map function processes a (key, value) pair and returns a list of (intermediate key,
value) pairs:
   map (k1, v1) → list(k2, v2).
   The reduce function merges the intermediate values having the same intermediate key:
   reduce (k2, list(v2)) → list(v3).
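   As an illustration only (the function names are mine and word count serves as the
concrete instance; this is not Hadoop's actual API), the two signatures can be written in
Python as:

      # map (k1, v1) -> list(k2, v2): here k1 is a document name, v1 its contents
      def map_fn(k1, v1):
          return [(word, 1) for word in v1.split()]

      # reduce (k2, list(v2)) -> list(v3): merge all values of one intermediate key
      def reduce_fn(k2, v2_list):
          return [sum(v2_list)]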
    Execution phase of a generic MapReduce application - The following sequence of
actions occurs when a user submits a MapReduce job:

  1. The MapReduce library in the user program first splits the input files into M pieces.
     The size of these pieces typically ranges from 16 MB to 64 MB. It then starts copying
     these pieces onto multiple machines of the cluster, where copies of the program are
     started.

  2. Among these programs, one is the master and the others are workers, or slaves. There
     are in total M map tasks and R reduce tasks. The master picks idle workers and
     assigns each one a map or a reduce task.



Figure 2: Execution phase of a generic MapReduce application

  3. A map task reads the contents of the corresponding input split. It processes the
     key-value pairs of the input data and passes each pair to the user-defined Map function.
     The intermediate key-value pairs produced are buffered in memory.

  4. The buffered pairs are written to local disk, and the locations of these pairs are passed
     back to the master. The master then forwards these locations to the reduce workers.

  5. When a reduce worker gets these locations, it uses remote procedure calls to read the
     data from the map workers. After reading all the intermediate pairs, the reduce worker
     sorts them by the intermediate keys so that all occurrences of the same key are grouped
     together.

  6. For each intermediate key, the user-defined Reduce function is applied to the corre-
     sponding intermediate values. Finally, the output of the Reduce function is appended
     to the final output file.

  7. When all map tasks and reduce tasks have been completed, the master wakes up the
     user program. At this point, the MapReduce call in the user program returns back to
     the user code.

After successful execution of these steps, the output is stored in R output files (one per reduce
task).
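
    To make the data flow concrete, here is a self-contained Python sketch that simulates
these phases in memory - splitting, mapping, shuffling by key, sorting, and reducing. It
illustrates the model only; all names are hypothetical and this is not Hadoop code.

      from collections import defaultdict

      def run_mapreduce(map_fn, reduce_fn, splits, R=2):
          # Map phase: each input split yields intermediate (key, value) pairs,
          # partitioned by hash(key) % R so equal keys meet at the same reducer.
          buckets = [defaultdict(list) for _ in range(R)]
          for k1, v1 in splits:
              for k2, v2 in map_fn(k1, v1):
                  buckets[hash(k2) % R][k2].append(v2)
          # Reduce phase: one output list per reduce task, sorted by key.
          outputs = []
          for bucket in buckets:
              out = []
              for k2 in sorted(bucket):
                  for v3 in reduce_fn(k2, bucket[k2]):
                      out.append((k2, v3))
              outputs.append(out)
          return outputs  # R "output files", one per reduce task

With the word-count map_fn and reduce_fn sketched earlier, calling
run_mapreduce(map_fn, reduce_fn, [("f1", "the quick brown fox")]) returns the per-word
counts spread across R output lists, mirroring the R output files.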


3.3      An Example: Word Count

A simple MapReduce program can be written to determine how many times different words
appear in a set of files.
   Let the content of a file be -
      the quick
      brown fox
      the fox ate
      the mouse
      how now
      brown cow
   The whole MapReduce process is depicted in the figure below -




                              Figure 3: Word Count Execution


  1. The MapReduce library splits the file content into three parts:

        the quick brown fox
        the fox ate the mouse
        how now brown cow

     After splitting the data, it starts up many copies of the program on a cluster of
     machines.

  2. The master assigns the map task to the 3 map workers.
     The code of the map function is like -
   mapper (filename, file-contents):
     for each word in file-contents:
       emit (word, 1)

  3. The Map function is applied to each split, generating the following intermediate
     key-value pairs:

        (the, 1), (quick, 1), (brown, 1), (fox, 1)
        (the, 1), (fox, 1), (ate, 1), (the, 1), (mouse, 1)
        (how, 1), (now, 1), (brown, 1), (cow, 1)
  4. When a map worker is done, it reports to the master and gives the location of its
     output.

  5. When all the map tasks are done, the master starts reduce tasks on the idle machines
     and gives them the locations from which the reduce workers copy the intermediate
     key-value pairs.

  6. After receiving all the intermediate key-value pairs, each reduce worker sorts these
     pairs to group them on the basis of the intermediate keys.

  7. At this point, the reduce function is applied to each group of intermediate key-value
     pairs. The pseudocode of the Reduce function is -
   reducer (word, values):
     sum = 0
     for each value in values:
       sum = sum + value
     emit (word, sum)

8. The final output of the Reduce function is -
      brown, 2
      fox, 2
      how, 1
      now, 1
      the, 3
      ate, 1
      cow, 1
      mouse, 1
      quick, 1
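
   For reference, a runnable Python version of this example (a plain in-memory sketch,
not Hadoop code) reproduces these counts:

      from collections import Counter

      lines = ["the quick", "brown fox", "the fox ate",
               "the mouse", "how now", "brown cow"]

      def mapper(contents):
          # Emit (word, 1) for every word, as in the pseudocode above.
          return [(word, 1) for line in contents for word in line.split()]

      def reducer(pairs):
          # Sum the values of each intermediate key.
          counts = Counter()
          for word, one in pairs:
              counts[word] += one
          return counts

      print(reducer(mapper(lines)))
      # Counter({'the': 3, 'brown': 2, 'fox': 2, 'quick': 1, 'ate': 1,
      #          'mouse': 1, 'how': 1, 'now': 1, 'cow': 1})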


4     Description of Second Paper
Fabrizio Marozzo, Domenico Talia, Paolo Trunfio, "P2P-MapReduce: Parallel
data processing in dynamic Cloud environments"


4.1     Issues

MapReduce is a programming model that allows developers to write programs that process
massive amounts of unstructured data in parallel across a distributed cluster of machines.
In a cloud, nodes may leave and join at runtime, so a system is required that can handle
such conditions. The MapReduce implementation discussed so far is based on a centralized
architecture and cannot cope with a dynamic infrastructure, in which nodes may join and
leave the network at high rates. This paper describes an adaptive P2P-MapReduce system
that can handle the situation in which a master node fails.


4.2     Approach used to Tackle the Issue

The main goal of P2P-MapReduce is to provide an infrastructure in which nodes may join
and leave the cluster without affecting the MapReduce functionality. This is required because
in a cloud environment there are high levels of churn. To achieve this goal, P2P-MapReduce
adopts a peer-to-peer model in which a wide set of autonomous nodes can act either as
masters or as slaves. The master and slave roles are exchanged dynamically, in such a way
that the ratio between the number of masters and the number of slaves remains constant.
    In P2P-MapReduce, to prevent the loss of computation in case of a master failure, there
are some backup masters for each master. The master responsible for a job J, referred to as
the primary master for J, dynamically updates the job state on its backup nodes, which are
referred to as the backup masters for J. If at some instant a primary master fails, its place
is taken by one of its backup masters.


4.2.1     System Model

Here, the system model of P2P-MapReduce describes the characteristics of jobs, tasks, users,
and nodes at an abstract level. The UML class diagram is given below:




             Figure 4: System model described through the UML Class Diagram.



   • Job: A job can be modelled as follows:

        job = (jobId, code, input, output, M, R)

     where jobId is a job identifier, code includes the map and reduce functions, input and
     output represent the locations of the input and output data respectively, and M and R
     are the numbers of map tasks and reduce tasks respectively.

   • Task: A task can be modelled as follows:

        task = (taskId, jobId, type, code, input, output)

     where taskId and jobId are the task identifier and job identifier respectively, type can
     be either MAP or REDUCE, code represents the map or reduce function (depending on
     the task type), and input and output represent the locations of the input and output
     data of the task.

   • User: A user is modelled as a pair of the form:

        user = (userId, userJobList)

     where userId is the user identifier and userJobList is the list of jobs submitted by the
     user.

   • Node: A node is modelled as the following tuple:

        node = (nodeId, role, primaryJobList, backupJobList, slaveTaskList)

     where nodeId is the node identifier, role identifies the node's role (MASTER or SLAVE),
     primaryJobList is the list of jobs managed by the node as a primary master,
     backupJobList is the list of jobs for which it is acting as a backup master, and
     slaveTaskList is empty if the node's role is MASTER; otherwise it contains the list of
     (map or reduce) tasks assigned to the node.

   • PrimaryJobType: The primaryJobList contains tuples of a primaryJobType defined as:

        primaryJobType = (job, userId, jobStatus, jobTaskList, backupMasterList)

     where job is a job descriptor, userId is the user identifier, jobStatus is the current
     status of the job, jobTaskList is the list of tasks contained in the job, and
     backupMasterList is the list of backup masters for the job.

   • JobTaskType: The jobTaskList contains tuples of a jobTaskType defined as:

        jobTaskType = (task, slaveId, taskStatus)

     where task is a task descriptor, slaveId is the identifier of the slave node responsible
     for the task, and taskStatus is the current status of the task.

   • BackupJobType: The backupJobList contains tuples of a backupJobType defined as:

        backupJobType = (job, userId, jobStatus, jobTaskList, backupMasterList, primaryId)

     BackupJobType differs from primaryJobType in the presence of an additional field,
     primaryId, which represents the identifier of the primary master associated with the
     job.

   • SlaveTaskType: The slaveTaskList contains tuples of a slaveTaskType defined as:

        slaveTaskType = (task, primaryId, taskStatus)

     where task is a task descriptor, primaryId is the identifier of the primary master
     associated with the task, and taskStatus contains its status.
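
A minimal Python rendering of the core of this model (my own illustrative dataclasses,
not code from the paper) could look like:

      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class Job:
          jobId: str
          code: str      # the map and reduce functions
          input: str     # location of the input data
          output: str    # location of the output data
          M: int         # number of map tasks
          R: int         # number of reduce tasks

      @dataclass
      class Task:
          taskId: str
          jobId: str
          type: str      # "MAP" or "REDUCE"
          code: str
          input: str
          output: str

      @dataclass
      class Node:
          nodeId: str
          role: str      # "MASTER" or "SLAVE"
          primaryJobList: List[Job] = field(default_factory=list)
          backupJobList: List[Job] = field(default_factory=list)
          slaveTaskList: List[Task] = field(default_factory=list)  # empty for masters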


4.2.2     Architecture

There are three types of nodes in the P2P-MapReduce architecture: user, master, and slave.
Master nodes and slave nodes form two logical peer-to-peer networks, referred to as M-net
and S-net, respectively. The composition of M-net and S-net changes dynamically because,
as described earlier, the roles of master and slave nodes are exchanged.
A user node submits a MapReduce job to one of the available master nodes. The selection
of the master node is based on the current workload of the available master nodes.
Master nodes are at the core of the system. They perform three types of operations:
management, recovery, and coordination. A master node that is acting as the primary
master for one or more jobs executes the management operation. A master node that is
acting as a backup master for one or more jobs executes the recovery operation. The
coordination operation changes slaves into masters and vice versa, so as to keep the desired
master/slave ratio (a sketch of this rebalancing follows below).
   A slave executes the tasks that are assigned to it by one or more primary masters.
    Jobs and tasks are managed by processes called Job Managers and Task Managers,
respectively. For each managed job, a primary master runs one Job Manager, while a slave
runs one Task Manager for each assigned task. In addition, a master runs one Backup Job
Manager for each job for which it acts as a backup master.
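
   As an illustration of the coordination operation (my own sketch with an assumed target
ratio; the paper's actual algorithm may differ), the coordinator could rebalance roles like
this:

      TARGET_RATIO = 0.1  # assumed desired fraction of masters in the network

      def rebalance(masters, slaves):
          # Promote slaves or demote masters until the desired ratio is restored.
          # Assumes every node passed in is idle and thus allowed to switch role.
          want = max(1, round(TARGET_RATIO * (len(masters) + len(slaves))))
          while len(masters) < want and slaves:
              masters.append(slaves.pop())   # slave -> master
          while len(masters) > want and len(masters) > 1:
              slaves.append(masters.pop())   # idle master -> slave
          return masters, slaves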


4.2.3     System Mechanism

The behaviour of a generic node can be understood through a UML state diagram, which
shows, along with the states, the events that change the state of a node. The UML state
diagram of a node in the P2P-MapReduce architecture is given below:




Figure 5: Behaviour of a generic node described by a UML State Diagram.



    The state diagram shows two macro-states, SLAVE and MASTER, which are the two roles
that a node can have. The SLAVE macro-state has three states, IDLE, CHECK MASTER
and ACTIVE, which represent respectively: a slave waiting for task assignment; a slave
checking the existence of at least one master in the network; and a slave executing one or
more tasks. The MASTER macro-state is modelled with three parallel macro-states, which
represent the different roles a master can perform concurrently: possibly acting as the
primary master for one or more jobs (MANAGEMENT); possibly acting as a backup master
for one or more jobs (RECOVERY); and coordinating the network for maintenance purposes
(COORDINATION). The MANAGEMENT macro-state contains two states: NOT PRIMARY, which
represents a master node currently not acting as the primary master for any job, and
PRIMARY, which, in contrast, represents a master node currently managing at least one job
as the primary master. Similarly, the RECOVERY macro-state includes two states: NOT
BACKUP (the node is not managing any job as a backup master) and BACKUP (at least one
job is currently being backed up on this node). Finally, the COORDINATION macro-state
includes four states: NOT COORDINATOR (the node is not acting as the coordinator),
COORDINATOR (the node is acting as the coordinator), and WAITING COORDINATOR and
ELECTING COORDINATOR for nodes currently participating in the election of the new
coordinator. The combination of the concurrent states [NOT PRIMARY,
NOT BACKUP, NOT COORDINATOR] represents the abstract state MASTER.IDLE. The
transition from the master to the slave role is allowed only for masters in the MASTER.IDLE
state. Similarly, the transition from the slave to the master role is allowed for slaves that
are not in the ACTIVE state.
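
   A simplified Python sketch of these transition rules (my own reading of the diagram,
with hypothetical names; the three parallel macro-states are reduced to two counters and a
flag) is:

      from dataclasses import dataclass

      @dataclass
      class MasterState:
          primary_jobs: int = 0      # MANAGEMENT: PRIMARY when > 0
          backup_jobs: int = 0       # RECOVERY: BACKUP when > 0
          coordinator: bool = False  # COORDINATION: COORDINATOR when True

          def is_idle(self):
              # [NOT PRIMARY, NOT BACKUP, NOT COORDINATOR] == MASTER.IDLE
              return (self.primary_jobs == 0 and self.backup_jobs == 0
                      and not self.coordinator)

      def master_to_slave_allowed(state: MasterState) -> bool:
          # Allowed only for masters in the MASTER.IDLE state.
          return state.is_idle()

      def slave_to_master_allowed(active_tasks: int) -> bool:
          # Allowed only for slaves that are not in the ACTIVE state.
          return active_tasks == 0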



4.3    Example

The whole system mechanism can be understood through a simple example, described by
the following figure:




                    Figure 6: General Architecture of P2P-MapReduce.

    Figure 6 shows that in total three jobs have been submitted: one job by User1 (Job1) and
two jobs by User2 (Job2 and Job3). For Job1, Node1 is the primary master, and Node2 and
Node3 are backup masters. Job1 is composed of five tasks: two of them are assigned to
Node4, and one each to Node7, Node9 and Node11.


   The following recovery procedure takes place when the primary master Node1 fails:

   • Backup masters Node2 and Node3 detect the failure of Node1 and start a distributed
     procedure to elect the new primary master among them.

   • Assuming that Node3 is elected as the new primary master, Node2 continues to play
     the backup role and, to keep the desired number of backup masters active (two,
     in this example), another backup node is chosen by Node3. Then, Node3 binds to the
     connections that were previously associated with Node1, and proceeds to manage the
     job using its local replica of the job state.

   As soon as the job is completed, the (new) primary master notifies the user node that
submitted the job of the result.
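
   A toy Python sketch of this recovery procedure (an assumption-laden illustration: the
paper does not fix the election rule, so here the surviving backup with the highest node id
wins, which happens to match Node3 in the example):

      def recover(backups, spare_nodes, job_state, n_backups=2):
          # Elect a new primary among the surviving backup masters.
          new_primary = max(backups)     # assumed rule: highest node id wins
          backups.remove(new_primary)
          # Restore the desired number of backup masters.
          while len(backups) < n_backups and spare_nodes:
              backups.append(spare_nodes.pop())
          # The new primary resumes the job from its local replica of the job state.
          return new_primary, backups, dict(job_state)

      new_primary, backups, state = recover(
          ["Node2", "Node3"], ["Node5"], {"Task1": "ACTIVE"})
      # -> new_primary == "Node3", backups == ["Node2", "Node5"]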


5    Integration of both Papers

                     First Paper                       Second Paper
 Issues              To perform data-intensive        To design a peer-to-peer
                     computation in a Cloud           MapReduce system that can
                     environment in a reasonable      handle any node's failure,
                     amount of time.                  including the master node's
                                                      failure.
 Approaches Used     The simple MapReduce imple-      A peer-to-peer architecture is
                     mentation (presented by          used to handle all the dynamic
                     Google), based on the Master-    churn in a cluster.
                     Slave model, is used. This
                     implementation is known as
                     Hadoop.
 Advantages          Hadoop is scalable, reliable,    P2P-MapReduce can manage
                     and distributed, able to         node churn, master failures,
                     handle enormous amounts of       and job recovery in an
                     data. It can process big data    effective way.
                     in almost real time.

                        Table 1: Comparison between the two papers.



6    Conclusion
MapReduce is scalable and reliable, and it exploits distributed systems to perform efficiently
in a cloud environment. P2P-MapReduce is a novel approach to handle the real-world
problems faced by data-intensive computing. P2P-MapReduce is more reliable than the basic
MapReduce framework because it is able to manage node churn, master failures, and job
recovery in a decentralized but effective way. Thus, cloud-based programming models are
likely to be a future trend in the programming field.




References
[1] Gaizhen Yang, ”The Application of MapReduce in the Cloud Computing”, International
    Symposium on Intelligence Information Processing and Trusted Computing (IPTC), Oc-
    tober 2011, pp. 154-156, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=
    &arnumber=6103560.

[2] Fabrizio Marozzo, Domenico Talia, Paolo Trunfio, "P2P-MapReduce: Parallel data pro-
    cessing in dynamic Cloud environments", Journal of Computer and System Sciences,
    vol. 78, Issue 5, September 2012, pp. 1382-1402, http://dl.acm.org/citation.cfm?id=
    2240494.

[3] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: simplified data processing on large
    clusters", OSDI'04: Proceedings of the 6th Conference on Symposium on Operating
    Systems Design & Implementation, vol. 6, 2004, pp. 10-10, www.usenix.org/event/
    osdi04/tech/full_papers/dean/dean.pdf and http://dl.acm.org/citation.cfm?
    id=1251254.1251264.

[4] The Racket Guide http://docs.racket-lang.org/guide/.

[5] Hadoop Tutorial - YDN http://developer.yahoo.com/hadoop/tutorial/module4.
    html.

[6] http://readwrite.com/2012/10/15/why-the-future-of-software-and-apps-is-serverless.

[7] F. Marozzo, D. Talia, P. Trunfio, "A Peer-to-Peer Framework for Supporting MapReduce
    Applications in Dynamic Cloud Environments", in: N. Antonopoulos, L. Gillam (eds.),
    Cloud Computing: Principles, Systems and Applications, Springer, Chapter 7, pp. 113-125,
    2010.

[8] IBM developerWorks, Using MapReduce and load balancing on the cloud, http://www.
    ibm.com/developerworks/cloud/library/cl-mapreduce/.




