2. INTRODUCTION
ABSTRACT
EXISTING SYSTEM
PROPOSED SYSTEM
SYSTEM ARCHITECTURE
RESULT AND DISCUSSION
CONCLUSION
3. INTRODUCTION:
Google MapReduce: a software framework for
large-scale distributed computing on large
amounts of data.
Hadoop: an open-source implementation of the
Google MapReduce programming model.
MapReduce has two phases:
the Map phase and the Reduce phase.
This work provisions a cache layer for efficiently
identifying and accessing cache items.
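The two phases above can be sketched in plain Python; this is a minimal word-count illustration of the Map/shuffle/Reduce dataflow, not Hadoop's actual API.

```python
# Minimal sketch of the two MapReduce phases (word count).
# Plain-Python stand-ins for map, shuffle/sort, and reduce.
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input split."""
    pairs = []
    for doc in documents:
        for word in doc.split():
            pairs.append((word, 1))
    return pairs

def shuffle(pairs):
    """Shuffle/sort: group values by key before the reduce phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
# counts == {"a": 2, "b": 2, "c": 1}
```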
4. EXISTING SYSTEM:
MapReduce provides a standardized framework for
large-scale distributed computation.
Intermediate data are thrown away, since
MapReduce is unable to utilize them.
LIMITATIONS:
Inefficient incremental processing.
Duplicate computations are performed.
No mechanism to identify duplicate
computations and accelerate job execution.
5. EXISTING SYSTEM(Cont.):
The input is split and fed to workers in the map phase.
Intermediate files generated in the map phase are
shuffled and sorted by the system and fed to
workers in the reduce phase.
Final results are computed by multiple reducers
and written to disk.
6. PROPOSED SYSTEM
Dache: a data-aware cache system for
big-data applications using the
MapReduce framework.
Dache's aim: extend the MapReduce
framework and provision a cache layer
for efficiently identifying and accessing
cache items in a MapReduce job.
7. PROPOSED SYSTEM(Cont.):
Dache identifies the source input from which a cache
item is obtained, and the operations applied to that
input, so that a cache item produced by the
workers in the map phase is indexed properly.
The partition operation is applied in the map phase.
9. MAP REDUCE ARCHITECTURE
1. Clients submit jobs to the Job Tracker.
2. The Job Tracker talks to the Name Node.
3. The Job Tracker creates an execution plan.
4. The Job Tracker submits work to Task Trackers.
5. Task Trackers report progress via heartbeats.
6. The Job Tracker manages the phases.
7. The Job Tracker updates the status.
10. MAP PHASE CACHE DESCRIPTION:
A cache item is a piece of cached data stored in the
Distributed File System (DFS).
The content of a cache item is described by the
original data and the operations applied to it.
A 2-tuple: {Origin, Operation}.
Origin: the name of a file in the DFS.
Operation: a linear list of the operations
performed on the origin file.
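The {Origin, Operation} description can be sketched as a small data structure; the names (`CacheItem`, `cache_key`) and the DFS path are illustrative assumptions, not Dache's actual API.

```python
# Sketch of a map-phase cache item description and its index key.
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheItem:
    origin: str         # name of the source file in the DFS
    operations: tuple   # linear list of operations applied to the origin

def cache_key(item):
    """Index a cache item by its origin file and the operations applied."""
    return (item.origin, item.operations)

# Hypothetical example: a word-count map task over one file split.
item = CacheItem(origin="/dfs/input/part-00000",
                 operations=("split", "word_count_map"))
key = cache_key(item)
```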
11. REDUCE PHASE CACHE DESCRIPTION:
The input for the reduce phase is also a list of key-value
pairs, where the value may itself be a list of values.
The original input and the applied operations are
required.
The original input is obtained by storing the intermediate
results of the map phase in the DFS.
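Because the map phase's intermediate results are stored in the DFS, that stored file can serve as the origin of a reduce-phase cache item. A small sketch, with a made-up DFS path layout:

```python
# Sketch of forming a reduce-phase cache description: the map output
# stored in the DFS becomes the Origin of the reduce cache item.
# The path scheme and names here are illustrative assumptions.
def reduce_cache_description(job_id, reduce_ops):
    """Describe a reduce cache item by its DFS origin and applied operations."""
    origin = f"/dfs/intermediate/{job_id}/map-output"  # hypothetical path
    return {"origin": origin, "operations": list(reduce_ops)}

desc = reduce_cache_description("job-42", ["sum_counts"])
```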
12. Job Types and Cache Organization Relation:
When processing each file split, the cache manager
reports the previous file-splitting scheme used in its
cache item.
13. Job Types and Cache Organization Relation:
To find words starting with ‘ab’, we reuse the cached
results for words starting with ‘a’, and also add the
new result to the cache.
The best match among overlapping results is chosen
(‘ab’ instead of ‘a’).
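The best-match rule above can be sketched as picking the longest cached prefix that the query extends; the cache layout is an assumption for illustration.

```python
# Sketch of choosing the best match among overlapping cache items
# for a prefix query: prefer the tightest (longest) cached prefix.
def best_cached_prefix(cache, query_prefix):
    """Return the longest cached prefix that the query extends, if any."""
    matches = [p for p in cache if query_prefix.startswith(p)]
    return max(matches, key=len) if matches else None

# Cached results for words starting with 'a' and with 'ab'.
cache = {"a": ["ab", "abc", "ad"], "ab": ["ab", "abc"]}
# For the query 'ab', both 'a' and 'ab' overlap; 'ab' is the tighter match.
best = best_cached_prefix(cache, "ab")
```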
14. CACHE REQUEST AND REPLY:
MAP CACHE:
Cache requests must be sent out before the file-splitting
phase.
The Job Tracker issues cache requests to the cache
manager.
The cache manager replies with a list of cache descriptions.
15. CACHE REQUEST AND REPLY:
REDUCE CACHE:
First, the requested cache item is compared with the
cached items in the cache manager's database.
The cache manager identifies overlaps between the original
input files of the requested cache item and the stored cache items.
A linear scan is used here.
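The linear-scan overlap check can be sketched as follows; the database layout (a list of `(item_id, input_files)` tuples) is an assumption for illustration.

```python
# Sketch of the reduce-cache lookup: a linear scan over the cache
# manager's database, comparing the requested item's input files
# with each stored item's input files.
def find_overlaps(requested_inputs, stored_items):
    """Linear scan: return stored items whose input files overlap the request."""
    requested = set(requested_inputs)
    overlaps = []
    for item_id, inputs in stored_items:
        shared = requested & set(inputs)
        if shared:
            overlaps.append((item_id, sorted(shared)))
    return overlaps

stored = [("item-1", ["f1", "f2"]), ("item-2", ["f3"]), ("item-3", ["f2", "f4"])]
hits = find_overlaps(["f2", "f3"], stored)
```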
16. LIFETIME MANAGEMENT OF CACHE ITEMS
Cache Manager:
Determines how long a cache item can be
kept in the DFS.
Two types of policies:
• Fixed storage quota:
Least Recently Used (LRU) eviction is employed.
• Optimal utility:
Estimates the computation time t_s saved by caching
a cache item for a given amount of time t_a.
Expense = P_storage × S_cache × t_a
Save = P_computation × R_duplicate × t_s
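The optimal-utility policy keeps a cache item only while the computation it saves outweighs its storage expense. A numeric sketch; the prices and times below are made-up illustrative values, not figures from the work.

```python
# Sketch of the optimal-utility lifetime decision:
# keep the item while Save > Expense.
def should_keep(p_storage, s_cache, t_a, p_computation, r_duplicate, t_s):
    expense = p_storage * s_cache * t_a          # cost of storing the item
    save = p_computation * r_duplicate * t_s     # value of computation avoided
    return save > expense

# Illustrative values: cheap storage, high duplicate rate -> keep the item.
# expense = 0.01 * 2.0 * 10.0 = 0.2; save = 0.5 * 0.8 * 5.0 = 2.0
keep = should_keep(p_storage=0.01, s_cache=2.0, t_a=10.0,
                   p_computation=0.5, r_duplicate=0.8, t_s=5.0)
```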
17. RESULT AND DISCUSSION:
CPU utilization of Hadoop and Dache is compared
for two programs:
Tera-sort program.
Word-count program.
20. CONCLUSION:
Dache requires minimal changes to the original MapReduce
programming model.
Application code requires only slight changes in order to
utilize Dache.
Dache is implemented in Hadoop by extending the relevant
components.
Testbed experiments show that it can eliminate all the
duplicate tasks in incremental MapReduce jobs.
Execution time and CPU utilization are minimized.
21. REFERENCES:
J. Dean and S. Ghemawat, "MapReduce: Simplified data
processing on large clusters," Communications of the ACM,
vol. 51, no. 1, pp. 107-113, 2008.
Hadoop, http://Hadoop.apache.org, 2013.
Cache algorithms, http://en.wikipedia.org/wiki/Cache_algorithms, 2013.
Java programming language, http://www.java.com/, 2013.
Google compute engine,
http://cloud.google.com/products/computeengine.html,
2013.