Dache: A Data-Aware Caching for Big-Data Applications Using the MapReduce Framework
1. Presented by:
Jithin Raveendran
S7 IT, Roll No: 31
Guided by:
Prof. Remesh Babu
2. Big Data???
The buzzword "big data" refers to large-scale distributed data processing applications that operate on exceptionally large amounts of data.
About 2.5 zettabytes of data are generated every day; so much that 90% of the data in the world today has been created in the last two years alone.
4. Case Study with Hadoop MapReduce
Hadoop:
An open-source software framework from Apache for distributed processing of large data sets across clusters of commodity servers.
Designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
Inspired by:
Google MapReduce
GFS (Google File System)
Its two core components are HDFS and Map/Reduce.
5. Apache Hadoop has two pillars
• HDFS
• Self-healing
• High-bandwidth clustered storage
• MapReduce
• Retrieval system
• The Mapper function tells the cluster which data points we want to retrieve
• The Reducer function then takes all the data and aggregates it
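The mapper/reducer flow above can be sketched in a few lines. This is a minimal, illustrative word-count in Python, not Hadoop's actual Java API; the function names and the in-memory shuffle step are simplifications for exposition.

```python
# Minimal sketch of the map/reduce flow: the mapper emits data points,
# the shuffle groups them by key, the reducer aggregates each group.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Aggregate all counts emitted for the same word.
    return (word, sum(counts))

def run_job(lines):
    # Shuffle/sort: group intermediate pairs by key, then reduce each group.
    pairs = sorted(p for line in lines for p in mapper(line))
    return dict(reducer(k, (c for _, c in g))
                for k, g in groupby(pairs, key=itemgetter(0)))

# run_job(["a b a", "b c"]) -> {"a": 2, "b": 2, "c": 1}
```

In real Hadoop the intermediate pairs are written to disk and shuffled across the network between Task Trackers; here that step is just a local sort.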
7. HDFS - Architecture
Name Node:
The centerpiece of an HDFS file system.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file.
Responds to successful requests by returning a list of relevant DataNode servers where the data lives.
Data Node:
Stores data in the Hadoop File System.
A functional file system has more than one DataNode, with data replicated across them.
8. HDFS - Architecture
Secondary Name Node:
Acts as a checkpoint for the NameNode.
Takes a snapshot of the NameNode and uses it whenever a backup is needed.
HDFS Features:
Rack awareness
Reliable storage
High throughput
10. MapReduce Architecture
1. Clients submit jobs to the Job Tracker
2. The Job Tracker talks to the NameNode
3. The Job Tracker creates an execution plan
4. The Job Tracker submits work to Task Trackers
5. Task Trackers report progress via heartbeats
6. The Job Tracker manages the phases
7. The Job Tracker updates the status
11. Current System
MapReduce is used to provide a standardized framework for large-scale distributed data processing.
Limitation:
Inefficiency in incremental processing.
12. Proposed System
Dache: a data-aware cache system for big-data applications using the MapReduce framework.
Dache aims to extend the MapReduce framework and provision a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
13. Related Work
Google Bigtable - handles incremental processing
Google Percolator - an incremental processing platform
RAMCloud - a distributed computing platform that keeps data in RAM
14. Technical Challenges to be Addressed
Cache description scheme:
Data-aware caching requires each data object to be indexed by its content.
It must provide customizable indexing that enables applications to describe their operations and the content of their generated partial results. This is a nontrivial task.
Cache request and reply protocol:
The size of the aggregated intermediate data can be very large. When such data is requested by other worker nodes, determining how to transport this data becomes complex.
15. Cache Description
Map phase cache description scheme
"Cache" refers to the intermediate data produced by worker nodes/processes during the execution of a MapReduce task.
A piece of cached data is stored in a Distributed File System (DFS).
The content of a cache item is described by the original data and the operations applied to it:
2-tuple: {Origin, Operation}
Origin: the name of a file in the DFS.
Operation: a linear list of operations performed on the Origin file.
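The 2-tuple {Origin, Operation} can be pictured as a small data structure. This is an illustrative sketch only; the class and field names are invented here, not taken from Dache's implementation.

```python
# Sketch of the {Origin, Operation} cache description: a DFS file name
# plus the linear list of operations applied to it. Names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheDescription:
    origin: str            # name of the original file in the DFS
    operations: tuple = () # linear list of operations applied to origin

# Example: a cache item holding word counts derived from one input split.
desc = CacheDescription(origin="/dfs/input/part-00000",
                        operations=("tokenize", "count"))
```

Making the description immutable (frozen) lets it serve as a lookup key when the cache manager matches requests against stored items.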
16. Cache Description
Reduce phase cache description scheme
The input for the reduce phase is also a list of key-value pairs, where the value could be a list of values.
Both the original input and the applied operations are required.
The original input is obtained by storing the intermediate results of the map phase in the DFS.
17. Protocol
Relationship between job types and cache organization
• When processing each file split, the cache manager reports the previous file-splitting scheme used in its cache item.
18. Protocol
Relationship between job types and cache organization
To find words starting with 'ab', we use the cached results for words starting with 'a', and also add the new result to the cache.
Find the best match among overlapped results [choose 'ab' instead of 'a'].
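The "best match among overlapped results" rule can be sketched as follows. This is a hypothetical helper illustrating the idea, under the assumption that cached results are keyed by query prefix; it is not Dache's actual matching code.

```python
# Sketch of best-match selection: among cached result prefixes that cover
# the new query, prefer the longest one ('ab' beats 'a' when querying 'abc'),
# since it leaves the least remaining work.
def best_cached_prefix(query, cached_prefixes):
    matches = [p for p in cached_prefixes if query.startswith(p)]
    return max(matches, key=len, default=None)

# best_cached_prefix("abc", ["a", "ab", "b"]) -> "ab"
```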
19. Protocol
Cache item submission
Mapper and reducer nodes/processes record cache items into their local storage space.
A cache item should be put on the same machine as the worker process that generates it.
A worker node/process contacts the cache manager each time before it begins processing an input data file.
The worker process receives the tentative description and fetches the cache item.
20. Lifetime management of cache items
The cache manager determines how long a cache item can be kept in the DFS.
Two types of policies for determining the lifetime of a cache item:
1. Fixed storage quota
• Least Recently Used (LRU) eviction is employed
2. Optimal utility
• Estimates the computation time, ts, saved by caching a cache item for a given amount of time, ta.
• ts and ta are used to derive the monetary gain and cost.
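The fixed-storage-quota policy with LRU eviction can be sketched like this. The class, item ids, and byte sizes are illustrative; this is the textbook LRU-under-quota pattern, not the cache manager's real code.

```python
# Sketch of the fixed-storage-quota policy: track item sizes in access
# order and evict the least recently used items once the quota is exceeded.
from collections import OrderedDict

class FixedQuotaCache:
    def __init__(self, quota_bytes):
        self.quota = quota_bytes
        self.items = OrderedDict()  # item id -> size, least recent first
        self.used = 0

    def access(self, item_id, size):
        if item_id in self.items:
            self.items.move_to_end(item_id)  # mark as recently used
            return
        self.items[item_id] = size
        self.used += size
        while self.used > self.quota:        # LRU eviction under quota
            _, evicted_size = self.items.popitem(last=False)
            self.used -= evicted_size

cache = FixedQuotaCache(quota_bytes=10)
cache.access("item-a", 4)
cache.access("item-b", 4)
cache.access("item-a", 4)   # touch item-a so item-b becomes LRU
cache.access("item-c", 4)   # exceeds quota; item-b is evicted
```

The optimal-utility policy would replace the eviction rule with a comparison of the monetary gain (derived from ts) against the storage cost over ta.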
21. Cache request and reply
Map Cache:
Cache requests must be sent out before the file-splitting phase.
The Job Tracker issues cache requests to the cache manager.
The cache manager replies with a list of cache descriptions.
Reduce Cache:
• First, compare the requested cache item with the cached items in the cache manager's database.
• The cache manager identifies the overlaps between the original input files of the requested cache and the stored cache.
• A linear scan is used here.
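The reduce-side linear scan can be sketched as a simple set-intersection loop. The function and data shapes are assumptions for illustration; the point is only that every stored description is compared in turn, which is what makes it a linear scan.

```python
# Sketch of the reduce-cache linear scan: compare the requested item's
# input-file set against each stored description and collect the overlaps.
def find_overlaps(requested_files, stored_descriptions):
    requested = set(requested_files)
    overlaps = []
    for desc_id, files in stored_descriptions:  # linear scan over all items
        common = requested & set(files)
        if common:
            overlaps.append((desc_id, common))
    return overlaps

# find_overlaps(["f1", "f2"], [("c1", ["f2", "f3"]), ("c2", ["f4"])])
# -> [("c1", {"f2"})]
```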
22. Performance Evaluation
Implementation
Hadoop is extended to implement Dache by changing the components that are open to application developers.
The cache manager is implemented as an independent server.
23. Experiment settings
Hadoop is run in pseudo-distributed mode on a server with:
an 8-core CPU, each core running at 3 GHz,
16 GB of memory,
and a SATA disk.
Two applications are used to benchmark the speedup of Dache over Hadoop: word-count and tera-sort.
27. Conclusion
Dache requires minimal change to the original MapReduce programming model.
Application code requires only slight changes in order to utilize Dache.
Dache is implemented in Hadoop by extending the relevant components.
Testbed experiments show that it can eliminate all duplicate tasks in incremental MapReduce jobs.
Execution time and CPU utilization are kept to a minimum.
28. Future Work
This scheme uses a large amount of cache space.
A better cache management system will be needed.
Notes:
As you can see, imagine how big this would become if we took statistics for a single day.
HDFS works in a master-slave model. The NameNode keeps details about which data belongs to which DataNodes, and how many copies exist.
Whenever we put data on HDFS, it breaks up into several pieces. Rack awareness ensures that multiple copies of the data are kept on DataNodes across multiple racks.
Incremental processing refers to applications that incrementally grow the input data and continuously apply computations on the input in order to generate output.
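The motivation for Dache follows directly from this definition: when input grows, results for the already-processed portion can be reused instead of recomputed. A toy sketch of that idea (not Dache's mechanism, just the principle) for word counting:

```python
# Sketch of incremental processing: reuse cached counts for lines already
# processed, and compute only over the newly appended lines.
from collections import Counter

def incremental_word_count(cached_counts, num_processed, all_lines):
    new_lines = all_lines[num_processed:]   # only the appended portion
    updated = Counter(cached_counts)
    for line in new_lines:
        updated.update(line.split())        # merge new counts into the cache
    return dict(updated), len(all_lines)

counts, n = incremental_word_count({}, 0, ["a b"])
counts, n = incremental_word_count(counts, n, ["a b", "b c"])
# counts -> {"a": 1, "b": 2, "c": 1}; "a b" was never recounted
```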