Dache: A Data-Aware Caching for Big-Data Applications Using
the MapReduce Framework
 INTRODUCTION
 ABSTRACT
 EXISTING SYSTEM
 PROPOSED SYSTEM
 SYSTEM ARCHITECTURE
 RESULT AND DISCUSSION
 CONCLUSION
INTRODUCTION:
 Google MapReduce: A software framework for
large-scale distributed computing on large
amounts of data.
 Hadoop : An open-source implementation of the
Google MapReduce programming model.
 Two phases:
Map Phase and Reduce Phase.
 Provisioning cache layer for efficiently
identifying and accessing cache items.
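As a minimal illustration of the two phases, the sketch below runs a word count in a single Python process. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample splits are assumptions made for this example; they are not part of Hadoop or Dache.

```python
from collections import defaultdict

def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two hypothetical input splits.
splits = ["big data big cache", "cache data"]
intermediate = [pair for split in splits for pair in map_phase(split)]
result = reduce_phase(shuffle(intermediate))
print(result)
```

The list `intermediate` plays the role of the intermediate data that a cache layer would try to keep and reuse.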
EXISTING SYSTEM:
 MapReduce provides a standardized framework for
large-scale data processing.
 Intermediate data is discarded after a job finishes
because MapReduce has no way to reuse it.
LIMITATIONS:
 Inefficient incremental processing.
 Duplicate computations are performed.
 No mechanism to find duplicate
computations and accelerate job execution.
EXISTING SYSTEM(Cont.):
 Input is split and fed to workers in the map phase.
 Intermediate files generated in the map phase are
shuffled and sorted by the system and fed into
workers in the reduce phase.
 Final results are computed
by multiple reducers and
written to disk.
PROPOSED SYSTEM
 Dache - a data-aware cache system for
big-data applications using the
MapReduce framework.
 Dache aims to extend the MapReduce
framework by provisioning a cache layer
for efficiently identifying and accessing
cache items in a MapReduce
job.
PROPOSED SYSTEM(Cont.):
 Identifies the source input from which a cache
item is obtained, and the operations applied on the
input, so that a cache item produced by the
workers in the map phase is indexed properly.
 Partition operation applied
in map phase.
MAP REDUCE ARCHITECTURE
 Job Client:
• Submit jobs
 Job Tracker:
• Co-ordinate
jobs
 Task Tracker:
• Execute job
Tasks
MAP REDUCE ARCHITECTURE
1. Clients submit jobs to the Job Tracker.
2. Job Tracker talks to the name node.
3. Job Tracker creates an execution plan.
4. Job Tracker submits work to the Task Trackers.
5. Task Trackers report progress via heartbeats.
6. Job Tracker manages the phases.
7. Job Tracker updates the status.
MAP PHASE CACHE DESCRIPTION:
 A cache item is a piece of cached data stored in the
Distributed File System (DFS).
 The content of a cache item is described by the original
data and the operations applied to it.
 2-tuple: {Origin, Operation}.
Origin: Name of a file in the DFS.
Operation: Linear list of available operations
performed on the origin file.
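One way to picture the {Origin, Operation} 2-tuple is as an immutable record that can be used as a lookup key. The sketch below is only an illustration: the class name `CacheDescription`, the file paths, and the operation names are hypothetical and do not come from the Dache implementation.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class CacheDescription:
    """Hypothetical 2-tuple {Origin, Operation} describing one cache item."""
    origin: str                  # name of the source file in the DFS
    operations: Tuple[str, ...]  # ordered operations applied to the origin

# Descriptions compare equal only when both the origin and the operation
# list match, so they can serve directly as dictionary keys.
cache_index = {}
desc = CacheDescription("input/part-0001.txt", ("item_count",))
cache_index[desc] = "cache/part-0001.item_count"

lookup = CacheDescription("input/part-0001.txt", ("item_count",))
other = CacheDescription("input/part-0001.txt", ("sort",))
print(cache_index.get(lookup))  # same origin and operations: found
print(cache_index.get(other))   # same origin, different operation: not found
```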
REDUCE PHASE CACHE DESCRIPTION:
 Input for the reduce phase is also a list of key-value
pairs, where the value could be a list of values.
 The original input and the applied operations are
required.
 The original input is obtained by storing the intermediate
results of the map phase in the DFS.
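The idea of using stored map output as the origin of a reduce-phase cache item can be sketched as follows. This is a simplified, single-process illustration; the dictionary `dfs`, the path `tmp/job1/map-output`, and the helper names are assumptions for the example only.

```python
dfs = {}  # stand-in for the DFS: path -> stored content

def store_map_output(path, grouped_pairs):
    """Persist intermediate (key, [values]) pairs so they can act as an origin."""
    dfs[path] = grouped_pairs
    return path

def cached_reduce(origin_path, operation_name, operation, reduce_cache):
    """Reuse a reduce result when the same origin and operation were seen before."""
    description = (origin_path, operation_name)
    if description not in reduce_cache:
        pairs = dfs[origin_path]
        reduce_cache[description] = {k: operation(v) for k, v in pairs.items()}
    return reduce_cache[description]

reduce_cache = {}
origin = store_map_output("tmp/job1/map-output", {"a": [1, 1], "b": [1]})
first = cached_reduce(origin, "sum", sum, reduce_cache)
second = cached_reduce(origin, "sum", sum, reduce_cache)  # served from the cache
print(first, second is first)
```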
Job Types and Cache Organization
Relation:
 When processing each
file split, the cache
manager reports the
previous file splitting
scheme used in its cache
item.
Job Types and Cache Organization
Relation:
 To find words starting
with ‘ab’, we reuse the
cached results for words
starting with ‘a’, and add
the new result to the
cache.
 Find the best match among
overlapped results [choose
‘ab’ instead of ‘a’]
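The prefix example can be sketched in a few lines of Python. The code assumes the cache is a plain dictionary from prefix to word list and picks the longest cached prefix as the "best match"; the function names and the sample words are hypothetical.

```python
def best_cached_prefix(requested, cache):
    """Return the longest cached prefix of `requested`, or None if none overlaps."""
    candidates = [p for p in cache if requested.startswith(p)]
    return max(candidates, key=len, default=None)

def words_with_prefix(requested, cache, all_words):
    """Answer a prefix query from the narrowest cached superset, then cache it."""
    base = best_cached_prefix(requested, cache)
    source = cache[base] if base is not None else all_words
    result = [w for w in source if w.startswith(requested)]
    cache[requested] = result
    return result

all_words = ["able", "about", "abstract", "apple", "banana"]
cache = {"a": ["able", "about", "abstract", "apple"]}

ab_words = words_with_prefix("ab", cache, all_words)  # filtered from the 'a' entry
best = best_cached_prefix("abs", cache)               # both 'a' and 'ab' overlap
print(ab_words, best)
```

After the first query both ‘a’ and ‘ab’ are cached, so a later request for ‘abs’ starts from the smaller ‘ab’ result rather than from ‘a’.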
CACHE REQUEST AND REPLY:
MAP CACHE:
 Cache requests must be sent out before the file
splitting phase.
 Job tracker issues cache requests to the cache
manager.
 The cache manager replies with a list of cache descriptions.
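A rough sketch of this request-and-reply exchange is shown below. The `CacheManager` class, its `register` and `request` methods, and the `plan_splits` helper are hypothetical names chosen for illustration; they do not reflect the actual Dache or Hadoop interfaces.

```python
class CacheManager:
    """Hypothetical cache manager holding descriptions of stored cache items."""

    def __init__(self):
        self._items = []  # list of (origin, operations, cache_path)

    def register(self, origin, operations, cache_path):
        self._items.append((origin, tuple(operations), cache_path))

    def request(self, origin, operations):
        """Reply with every stored description matching origin and operations."""
        wanted = tuple(operations)
        return [item for item in self._items
                if item[0] == origin and item[1] == wanted]

def plan_splits(input_files, operations, manager):
    """Ask for cache items first; only uncached files go on to file splitting."""
    cached, to_split = {}, []
    for name in input_files:
        replies = manager.request(name, operations)
        if replies:
            cached[name] = replies[0][2]  # reuse the first matching cache item
        else:
            to_split.append(name)
    return cached, to_split

manager = CacheManager()
manager.register("logs/day1", ["word_count"], "cache/day1.wc")
cached, to_split = plan_splits(["logs/day1", "logs/day2"], ["word_count"], manager)
print(cached, to_split)
```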
CACHE REQUEST AND REPLY:
REDUCE CACHE:
 First, compare the requested cache item with the
cached items in the cache manager’s database.
 The cache manager identifies the overlaps between the original
input files of the requested cache and those of the stored cache.
 A linear scan is used here.
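The overlap check with a linear scan might look like the following. The data layout (a list of `(cache_id, input_files)` pairs) and the identifiers are assumptions for this sketch, not the structure of the cache manager's real database.

```python
def find_overlaps(requested_files, stored_items):
    """Scan stored items once and report the input files each shares with the request."""
    requested = set(requested_files)
    overlaps = []
    for cache_id, stored_files in stored_items:  # single pass over the database
        shared = requested & set(stored_files)
        if shared:
            overlaps.append((cache_id, sorted(shared)))
    return overlaps

stored_items = [
    ("reduce-cache-1", ["map-out/0", "map-out/1"]),
    ("reduce-cache-2", ["map-out/7"]),
    ("reduce-cache-3", ["map-out/1", "map-out/2"]),
]
result = find_overlaps(["map-out/1", "map-out/2", "map-out/3"], stored_items)
print(result)
```

The cost grows linearly with the number of stored cache items, which matches the linear scan named on the slide.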
LIFE TIME MANAGEMENT OF
CACHE ITEM
 Cache Manager:
Determines how much time a cache item can be
kept in DFS.
 Two types of policies:
• Fixed storage quota:
Least Recently Used (LRU) eviction is employed.
• Optimal utility:
Estimates the computation time $t_s$ saved by caching a cache
item for a given amount of time $t_a$.
$\mathrm{Expense}_{t_s} = P_{\mathrm{storage}} \times S_{\mathrm{cache}} \times t_s$
$\mathrm{Save}_{t_s} = P_{\mathrm{computation}} \times R_{\mathrm{duplicate}} \times t_s$
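Both policies can be sketched briefly in Python. The `QuotaCache` class is a generic LRU eviction under a size quota, and `worth_keeping` evaluates the two formulas above; the decision rule "keep while the saving is at least the expense", the units of each parameter, and all names are assumptions for illustration rather than details confirmed by the slides.

```python
from collections import OrderedDict

class QuotaCache:
    """Fixed-storage-quota policy: evict the least recently used item when full."""

    def __init__(self, quota_bytes):
        self.quota = quota_bytes
        self.used = 0
        self.items = OrderedDict()  # name -> size, least recently used first

    def touch(self, name):
        self.items.move_to_end(name)  # mark as most recently used

    def add(self, name, size):
        if name in self.items:        # replacing an item: drop its old size
            self.used -= self.items.pop(name)
        self.items[name] = size
        self.used += size
        # Evict oldest entries until within quota (never the item just added).
        while self.used > self.quota and len(self.items) > 1:
            _, evicted_size = self.items.popitem(last=False)
            self.used -= evicted_size

def worth_keeping(p_storage, s_cache, p_computation, r_duplicate, t_s):
    """Optimal-utility policy (assumed rule): keep while saving covers expense."""
    expense = p_storage * s_cache * t_s
    saving = p_computation * r_duplicate * t_s
    return saving >= expense

cache = QuotaCache(100)
cache.add("a", 40)
cache.add("b", 40)
cache.touch("a")    # "b" is now the least recently used item
cache.add("c", 40)  # exceeds the quota, so "b" is evicted
print(list(cache.items), cache.used)
```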
RESULT AND DISCUSSION:
The graph for CPU utilization of Hadoop and
Dache in the two programs:
 Tera-sort Program.
 Word-count Program.
RESULT AND
DISCUSSION(Cont.):
The graph for Completion time for the two
programs using Dache and Hadoop.
 Tera-sort Program.
 Word-count Program.
RESULT AND
DISCUSSION(Cont.):
The graph for Total cache size in GB for the two
programs using Dache and Hadoop.
 Tera-sort Program.
 Word-count Program.
CONCLUSION:
 Dache requires minimal change to the original MapReduce
programming model.
 Application code requires only slight changes in order to
utilize Dache.
 Dache is implemented in Hadoop by extending the relevant
components.
 Testbed experiments show that it can eliminate all the
duplicate tasks in incremental MapReduce jobs.
 Execution time and CPU utilization are reduced.
Any Questions???
