Presented by:
Jithin Raveendran
S7 IT
Roll No: 31
Guided by:
Prof. Remesh Babu
Big Data?
• "Big data" is a buzzword for large-scale distributed data processing applications that operate on exceptionally large amounts of data.
• About 2.5 exabytes (2.5 quintillion bytes) of data are created per day, so much that 90% of the data in the world today has been created in the last two years alone.
Case study with Hadoop MapReduce
Hadoop:
• Open-source software framework from Apache for distributed processing of large data sets across clusters of commodity servers.
• Designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
• Inspired by:
  • Google MapReduce (Hadoop counterpart: Map/Reduce)
  • GFS, the Google File System (Hadoop counterpart: HDFS)
Apache Hadoop has two pillars
• HDFS
  • Self-healing
  • High-bandwidth clustered storage
• MapReduce
  • Retrieval system
  • The mapper function tells the cluster which data points we want to retrieve
  • The reducer function then takes all the data and aggregates it
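The mapper/reducer split above can be illustrated with a word count, one of the benchmarks used later in the deck. This is a minimal single-process sketch in plain Python, not Hadoop code: `mapper`, `reducer`, and `run_job` are hypothetical names, and the grouping step only imitates the shuffle that a real cluster performs between the two phases.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Aggregate all counts emitted for a single word."""
    return word, sum(counts)


def run_job(lines):
    """Toy driver: map every line, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


result = run_job(["big data big cache", "data aware cache"])
print(result)
```

The intermediate `groups` dictionary corresponds to the intermediate data that Dache proposes to cache between runs.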
HDFS - Architecture
• NameNode:
  • Centerpiece of an HDFS file system.
  • Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move, or delete a file.
  • Responds to successful requests by returning a list of the relevant DataNode servers where the data lives.
• DataNode:
  • Stores data in the Hadoop file system.
  • A functional file system has more than one DataNode, with data replicated across them.
• Secondary NameNode:
  • Acts as a checkpoint for the NameNode.
  • Takes a snapshot of the NameNode metadata, which is used whenever a backup is needed.
• HDFS features:
  • Rack awareness
  • Reliable storage
  • High throughput
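The NameNode lookup described on this slide can be pictured as a small in-memory index from file paths to DataNodes. The sketch below is illustrative only: `ToyNameNode`, `add_file`, and `locate` are hypothetical names rather than Hadoop's API, and it ignores block-level placement and rack awareness, which real HDFS handles.

```python
class ToyNameNode:
    """Hypothetical stand-in for NameNode metadata (not the Hadoop API)."""

    def __init__(self, replication=3):
        self.replication = replication
        self.locations = {}  # file path -> list of DataNode names

    def add_file(self, path, datanodes):
        # Keep at most `replication` replicas for the file.
        self.locations[path] = list(datanodes)[: self.replication]

    def locate(self, path):
        # Return the DataNodes holding the file, or fail if it is unknown.
        if path not in self.locations:
            raise FileNotFoundError(path)
        return self.locations[path]


namenode = ToyNameNode(replication=2)
namenode.add_file("/logs/day1.txt", ["dn1", "dn2", "dn3"])
print(namenode.locate("/logs/day1.txt"))
```

The client then reads the data directly from one of the returned DataNodes; the NameNode itself only serves metadata.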
MapReduce Architecture
• Job Client: submits jobs
• Job Tracker: coordinates jobs
• Task Tracker: executes job tasks
MapReduce Architecture
1. Clients submit jobs to the Job Tracker
2. The Job Tracker talks to the NameNode
3. The Job Tracker creates an execution plan
4. The Job Tracker submits work to the Task Trackers
5. Task Trackers report progress via heartbeats
6. The Job Tracker manages the phases
7. The Job Tracker updates the status
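Current System:
• MapReduce provides a standardized framework for large-scale distributed computation.
• Limitation
  • Inefficient at incremental processing, where the input grows over time and the same computation is applied to it repeatedly.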
Proposed System
• Dache: a data-aware cache system for big-data applications using the MapReduce framework.
• Dache aims to extend the MapReduce framework and provision a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
Related Work
• Google Bigtable: handles incremental processing
• Google Percolator: incremental processing platform
• RAMCloud: distributed computing platform that keeps data in RAM
Technical challenges that need to be addressed
• Cache description scheme:
  • Data-aware caching requires each data object to be indexed by its content.
  • Provide a customizable indexing scheme that enables applications to describe their operations and the content of their generated partial results. This is a nontrivial task.
• Cache request and reply protocol:
  • The size of the aggregated intermediate data can be very large. When such data is requested by other worker nodes, determining how to transport it becomes complex.
Cache Description
• Map phase cache description scheme
  • Cache refers to the intermediate data produced by worker nodes/processes during the execution of a MapReduce task.
  • A piece of cached data is stored in a Distributed File System (DFS).
  • The content of a cache item is described by the original data and the operations applied.
2-tuple: {Origin, Operation}
Origin: name of a file in the DFS.
Operation: linear list of the available operations performed on the Origin file.
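The {Origin, Operation} 2-tuple can be modeled as a hashable record so that it works directly as a cache index key. This is a sketch under assumptions: the class name `CacheDescription`, the field names, the file paths, and the operation label `item_count` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CacheDescription:
    """Describes a cache item by its input file and the operations applied to it."""
    origin: str                  # name of the input file in the DFS
    operations: Tuple[str, ...]  # ordered operations performed on the origin file


# Two descriptions with the same content compare equal and hash the same,
# so a later job can find the item produced by an earlier one.
desc_a = CacheDescription("input/part-0001.txt", ("item_count",))
desc_b = CacheDescription("input/part-0001.txt", ("item_count",))

cache_index = {desc_a: "cache/part-0001.item_count"}
hit = cache_index.get(desc_b)
miss = cache_index.get(CacheDescription("input/part-0002.txt", ("item_count",)))
print(hit, miss)
```

Indexing by content rather than by job or task identifier is what makes the cache "data-aware" in the sense used on these slides.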
Cache Description
• Reduce phase cache description scheme
  • The input for the reduce phase is also a list of key-value pairs, where the value could be a list of values.
  • As in the map phase, the original input and the applied operations are required to describe a cache item.
  • The original input is obtained by storing the intermediate results of the map phase in the DFS.
Protocol
Relationship between job types and cache organization
• When processing each file split, the cache manager reports the previous file splitting scheme used in its cache item.
Protocol
Relationship between job types and cache organization
• To find words starting with 'ab', we use the cached results for words starting with 'a', and also add the new result to the cache.
• Find the best match among overlapping results (choose 'ab' instead of 'a').
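The 'a' versus 'ab' example can be sketched as a longest-prefix lookup: a query is answered from the narrowest cached result that still contains everything it needs. The function names and the dictionary-based cache below are hypothetical, chosen only to illustrate the idea; they are not taken from the Dache implementation.

```python
def best_match(requested_prefix, cached_prefixes):
    """Return the longest cached prefix that the request starts with, or None."""
    candidates = [p for p in cached_prefixes if requested_prefix.startswith(p)]
    return max(candidates, key=len, default=None)


def words_with_prefix(prefix, cache, all_words):
    """Answer a prefix query from the best cached result, then cache the answer."""
    base = best_match(prefix, cache.keys())
    source = cache[base] if base is not None else all_words
    result = [word for word in source if word.startswith(prefix)]
    cache[prefix] = result
    return result


words = ["abacus", "able", "apple", "banana", "abort"]
cache = {"a": ["abacus", "able", "apple", "abort"]}

ab_words = words_with_prefix("ab", cache, words)   # filtered from the 'a' result
chosen = best_match("abo", cache.keys())           # 'ab' is now preferred over 'a'
print(ab_words, chosen)
```

Choosing the longest matching prefix keeps the amount of cached data that has to be rescanned as small as possible.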
Protocol
Cache item submission
• Mapper and reducer nodes/processes record cache items in their local storage space.
• A cache item should be put on the same machine as the worker process that generates it.
• A worker node/process contacts the cache manager each time before it begins processing an input data file.
• The worker process receives the tentative description and fetches the cache item.
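The submission flow above can be sketched as follows: a worker asks the cache manager before processing a file, reuses the item on a hit, and otherwise computes the result, stores it locally, and registers it. This is a simplified single-process sketch; `CacheManager`, `process_split`, and the `origin#operation` location format are hypothetical and do not reflect Dache's actual interfaces.

```python
class CacheManager:
    """Hypothetical index from (origin, operation) to a cache item location."""

    def __init__(self):
        self.index = {}

    def lookup(self, origin, operation):
        return self.index.get((origin, operation))

    def register(self, origin, operation, location):
        self.index[(origin, operation)] = location


def process_split(manager, local_store, origin, operation, compute):
    """Return (result, was_cache_hit) for one input file."""
    location = manager.lookup(origin, operation)  # contact the manager first
    if location is not None and location in local_store:
        return local_store[location], True        # fetch the cached item
    result = compute(origin)                      # cache miss: do the work
    location = f"{origin}#{operation}"
    local_store[location] = result                # keep the item on this worker
    manager.register(origin, operation, location)
    return result, False


calls = []


def fake_word_count(origin):
    # Stand-in for real processing; records how often it actually runs.
    calls.append(origin)
    return 42


manager, store = CacheManager(), {}
first = process_split(manager, store, "part-0001", "count", fake_word_count)
second = process_split(manager, store, "part-0001", "count", fake_word_count)
print(first, second, len(calls))
```

The second call returns the stored result without running `fake_word_count` again, which is the duplicate work the cache is meant to remove.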
Lifetime management of cache item
• Cache manager: determines how long a cache item can be kept in the DFS.
• Two types of policies for determining the lifetime of a cache item:
1. Fixed storage quota
  • Least Recently Used (LRU) eviction is employed.
2. Optimal utility
  • Estimates the saved computation time, ts, by caching a cache item for a given amount of time, ta.
  • ts and ta are used to derive the monetary gain and cost.
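The fixed storage quota policy can be sketched with an ordered dictionary that evicts the least recently used items once the quota would be exceeded. The class below is an illustrative assumption, not Dache's code: `QuotaLRUCache`, its method names, and the byte sizes are hypothetical, and it tracks only item sizes, not the cached data itself.

```python
from collections import OrderedDict


class QuotaLRUCache:
    """Tracks cache item sizes under a fixed quota with LRU eviction."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.used = 0
        self.items = OrderedDict()  # key -> size, least recently used first

    def touch(self, key):
        """Mark an item as recently used; return False if it is not cached."""
        if key not in self.items:
            return False
        self.items.move_to_end(key)
        return True

    def put(self, key, size):
        """Add an item and return the keys evicted to make room for it."""
        if size > self.quota_bytes:
            raise ValueError("item is larger than the whole quota")
        if key in self.items:
            self.used -= self.items.pop(key)  # replacing an existing item
        evicted = []
        while self.used + size > self.quota_bytes:
            old_key, old_size = self.items.popitem(last=False)
            self.used -= old_size
            evicted.append(old_key)
        self.items[key] = size
        self.used += size
        return evicted


cache = QuotaLRUCache(quota_bytes=100)
cache.put("a", 40)
cache.put("b", 40)
cache.touch("a")               # 'a' becomes the most recently used item
evicted = cache.put("c", 40)   # 'b' is now the least recently used
print(evicted, list(cache.items))
```

The optimal utility policy is not sketched here because the slides do not give the formula that relates ts and ta to gain and cost.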
Cache request and reply
Map cache:
• Cache requests must be sent out before the file splitting phase.
• The job tracker issues cache requests to the cache manager.
• The cache manager replies with a list of cache descriptions.
Reduce cache:
• First, compare the requested cache item with the cached items in the cache manager's database.
• The cache manager identifies the overlaps between the original input files of the requested cache item and those of the stored cache items.
• A linear scan is used here.
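The reduce-side matching step can be sketched as a linear scan over the stored items, intersecting each item's input files with those of the request. The function name `find_overlaps`, the item identifiers, and the file names below are hypothetical and only illustrate the comparison described on the slide.

```python
def find_overlaps(requested_files, stored_items):
    """Linearly scan stored cache items and report shared input files.

    stored_items is a list of (item_id, input_files) pairs.
    Returns a list of (item_id, sorted overlapping files) for items that overlap.
    """
    requested = set(requested_files)
    matches = []
    for item_id, input_files in stored_items:
        overlap = requested & set(input_files)
        if overlap:
            matches.append((item_id, sorted(overlap)))
    return matches


stored = [
    ("reduce-001", ["map-out-1", "map-out-2"]),
    ("reduce-002", ["map-out-3"]),
    ("reduce-003", ["map-out-2", "map-out-4"]),
]
overlaps = find_overlaps(["map-out-2", "map-out-4", "map-out-5"], stored)
print(overlaps)
```

A linear scan costs time proportional to the number of stored items, which is acceptable only while the cache manager's database stays small.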
Performance Evaluation
Implementation
• Extend Hadoop to implement Dache by changing the components that are open to application developers.
• The cache manager is implemented as an independent server.
Experiment settings
• Hadoop is run in pseudo-distributed mode on a server that has:
  • an 8-core CPU, each core running at 3 GHz
  • 16 GB memory
  • a SATA disk
• Two applications are used to benchmark the speedup of Dache over Hadoop:
  • word-count and tera-sort
Results
[Three slides of result figures; the plots are not recoverable from the extracted text.]
Conclusion
• Requires minimal change to the original MapReduce programming model.
• Application code requires only slight changes in order to utilize Dache.
• Dache is implemented in Hadoop by extending relevant components.
• Testbed experiments show that it can eliminate all the duplicate tasks in incremental MapReduce jobs.
• Reduces execution time and CPU utilization.
Future Work
• The scheme uses a large amount of cache storage.
• A better cache management system will be needed.

Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

  • 1. Presented by: Jithin Raveendran, S7 IT, Roll No: 31. Guided by: Prof. Remesh Babu.
  • 2. Big Data? The buzzword "big data" refers to large-scale distributed data processing applications that operate on exceptionally large amounts of data. About 2.5 quintillion bytes (roughly 2.5 exabytes) of data are created per day, so much that 90% of the data in the world today has been created in the last two years alone.
  • 4. Case study with Hadoop MapReduce. Hadoop: an open-source software framework from Apache for distributed processing of large data sets across clusters of commodity servers. Designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Inspired by Google MapReduce and GFS (Google File System). Its two parts: HDFS and Map/Reduce.
  • 5. Apache Hadoop has two pillars. HDFS: self-healing, high-bandwidth clustered storage. MapReduce: a retrieval system; the mapper function tells the cluster which data points we want to retrieve, and the reducer function then takes all the data and aggregates it.
  • 7. HDFS - Architecture. NameNode: the centerpiece of an HDFS file system. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. It responds to successful requests by returning a list of the relevant DataNode servers where the data lives. DataNode: stores data in the Hadoop file system. A functional file system has more than one DataNode, with data replicated across them.
  • 8. HDFS - Architecture. Secondary NameNode: acts as a checkpoint for the NameNode; it takes a snapshot of the NameNode and uses it whenever a backup is needed. HDFS features: rack awareness, reliable storage, high throughput.
  • 9. MapReduce Architecture. Job Client: submits jobs. Job Tracker: coordinates jobs. Task Tracker: executes job tasks.
  • 10. MapReduce Architecture: 1. Clients submit jobs to the Job Tracker. 2. The Job Tracker talks to the NameNode. 3. The Job Tracker creates an execution plan. 4. The Job Tracker submits work to the Task Trackers. 5. Task Trackers report progress via heartbeats. 6. The Job Tracker manages the phases. 7. The Job Tracker updates the status.
  • 11. Current System: MapReduce is used for providing a standardized framework. Limitation: inefficiency in incremental processing.
  • 12. Proposed System: Dache, a data-aware cache system for big-data applications using the MapReduce framework. Dache aims to extend the MapReduce framework and provision a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
  • 13. Related Work: Google Bigtable handles incremental processing; Google Percolator is an incremental processing platform; RAMCloud is a distributed computing platform that keeps data in RAM.
  • 14. Technical challenges that need to be addressed. Cache description scheme: data-aware caching requires each data object to be indexed by its content. The scheme must provide customizable indexing that enables applications to describe their operations and the content of their generated partial results; this is a nontrivial task. Cache request and reply protocol: the size of the aggregated intermediate data can be very large. When such data is requested by other worker nodes, determining how to transport it becomes complex.
  • 15. Cache Description: map phase cache description scheme. Cache refers to the intermediate data produced by worker nodes/processes during the execution of a MapReduce task. A piece of cached data is stored in a Distributed File System (DFS). The content of a cache item is described by the original data and the operations applied, as a 2-tuple {Origin, Operation}. Origin: the name of a file in the DFS. Operation: a linear list of available operations performed on the Origin file.
  • 16. Cache Description: reduce phase cache description scheme. The input for the reduce phase is also a list of key-value pairs, where the value could be a list of values. The original input and the applied operations are required. The original input is obtained by storing the intermediate results of the map phase in the DFS.
  • 17. Protocol: relationship between job types and cache organization. When processing each file split, the cache manager reports the previous file splitting scheme used in its cache item.
  • 18. Protocol: relationship between job types and cache organization. To find words starting with 'ab', we use the cached results for words starting with 'a', and also add the new result to the cache. Find the best match among overlapping results (choose 'ab' instead of 'a').
  • 19. Protocol: cache item submission. Mapper and reducer nodes/processes record cache items in their local storage space. A cache item should be put on the same machine as the worker process that generates it. The worker node/process contacts the cache manager each time before it begins processing an input data file. The worker process receives the tentative description and fetches the cache item.
  • 20. Lifetime management of cache items. The cache manager determines how long a cache item can be kept in the DFS. Two types of policies determine the lifetime of a cache item: 1. Fixed storage quota: Least Recently Used (LRU) eviction is employed. 2. Optimal utility: estimates the saved computation time, ts, by caching a cache item for a given amount of time, ta; ts and ta are used to derive the monetary gain and cost.
  • 21. Cache request and reply. Map cache: cache requests must be sent out before the file splitting phase. The job tracker issues cache requests to the cache manager. The cache manager replies with a list of cache descriptions. Reduce cache: first, compare the requested cache item with the cached items in the cache manager's database. The cache manager identifies the overlaps between the original input files of the requested cache and the stored cache. A linear scan is used here.
  • 22. Performance Evaluation: implementation. Hadoop is extended to implement Dache by changing the components that are open to application developers. The cache manager is implemented as an independent server.
  • 23. Experiment settings. Hadoop is run in pseudo-distributed mode on a server that has an 8-core CPU with each core running at 3 GHz, 16 GB of memory, and a SATA disk. Two applications are used to benchmark the speedup of Dache over Hadoop: word-count and tera-sort.
  • 27. Conclusion. Dache requires minimal change to the original MapReduce programming model. Application code requires only slight changes in order to utilize Dache. Dache is implemented in Hadoop by extending relevant components. Testbed experiments show that it can eliminate all the duplicate tasks in incremental MapReduce jobs, reducing execution time and CPU utilization.
  • 28. Future Work. This scheme uses a large amount of cache space. A better cache management system will be needed.
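The cache description on slide 15 and the best-match rule on slide 18 can be illustrated with a small sketch. This is not Dache's code: `CacheDescription`, `ToyCacheManager`, and `find_words` are hypothetical names, the cache lives in memory rather than in a DFS, and the Operation is simplified from a linear list of operations to a single word prefix.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CacheDescription:
    """Hypothetical stand-in for the {Origin, Operation} 2-tuple."""
    origin: str     # name of the input file in the DFS
    operation: str  # simplified: the word prefix that was selected, e.g. "a"


class ToyCacheManager:
    """In-memory toy; an assumption for illustration, not Dache's server."""

    def __init__(self) -> None:
        self._items: Dict[CacheDescription, List[str]] = {}

    def put(self, desc: CacheDescription, words: List[str]) -> None:
        self._items[desc] = words

    def get(self, desc: CacheDescription) -> List[str]:
        return self._items[desc]

    def best_match(self, origin: str, prefix: str) -> Optional[CacheDescription]:
        # Linear scan: a cached item is usable if it has the same origin
        # and its prefix is a prefix of the requested one.
        usable = [d for d in self._items
                  if d.origin == origin and prefix.startswith(d.operation)]
        # Among overlapping results, prefer the longest (most specific) prefix.
        return max(usable, key=lambda d: len(d.operation), default=None)


def find_words(manager: ToyCacheManager, origin: str,
               all_words: List[str], prefix: str) -> List[str]:
    """Select words with the given prefix, reusing cached results if possible."""
    match = manager.best_match(origin, prefix)
    candidates = manager.get(match) if match is not None else all_words
    result = [w for w in candidates if w.startswith(prefix)]
    # Add the new result to the cache so later, narrower requests can use it.
    manager.put(CacheDescription(origin, prefix), result)
    return result


words = ["apple", "abacus", "about", "banana", "abs"]
manager = ToyCacheManager()
words_a = find_words(manager, "input.txt", words, "a")
words_ab = find_words(manager, "input.txt", words, "ab")
```

After both calls, a request for the prefix 'abo' on the same file would be matched against the 'ab' item rather than the 'a' item, which mirrors the "choose 'ab' instead of 'a'" rule.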

Editor's notes

  1. As you can see... Can you imagine how big it would become if we took the statistics for one day?
  2. It works on a master-slave model. NameNode: holds details about which data belongs to which DataNodes and how many copies exist.
  3. Whenever we put data on HDFS, it is broken up into several pieces. Rack awareness ensures that multiple copies of the data are on multiple DataNodes across multiple racks.
  4. Incremental processing refers to applications that incrementally grow the input data and continuously apply computations on the input in order to generate output.
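To make the last note concrete, here is a minimal sketch of incremental word counting with cached per-split results. It assumes that input only grows by adding new splits and that existing splits never change; `map_split`, `incremental_word_count`, and the split names are hypothetical and not part of Hadoop or Dache.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical cache: (split name, operation) -> partial word counts
SplitCache = Dict[Tuple[str, str], Counter]


def map_split(text: str) -> Counter:
    """Map step for one split: count whitespace-separated words."""
    return Counter(text.split())


def incremental_word_count(splits: Dict[str, str],
                           cache: SplitCache) -> Tuple[Counter, List[str]]:
    """Return total counts plus the names of splits that had to be computed."""
    total: Counter = Counter()
    computed: List[str] = []
    for name, text in splits.items():
        key = (name, "word_count")
        if key not in cache:
            # Only splits without a cached result are processed.
            cache[key] = map_split(text)
            computed.append(name)
        total += cache[key]  # reduce step: aggregate the partial counts
    return total, computed


cache: SplitCache = {}
splits = {"part-0": "a b a", "part-1": "b c"}
first_total, first_computed = incremental_word_count(splits, cache)

# The input grows by one split; earlier splits are served from the cache.
splits["part-2"] = "c c"
second_total, second_computed = incremental_word_count(splits, cache)
```

On the second run only the new split is mapped, which is the kind of duplicate work a cache layer is meant to avoid in incremental jobs.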