SlideShare ist ein Scribd-Unternehmen logo
1 von 50
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters 02/28/11 Xiao Qin Department of Computer Science and Software Engineering Auburn University http://www.eng.auburn.edu/~xqin [email_address] Slides 2-20 are adapted from notes by Subbarao Kambhampati (ASU), Dan Weld (U. Washington), Jeff Dean, Sanjay Ghemawat, (Google, Inc.)
Motivation ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Map/Reduce ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Distributed Grep Very  big data Split data Split data Split data Split data grep grep grep grep matches matches matches matches cat All matches
Distributed Word Count Very  big data Split data Split data Split data Split data count count count count count count count count merge merged count
Map Reduce ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Very  big data Result M A P R E D U C E Partitioning Function
Map in Lisp (Scheme) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Unary operator Binary operator
Map/Reduce ala Google ,[object Object],[object Object],[object Object],[object Object]
count words in docs ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Count,  Illustrated ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Grep ,[object Object],[object Object],[object Object],[object Object],[object Object]
Reverse Web-Link Graph ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Model is Widely Applicable MapReduce Programs In Google Source Tree  Example uses:  distributed grep   distributed sort    web link-graph reversal  term-vector / host web access log stats  inverted index construction  document clustering  machine learning  statistical machine translation  ...  ...  ...
Implementation Overview ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Execution ,[object Object],[object Object],[object Object],[object Object],[object Object]
Job Processing JobTracker TaskTracker 0 TaskTracker 1 TaskTracker 2 TaskTracker 3 TaskTracker 4 TaskTracker 5 ,[object Object],[object Object],[object Object],[object Object],[object Object],“ grep”
Execution
Parallel Execution
Task Granularity & Pipelining ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
MapReduce outside Google ,[object Object],[object Object],[object Object],Master Slave MapReduce jobtracker tasktracker DFS namenode datanode
Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters Download Software at:   http://www.eng.auburn.edu/~xqin/software/hdfs-hc This HDFS-HC tool was described in our paper  -  Improving  MapReduce  Performance via Data Placement in Heterogeneous  Hadoop  Clusters  - by J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, published in Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.
Hadoop Overview (J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters.  OSDI ’04, pages 137–150, 2008)
One time setup ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Hadoop Distributed File System (http://lucene.apache.org/hadoop)
Motivational Example Time (min) Node A (fast) Node B (slow) Node C (slowest) 2x slower 3x slower 1 task/min
The Native Strategy Node A Node B Node C 3 tasks 2 tasks 6 tasks Loading Transferring  Processing  Time (min)
Our Solution --Reducing data transfer time Node A’ Node B’ Node C’ 3 tasks 2 tasks 6 tasks Loading Transferring  Processing  Time (min) Node A
Preliminary Results Impact of data placement on performance of grep
Challenges    ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Measure Computing Ratios ,[object Object],[object Object],Time  Node A Node B Node C 2x slower 3x slower 1 task/min
Steps to Measure Computing Ratios 1. Run the application on each node with the same size data, individually collect the response time 2. Set the ratio of the shortest response as 1, accordingly set the ratio of other nodes  3.Caculate the least common multiple of these ratios 4. Count the portion of each node Node Response time(s) Ratio # of File Fragments Speed Node A 10 1 6 Fastest Node B 20 2 3 Average Node C 30 3 2 Slowest
Initial Data Distribution Namenode Datanodes File1 6 c ,[object Object],[object Object],C B A Portion   3:2:1 1 2 3 4 5 7 8 9 a b
Data Redistribution ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],1 Namenode 1 2 3 4 5 6 7 8 9 a b c C A C B A B 2 3 4 L1 L2 Portion   3:2:1
Sharing Files among  Multiple Applications ,[object Object],[object Object],[object Object]
Experimental Environment Five nodes in a hadoop heterogeneous cluster Node CPU Model CPU(Hz) L1 Cache(KB) Node A Intel core 2 Duo 2*1G=2G 204 Node B Intel Celeron 2.8G 256 Node C Intel Pentium 3 1.2G 256 Node D Intel Pentium 3 1.2G 256 Node  E Intel Pentium 3 1.2G 256
Grep and WordCount ,[object Object],[object Object]
Computing ratio for two applications Computing ratio of the five nodes with respective of Grep and Wordcount applications Computing Node Ratios for Grep Ratios for Wordcount Node A 1 1 Node B 2 2 Node C 3.3 5 Node D 3.3 5 Node E 3.3 5
Response time of Grep and wordcount in each Node Application dependence Data size independence
Six Data Placement Decisions
Impact of data placement on performance of Grep
Impact of data placement on performance of WordCount
Conclusion ,[object Object],[object Object]
Future Work ,[object Object],[object Object],[object Object]
Fellowship Program  Samuel Ginn College of Engineering at Auburn University ,[object Object],[object Object],[object Object],[object Object]
http://www.eng.auburn.edu/programs/grad-school/fellowship-program/
http://www.eng.auburn.edu
http://www.eng.auburn.edu
http://www.eng.auburn.edu
Download the presentation slides http://www.slideshare.net/xqin74 Google:  slideshare Xiao Qin
Questions

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Shuffle phase as the bottleneck in Hadoop Terasort
Shuffle phase as the bottleneck in Hadoop TerasortShuffle phase as the bottleneck in Hadoop Terasort
Shuffle phase as the bottleneck in Hadoop Terasortpramodbiligiri
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)PyData
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map ReduceApache Apex
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduceHassan A-j
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analyticsAvinash Pandu
 
Applying stratosphere for big data analytics
Applying stratosphere for big data analyticsApplying stratosphere for big data analytics
Applying stratosphere for big data analyticsAvinash Pandu
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsLeila panahi
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programsjani shaik
 

Was ist angesagt? (20)

MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Shuffle phase as the bottleneck in Hadoop Terasort
Shuffle phase as the bottleneck in Hadoop TerasortShuffle phase as the bottleneck in Hadoop Terasort
Shuffle phase as the bottleneck in Hadoop Terasort
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 
Applying stratosphere for big data analytics
Applying stratosphere for big data analyticsApplying stratosphere for big data analytics
Applying stratosphere for big data analytics
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programs
 
MapReduce
MapReduceMapReduce
MapReduce
 

Ähnlich wie HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesKelly Technologies
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreKelly Technologies
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptxShimoFcis
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationateeq ateeq
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poliivascucristian
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)anh tuan
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentationNoha Elprince
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & HadoopAhmed Gamil
 

Ähnlich wie HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters (20)

Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
E031201032036
E031201032036E031201032036
E031201032036
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptx
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Handout3o
Handout3oHandout3o
Handout3o
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
ch02-mapreduce.pptx
ch02-mapreduce.pptxch02-mapreduce.pptx
ch02-mapreduce.pptx
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poli
 
Map reduce
Map reduceMap reduce
Map reduce
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentation
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
 

Mehr von Xiao Qin

How to apply for internship positions?
How to apply for internship positions?How to apply for internship positions?
How to apply for internship positions?Xiao Qin
 
How to write research papers? Version 5.0
How to write research papers? Version 5.0How to write research papers? Version 5.0
How to write research papers? Version 5.0Xiao Qin
 
Making a competitive nsf career proposal: Part 2 Worksheet
Making a competitive nsf career proposal: Part 2 WorksheetMaking a competitive nsf career proposal: Part 2 Worksheet
Making a competitive nsf career proposal: Part 2 WorksheetXiao Qin
 
Making a competitive nsf career proposal: Part 1 Tips
Making a competitive nsf career proposal: Part 1 TipsMaking a competitive nsf career proposal: Part 1 Tips
Making a competitive nsf career proposal: Part 1 TipsXiao Qin
 
Auburn csse faculty orientation
Auburn csse faculty orientationAuburn csse faculty orientation
Auburn csse faculty orientationXiao Qin
 
Auburn CSSE graduate student orientation
Auburn CSSE graduate student orientationAuburn CSSE graduate student orientation
Auburn CSSE graduate student orientationXiao Qin
 
CSSE Graduate Programs Committee: Progress Report
CSSE Graduate Programs Committee: Progress ReportCSSE Graduate Programs Committee: Progress Report
CSSE Graduate Programs Committee: Progress ReportXiao Qin
 
Project 2 How to modify os161: A Manual
Project 2 How to modify os161: A ManualProject 2 How to modify os161: A Manual
Project 2 How to modify os161: A ManualXiao Qin
 
Project 2 how to modify OS/161
Project 2 how to modify OS/161Project 2 how to modify OS/161
Project 2 how to modify OS/161Xiao Qin
 
Project 2 how to install and compile os161
Project 2 how to install and compile os161Project 2 how to install and compile os161
Project 2 how to install and compile os161Xiao Qin
 
Project 2 - how to compile os161?
Project 2 - how to compile os161?Project 2 - how to compile os161?
Project 2 - how to compile os161?Xiao Qin
 
Understanding what our customer wants-slideshare
Understanding what our customer wants-slideshareUnderstanding what our customer wants-slideshare
Understanding what our customer wants-slideshareXiao Qin
 
OS/161 Overview
OS/161 OverviewOS/161 Overview
OS/161 OverviewXiao Qin
 
Surviving a group project
Surviving a group projectSurviving a group project
Surviving a group projectXiao Qin
 
P#1 stream of praise
P#1 stream of praiseP#1 stream of praise
P#1 stream of praiseXiao Qin
 
Data center specific thermal and energy saving techniques
Data center specific thermal and energy saving techniquesData center specific thermal and energy saving techniques
Data center specific thermal and energy saving techniquesXiao Qin
 
How to do research?
How to do research?How to do research?
How to do research?Xiao Qin
 
COMP2710 Software Construction: header files
COMP2710 Software Construction: header filesCOMP2710 Software Construction: header files
COMP2710 Software Construction: header filesXiao Qin
 
COMP2710: Software Construction - Linked list exercises
COMP2710: Software Construction - Linked list exercisesCOMP2710: Software Construction - Linked list exercises
COMP2710: Software Construction - Linked list exercisesXiao Qin
 
How to add system calls to OS/161
How to add system calls to OS/161How to add system calls to OS/161
How to add system calls to OS/161Xiao Qin
 

Mehr von Xiao Qin (20)

How to apply for internship positions?
How to apply for internship positions?How to apply for internship positions?
How to apply for internship positions?
 
How to write research papers? Version 5.0
How to write research papers? Version 5.0How to write research papers? Version 5.0
How to write research papers? Version 5.0
 
Making a competitive nsf career proposal: Part 2 Worksheet
Making a competitive nsf career proposal: Part 2 WorksheetMaking a competitive nsf career proposal: Part 2 Worksheet
Making a competitive nsf career proposal: Part 2 Worksheet
 
Making a competitive nsf career proposal: Part 1 Tips
Making a competitive nsf career proposal: Part 1 TipsMaking a competitive nsf career proposal: Part 1 Tips
Making a competitive nsf career proposal: Part 1 Tips
 
Auburn csse faculty orientation
Auburn csse faculty orientationAuburn csse faculty orientation
Auburn csse faculty orientation
 
Auburn CSSE graduate student orientation
Auburn CSSE graduate student orientationAuburn CSSE graduate student orientation
Auburn CSSE graduate student orientation
 
CSSE Graduate Programs Committee: Progress Report
CSSE Graduate Programs Committee: Progress ReportCSSE Graduate Programs Committee: Progress Report
CSSE Graduate Programs Committee: Progress Report
 
Project 2 How to modify os161: A Manual
Project 2 How to modify os161: A ManualProject 2 How to modify os161: A Manual
Project 2 How to modify os161: A Manual
 
Project 2 how to modify OS/161
Project 2 how to modify OS/161Project 2 how to modify OS/161
Project 2 how to modify OS/161
 
Project 2 how to install and compile os161
Project 2 how to install and compile os161Project 2 how to install and compile os161
Project 2 how to install and compile os161
 
Project 2 - how to compile os161?
Project 2 - how to compile os161?Project 2 - how to compile os161?
Project 2 - how to compile os161?
 
Understanding what our customer wants-slideshare
Understanding what our customer wants-slideshareUnderstanding what our customer wants-slideshare
Understanding what our customer wants-slideshare
 
OS/161 Overview
OS/161 OverviewOS/161 Overview
OS/161 Overview
 
Surviving a group project
Surviving a group projectSurviving a group project
Surviving a group project
 
P#1 stream of praise
P#1 stream of praiseP#1 stream of praise
P#1 stream of praise
 
Data center specific thermal and energy saving techniques
Data center specific thermal and energy saving techniquesData center specific thermal and energy saving techniques
Data center specific thermal and energy saving techniques
 
How to do research?
How to do research?How to do research?
How to do research?
 
COMP2710 Software Construction: header files
COMP2710 Software Construction: header filesCOMP2710 Software Construction: header files
COMP2710 Software Construction: header files
 
COMP2710: Software Construction - Linked list exercises
COMP2710: Software Construction - Linked list exercisesCOMP2710: Software Construction - Linked list exercises
COMP2710: Software Construction - Linked list exercises
 
How to add system calls to OS/161
How to add system calls to OS/161How to add system calls to OS/161
How to add system calls to OS/161
 

Kürzlich hochgeladen

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

  • 1. HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters 02/28/11 Xiao Qin Department of Computer Science and Software Engineering Auburn University http://www.eng.auburn.edu/~xqin [email_address] Slides 2-20 are adapted from notes by Subbarao Kambhampati (ASU), Dan Weld (U. Washington), Jeff Dean, Sanjay Ghemawat, (Google, Inc.)
  • 2.
  • 3.
  • 4. Distributed Grep Very big data Split data Split data Split data Split data grep grep grep grep matches matches matches matches cat All matches
  • 5. Distributed Word Count Very big data Split data Split data Split data Split data count count count count count count count count merge merged count
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13. Model is Widely Applicable MapReduce Programs In Google Source Tree Example uses: distributed grep   distributed sort   web link-graph reversal term-vector / host web access log stats inverted index construction document clustering machine learning statistical machine translation ... ... ...
  • 14.
  • 15.
  • 16.
  • 19.
  • 20.
  • 21. Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters Download Software at: http://www.eng.auburn.edu/~xqin/software/hdfs-hc This HDFS-HC tool was described in our paper -  Improving MapReduce Performance via Data Placement in Heterogeneous Hadoop Clusters  - by J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, published in Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.
  • 22. Hadoop Overview (J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. OSDI ’04, pages 137–150, 2008)
  • 23.
  • 24. Hadoop Distributed File System (http://lucene.apache.org/hadoop)
  • 25. Motivational Example Time (min) Node A (fast) Node B (slow) Node C (slowest) 2x slower 3x slower 1 task/min
  • 26. The Native Strategy Node A Node B Node C 3 tasks 2 tasks 6 tasks Loading Transferring Processing Time (min)
  • 27. Our Solution --Reducing data transfer time Node A’ Node B’ Node C’ 3 tasks 2 tasks 6 tasks Loading Transferring Processing Time (min) Node A
  • 28. Preliminary Results Impact of data placement on performance of grep
  • 29.
  • 30.
  • 31. Steps to Measure Computing Ratios 1. Run the application on each node with the same size data, individually collect the response time 2. Set the ratio of the shortest response as 1, accordingly set the ratio of other nodes 3.Caculate the least common multiple of these ratios 4. Count the portion of each node Node Response time(s) Ratio # of File Fragments Speed Node A 10 1 6 Fastest Node B 20 2 3 Average Node C 30 3 2 Slowest
  • 32.
  • 33.
  • 34.
  • 35. Experimental Environment Five nodes in a hadoop heterogeneous cluster Node CPU Model CPU(Hz) L1 Cache(KB) Node A Intel core 2 Duo 2*1G=2G 204 Node B Intel Celeron 2.8G 256 Node C Intel Pentium 3 1.2G 256 Node D Intel Pentium 3 1.2G 256 Node E Intel Pentium 3 1.2G 256
  • 36.
  • 37. Computing ratio for two applications Computing ratio of the five nodes with respective of Grep and Wordcount applications Computing Node Ratios for Grep Ratios for Wordcount Node A 1 1 Node B 2 2 Node C 3.3 5 Node D 3.3 5 Node E 3.3 5
  • 38. Response time of Grep and wordcount in each Node Application dependence Data size independence
  • 39. Six Data Placement Decisions
  • 40. Impact of data placement on performance of Grep
  • 41. Impact of data placement on performance of WordCount
  • 42.
  • 43.
  • 44.
  • 49. Download the presentation slides http://www.slideshare.net/xqin74 Google: slideshare Xiao Qin

Hinweis der Redaktion

  1. Cite here What is the hadoop? http://www.cloudera.com/what-is-hadoop/ Hadoop is an open-source project administered by the Apache Software Foundation . Hadoop’s contributors work for some of the world’s biggest technology companies. That diverse, motivated community has produced a genuinely innovative platform for consolidating, combining and understanding data. Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce. Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data — and can run large-scale, high-performance processing jobs — in spite of system changes or failures.
  2. Slow down, add animation of speculative task helping Note: Mention that we tested this on a heterogeneous Hadoop cluster
  3. Copy page 7 to here Note: Please add legend on this diagram. E.g., black bars represent ******, red bars represent ***** Show data movement (migration)
  4. Real results for the aforementioned motivational example.
  5. Note: Explain what is the definition of computing ratio
  6. Note: Name the over-utilized node list and the under-utilized node list on the diagram. a->A, b->B
  7. The heterogeneity measurement of a cluster depends on data-intensive applications. If multiple MapReduce applications must process the same input file, the data placement mechanism may need to distribute the input file’s fragments in several ways - one for each MapReduce application. In the case where multiple applications are similar in terms of data processing speed, one data placement decision may fit the needs of all the application
  8. Note: improve the resolution of the figures
  9. Title Application dependence Independenced of Data size Note 1: improve the quality of the figures. Large figures and large legend Note 2:
  10. number