4. • Hadoop is an open-source software framework for
distributed processing of large datasets across
large clusters of computers
• 2 Components
MapReduce engine
Distributed file system
INTRODUCTION
5. • MapReduce engine
Programming model developed by Google
Computation component of Hadoop
Consists of Map and Reduce functions
• HDFS
Storage component of Hadoop
Splits the data into blocks and distributes them
Fault tolerant and self-healing
COMPONENTS
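The Map and Reduce functions named above can be illustrated with a word count. This is a minimal single-process sketch, not Hadoop's API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and the grouping loop stands in for the shuffle that the framework performs between the two phases.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return word, sum(counts)

def run_job(lines):
    # Shuffle step: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

result = run_job(["Hadoop stores data", "Hadoop processes data"])
print(result)
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and each reduce call receives every value for its key.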
7. • HDFS nodes
• NameNode – Maintains metadata information
about files (1 per cluster).
• DataNode – Handles all data allocation and
replication and is installed on each slave node (1
to many per cluster).
• MapReduce nodes
• JobTracker – Schedules job execution and keeps
track of cluster-wide job status (1 per cluster)
• TaskTracker – Receives tasks from job tracker.
Runs on compute nodes in conjunction with data
node (1 to many per cluster).
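To make the NameNode/DataNode split concrete, the sketch below splits a file into blocks and records which DataNodes hold each replica, which is the kind of metadata a NameNode maintains. The block size, replication factor, and round-robin placement are assumptions chosen for illustration; real HDFS uses much larger blocks and its own placement policy.

```python
from itertools import cycle

BLOCK_SIZE = 4     # bytes per block; tiny for illustration (assumption)
REPLICATION = 2    # replicas kept per block (assumption)

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the file contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(filename, data, datanodes):
    """Return NameNode-style metadata: block id -> DataNodes holding a replica."""
    metadata = {}
    ring = cycle(datanodes)  # hypothetical round-robin placement
    for index, _ in enumerate(split_into_blocks(data)):
        block_id = f"{filename}#blk{index}"
        metadata[block_id] = [next(ring) for _ in range(REPLICATION)]
    return metadata

meta = place_blocks("log.txt", b"0123456789", ["dn1", "dn2", "dn3"])
for block_id, holders in meta.items():
    print(block_id, holders)
```

Because every block lives on more than one DataNode, losing a single node leaves at least one replica from which the block can be re-replicated.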
9. • Hadoop FIFO scheduler
Features: schedules jobs on the FIFO principle
Disadvantages: cannot assign priorities to jobs
Reference: REF [6]
• Facebook’s Fair scheduler
Features: even allocation of resources
Disadvantages: no preemption support for large tasks
Reference: REF [4]
• Yahoo’s Capacity scheduler
Features: FIFO scheduler based on priority
Disadvantages: problem in assigning priorities
Reference: REF [6]
LITERATURE SURVEY
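The difference between plain FIFO scheduling and a priority-aware FIFO queue can be sketched as follows. The job names and priority values are made up, and neither function is taken from Hadoop's scheduler code; the sketch only shows why a pure FIFO queue cannot favour an urgent job. It does not model the Fair scheduler's even allocation of resources.

```python
import heapq
from collections import deque

# (name, priority) in arrival order; a larger number means more urgent.
jobs = [("etl", 1), ("report", 5), ("adhoc", 3)]

def fifo_order(jobs):
    """Plain FIFO: jobs run strictly in arrival order, priority is ignored."""
    pending = deque(jobs)
    order = []
    while pending:
        name, _ = pending.popleft()
        order.append(name)
    return order

def priority_order(jobs):
    """Priority first; arrival index breaks ties, so each level stays FIFO."""
    heap = [(-priority, arrival, name)
            for arrival, (name, priority) in enumerate(jobs)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

print(fifo_order(jobs))
print(priority_order(jobs))
```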
14. • Underutilization of the CPU
• Not flexible
• Interaction between the master node and slave nodes
EXISTING SYSTEM
(disadvantages)
15. • Analyze the system for CPU and I/O underutilization
• Use a predictive scheduler for predicting the appropriate
TaskTracker
• Couple the scheduler with a prefetching mechanism to
improve the system performance
PROPOSED SYSTEM
17. • Flexible task scheduler
• Predicts the most appropriate TaskTrackers to which
future tasks should be assigned
• Allows DataNodes to explore underutilization of disk
bandwidth
• Seeks stragglers and predicts candidate data blocks
PREDICTIVE SCHEDULER
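One way such a prediction could work is to extrapolate each running task's remaining time from its progress so far, then pick the TaskTracker expected to free a slot first. The sketch below is an assumption about how this might be done, not the scheduler described in the slides: `RunningTask`, the linear extrapolation, and the straggler threshold of twice the median are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RunningTask:
    tracker: str
    progress: float   # fraction complete, between 0.0 and 1.0
    elapsed: float    # seconds since the task started

def estimated_remaining(task):
    """Extrapolate the remaining time from the observed progress rate."""
    if task.progress <= 0:
        return float("inf")
    return task.elapsed * (1 - task.progress) / task.progress

def predict_next_tracker(tasks):
    """Pick the tracker whose running task is expected to finish soonest."""
    return min(tasks, key=estimated_remaining).tracker

def find_stragglers(tasks, factor=2.0):
    """Flag trackers whose estimate exceeds `factor` times the median estimate."""
    estimates = sorted(estimated_remaining(task) for task in tasks)
    median = estimates[len(estimates) // 2]
    return [task.tracker for task in tasks
            if estimated_remaining(task) > factor * median]

tasks = [
    RunningTask("tt1", progress=0.9, elapsed=90),   # roughly 10 s left
    RunningTask("tt2", progress=0.5, elapsed=60),   # roughly 60 s left
    RunningTask("tt3", progress=0.1, elapsed=50),   # roughly 450 s left
]
print(predict_next_tracker(tasks), find_stragglers(tasks))
```

The predicted tracker is the natural target for prefetching: the data block for its next task can be loaded while the current task is still finishing.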
18. • Integrates with the predictive scheduler
• Multiple worker threads
• Monitors the status of worker threads and coordinates the
prefetching process
PREFETCHING MODULE
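The worker-thread arrangement above can be sketched with a shared request queue: a coordinator hands out block ids, several workers load them into a cache, and a status table lets the coordinator see what each worker is doing. All names here are hypothetical, and the "disk read" is a placeholder string rather than real I/O.

```python
import queue
import threading

def prefetch_worker(requests, cache, status, name):
    """Load requested blocks into the cache until a None sentinel arrives."""
    while True:
        block_id = requests.get()
        if block_id is None:
            status[name] = "stopped"
            return
        status[name] = f"fetching {block_id}"
        cache[block_id] = f"data-for-{block_id}"  # stand-in for a disk read
        status[name] = "idle"

def run_prefetch(block_ids, num_workers=3):
    requests, cache, status = queue.Queue(), {}, {}
    workers = []
    for i in range(num_workers):
        name = f"worker-{i}"
        status[name] = "idle"
        thread = threading.Thread(
            target=prefetch_worker, args=(requests, cache, status, name))
        thread.start()
        workers.append(thread)
    for block_id in block_ids:   # coordinator hands out the predicted blocks
        requests.put(block_id)
    for _ in workers:            # one sentinel per worker shuts the pool down
        requests.put(None)
    for thread in workers:
        thread.join()
    return cache, status

cache, status = run_prefetch(["blk1", "blk2", "blk3", "blk4"])
print(sorted(cache), status)
```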
19. Copying the job from HDFS to TaskTracker
Creation of local working directory for task
Creation of TaskTracker instance
STEPS FOR LAUNCHING
TASKS
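The three launch steps can be mimicked on a local filesystem. This is only an illustration under stated assumptions: a temporary directory stands in for both HDFS and the TaskTracker's local disk, `launch_task` is a hypothetical helper, and a plain dictionary stands in for the instance created in the third step.

```python
import shutil
import tempfile
from pathlib import Path

def launch_task(job_file, task_id, tracker_root):
    # Step 1: copy the job file from shared storage to the tracker's disk.
    local_job = Path(tracker_root) / Path(job_file).name
    shutil.copy(job_file, local_job)
    # Step 2: create a local working directory for this task.
    work_dir = Path(tracker_root) / f"work-{task_id}"
    work_dir.mkdir(parents=True, exist_ok=True)
    # Step 3: create the object that represents the launched task.
    return {"task_id": task_id, "job": local_job, "cwd": work_dir}

with tempfile.TemporaryDirectory() as root:
    shared = Path(root) / "shared"    # stand-in for HDFS
    local = Path(root) / "tracker"    # stand-in for the tracker's local disk
    shared.mkdir()
    local.mkdir()
    job_file = shared / "job.jar"
    job_file.write_bytes(b"job-bytes")
    task = launch_task(job_file, "t1", local)
    copied_ok = task["job"].read_bytes() == b"job-bytes"
    workdir_ok = task["cwd"].is_dir()

print(copied_ok, workdir_ok)
```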
20. ISSUES IN PREFETCHING MODULE
• When to prefetch
• What to prefetch
• How much to prefetch
21. • Avoidance of I/O stalls
• Maximising CPU utilisation
• Helps the smooth functioning of Hadoop
• Flexible
ADVANTAGES
22. • Existing system
Low I/O performance
CPU underutilised
Less flexible
• Proposed system
High I/O performance
Proper CPU utilisation
Additional overhead of prefetching to master
COMPARISON
23. • Hadoop on demand (HOD)
• A scheduler in heterogeneous environment
FUTURE SCOPE
24. • 1. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on
large clusters. OSDI ’04, pages 137–150, 2004.
• 2. M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving
MapReduce performance in heterogeneous environments. In OSDI ’08: 8th
USENIX Symposium on Operating Systems Design and Implementation,
October 2008.
• 3. R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka.
Informed prefetching and caching. SIGOPS Oper. Syst. Rev., 29:79–95,
December 1995.
• 4. Sangwon Seo, Ingook Jang, Kyungchang Woo, Inkyo Kim, et al. HPMR:
Prefetching and pre-shuffling in shared MapReduce computation
environment. In Proceedings of the 11th IEEE International Conference on
Cluster Computing, pages 16–20. ACM, 2009.
• 5. Tom White. Hadoop: The Definitive Guide. O’Reilly, 2009.
• 6. Mark Yong, Nitin Garegrat, and Shiwali Mohan. Towards a Resource
Aware Scheduler in Hadoop.
REFERENCES
Existing system vs. proposed system:
1. Low I/O performance vs. high I/O performance
2. CPU workload underutilised vs. proper utilisation of CPU workload
3. No overhead to master vs. additional overhead of prefetching to master
4. Suited for real-time solutions vs. not suited for real-time solutions