2. About Me
SCJP/OCJP - Oracle Certified Java Programmer
MCP:70-480 - Specialist certification in HTML5
with JavaScript and CSS3 Exam
Skills : Java, Swing, Spring,
Hibernate, JavaFX, jQuery,
Prototype.js, Ext JS.
Connect Me :
https://www.facebook.com/prem.c.mali
http://www.linkedin.com/in/premmali
https://twitter.com/prem_mali
https://plus.google.com/106150245941317924019/about/p/pub
Contact Me :
premchandm@mindfiresolutions.com / prem.c.mali@gmail.com
mfsi_premchandm
Presenter: Prem Chand Mali, Mindfire Solutions
3. Agenda
History
What is Apache Hadoop
Why Apache Hadoop
HDFS
MapReduce
Q&A
4. History
• Hadoop began inside Nutch, a crawler-based search engine project.
• Google published the GFS (2003) and MapReduce (2004) papers.
• Yahoo! hired Doug Cutting in 2006 and gave him a dedicated team.
5. What is Apache Hadoop ?
• Apache Hadoop is an open-source software framework, licensed under the Apache v2
license, that supports data-intensive distributed applications. It supports
running applications on large clusters of commodity hardware.
• Hadoop is designed with the fundamental assumption that hardware failures (of
individual machines, or racks of machines) are common and thus should be
automatically handled in software by the framework.
• Apache Hadoop's MapReduce and HDFS components were originally derived from
Google's MapReduce and Google File System (GFS) papers, respectively.
6. What is Apache Hadoop ?
• The Apache Hadoop framework is composed of the following modules :
– Hadoop Distributed File System (HDFS) - a distributed file-system that stores
data on the commodity machines, providing very high aggregate bandwidth
across the cluster.
– Hadoop MapReduce - a programming model for large scale data processing.
– Hadoop Common - contains libraries and utilities needed by the other Hadoop
modules.
– Hadoop YARN - a resource-management platform responsible for managing
compute resources in clusters and using them for scheduling of users'
applications.
7. Why Apache Hadoop ?
• State of Data
– 90% of existing data was generated in the past three years.
– Type of data
• Unstructured
• Semi-structured
• Relational
– The relational world can handle gigabytes of data.
• Distributed
• Scalable
• Flexible
• Fault tolerant
• Intelligent
8. HDFS
• HDFS is the primary distributed storage used by Hadoop applications. It consists of
the following two types of components:
– NameNode
– DataNode
• HDFS is well suited for distributed storage and distributed processing using
commodity hardware.
• Hadoop supports shell-like commands to interact with HDFS directly.
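The split between NameNode and DataNode can be illustrated with a toy, in-memory model. This is only a sketch under stated assumptions, not the HDFS API: the class names mirror the HDFS roles, but the methods `put` and `get`, the 4-byte block size, the replication factor of 2, and the round-robin placement are all hypothetical choices made for illustration. Real HDFS uses far larger blocks, and clients read block data from DataNodes directly rather than through the NameNode.

```python
class DataNode:
    """Toy DataNode: stores raw block contents, knows nothing about files."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Toy NameNode: keeps only metadata (which blocks make up a file,
    and which DataNodes hold a replica of each block)."""

    def __init__(self, datanodes, block_size=4, replication=2):
        # block_size and replication are illustrative values, not HDFS defaults.
        self.datanodes = datanodes
        self.block_size = block_size
        self.replication = replication
        self.file_blocks = {}  # path -> list of (block_id, [datanode names])

    def put(self, path, data):
        """Split data into fixed-size blocks and place each on several nodes."""
        placements = []
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            block_id = f"{path}#{index}"
            chunk = data[offset:offset + self.block_size]
            # Hypothetical round-robin placement across the cluster.
            targets = [
                self.datanodes[(index + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            for node in targets:
                node.blocks[block_id] = chunk
            placements.append((block_id, [node.name for node in targets]))
        self.file_blocks[path] = placements

    def get(self, path, failed=()):
        """Reassemble a file, skipping any DataNodes listed as failed."""
        by_name = {node.name: node for node in self.datanodes}
        data = b""
        for block_id, holders in self.file_blocks[path]:
            live = [name for name in holders if name not in failed]
            if not live:
                raise IOError(f"all replicas of {block_id} are unavailable")
            data += by_name[live[0]].blocks[block_id]
        return data


if __name__ == "__main__":
    cluster = [DataNode("dn0"), DataNode("dn1"), DataNode("dn2")]
    namenode = NameNode(cluster)
    namenode.put("/demo.txt", b"hello hadoop")
    # The file is still readable when one DataNode is lost.
    print(namenode.get("/demo.txt", failed=("dn1",)))
```

Because every block has a replica on a second node, losing a single DataNode does not lose the file, which is the fault-tolerance property the framework relies on.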
10. MapReduce
• MapReduce is a combination of the following three phases:
– Map
– Shuffle
– Reduce
• It does its job through the JobTracker and the TaskTrackers.
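The three phases can be sketched as a word count in plain Python. This is a single-process illustration of the programming model, not Hadoop's API: the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical, and nothing here is distributed or scheduled by a JobTracker.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all values by key so each key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key, values):
    """Reduce: collapse the grouped values for one key into one result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = []
    for document in documents:
        mapped.extend(map_phase(document))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

In a real cluster the map and reduce calls would run as separate tasks on different machines, and the shuffle would move data between them over the network.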