2. What's Hadoop?
• Apache Hadoop is an open-source software framework used to develop data processing applications that are executed in a distributed computing environment.
• Applications built using Hadoop run on large data sets distributed across clusters of commodity computers. Commodity computers are cheap and widely available, and are mainly useful for achieving greater computational power at low cost.
3. Continue.....
Similar to data residing in the local file system of a personal computer, in Hadoop data resides in a distributed file system, called the Hadoop Distributed File System (HDFS). The processing model is based on the "Data Locality" concept, wherein computational logic is sent to the cluster nodes (servers) containing the data. This computational logic is nothing but a compiled version of a program written in a high-level language such as Java. Such a program processes data stored in Hadoop HDFS.
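The "Data Locality" idea can be illustrated with a toy sketch in plain Python (not Hadoop code; the node names and block contents here are made up): a small piece of logic is shipped to each node's local data block, instead of moving the blocks to the program.

```python
# Toy illustration of Data Locality: the logic travels to the data.
# Node names and block contents are hypothetical.
nodes = {
    "node1": [3, 9, 4],   # data block already stored on node1
    "node2": [7, 2, 8],   # data block already stored on node2
}

def logic(block):
    # Stand-in for the compiled program shipped to each node:
    # here it simply sums the values in the local block.
    return sum(block)

# "Ship" the logic to each node's data and collect the small results.
results = {name: logic(block) for name, block in nodes.items()}
print(results)  # {'node1': 16, 'node2': 17}
```

Only the small results travel back over the network, not the data blocks themselves, which is the point of the concept.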
4. Components of Hadoop
• Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs are capable of processing enormous data in parallel on large clusters of computation nodes.
• HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them on compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.
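As a rough sketch of the MapReduce model just described, here is the classic word count in plain Python (no Hadoop involved; in a real cluster the map calls would run in parallel on the nodes holding the HDFS blocks):

```python
from collections import defaultdict

def map_phase(line):
    # Like a Mapper: emit a (word, 1) pair for every word in the line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Like a Reducer: sum the counts for each distinct key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["Hadoop stores data in HDFS",
         "MapReduce processes data in parallel"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = reduce_phase(pairs)
print(counts)
```

The map step is embarrassingly parallel, which is why MapReduce can scale it out across many computation nodes.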
5. Hadoop Architecture
High-Level Hadoop Architecture
Hadoop has a master-slave architecture for data storage and distributed data processing using the MapReduce and HDFS methods.
6. Continue....
NameNode:
• The NameNode represents every file and directory used in the namespace.
DataNode:
• A DataNode helps you manage the state of an HDFS node and allows you to interact with the data blocks.
Master node:
• The master node allows you to conduct parallel processing of data using Hadoop MapReduce.
7. Continue...
Slave node:
The slave nodes are the additional machines in the Hadoop cluster that allow you to store data and conduct complex calculations. Moreover, every slave node comes with a Task Tracker and a DataNode, which synchronize its processes with the Job Tracker and the NameNode respectively. In Hadoop, the master and slave systems can be set up in the cloud or on premises.
8. • JobTracker – Schedules jobs and tracks the jobs assigned to the Task Tracker.
• Task Tracker – Tracks tasks and reports status to the JobTracker.
• Job – An execution of a Mapper and a Reducer across a dataset.
• Task – An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt – A particular instance of an attempt to execute a task on a slave node.
9. • Implementing a Hadoop system for big data will not set back healthcare analytics. A few benefits offered by Hadoop are as follows:
• It makes data storage less expensive.
• It allows data to be always available.
• It allows storage and management of huge amounts of data.
• It helps researchers establish the interdependence of data with different variables, which is tough for humans to perform.
11. • Before the MapReduce framework existed, parallel and distributed processing happened in a traditional way. So, let us take an example where I have a weather log containing the daily average temperature for the years 2000 to 2015. Here, I want to find the day with the highest temperature in each year.
• So, in the traditional way, I will split the data into smaller parts or blocks and store them on different machines. Then, I will find the highest temperature in each part stored on the corresponding machine. At last, I will combine the results received from each of the machines to get the final output.
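The traditional split-and-combine approach above can be sketched in plain Python (the temperature readings are made-up illustrative values, and only two of the years are shown):

```python
# Hypothetical weather log: (date, year, average temperature).
readings = [
    ("2000-01-14", 2000, 31.2),
    ("2000-07-03", 2000, 38.5),
    ("2001-06-21", 2001, 36.0),
    ("2001-08-02", 2001, 39.1),
]

# Step 1: split the data into smaller parts stored on different machines.
chunks = [readings[:2], readings[2:]]

def local_max(chunk):
    # Step 2: on each machine, find the hottest day per year in its part.
    best = {}
    for day, year, temp in chunk:
        if year not in best or temp > best[year][1]:
            best[year] = (day, temp)
    return best

partials = [local_max(chunk) for chunk in chunks]

# Step 3: combine the per-machine results into the final answer.
final = {}
for part in partials:
    for year, (day, temp) in part.items():
        if year not in final or temp > final[year][1]:
            final[year] = (day, temp)

print(final)  # {2000: ('2000-07-03', 38.5), 2001: ('2001-08-02', 39.1)}
```

This is essentially the same map (per-chunk maximum) and reduce (merge of partial results) structure that MapReduce automates across a cluster.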
12. Advantage
Parallel Processing:
Here, we are dividing the job among multiple nodes, and each node works with a part of the job simultaneously. So, MapReduce is based on the Divide and Conquer paradigm, which helps us process the data using different machines. As the data is processed by multiple machines in parallel instead of by a single machine, the time taken to process the data is reduced tremendously, as shown in the figure below.
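A minimal sketch of this divide-and-conquer speedup, using Python's standard concurrent.futures where worker threads stand in for the cluster nodes (a single-machine simulation, not a real distributed job):

```python
from concurrent.futures import ThreadPoolExecutor

def process_part(part):
    # Work done by one "node" on its slice of the job.
    return max(part)

data = list(range(1000))

# Divide: split the job among 4 "nodes" (every 4th element per part).
parts = [data[i::4] for i in range(4)]

# Conquer: each part is processed simultaneously by its own worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(process_part, parts))

# Combine: merge the partial results into the final answer.
overall = max(partial)
print(overall)  # 999
```

On a real cluster the parts would be HDFS blocks and the workers separate machines, but the divide/conquer/combine shape is the same.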