Hadoop is the popular open source like Facebook, Twitter, RFID readers, sensors, and implementation of MapReduce, a powerful tool so on.Your management wants to derive designed for deep analysis and transformation of information from both the relational data and thevery large data sets. Hadoop enables you to unstructuredexplore complex data, using custom analyses data, and wants this information as soon astailored to your information and questions. possible.Hadoop is the system that allows unstructured What should you do? Hadoop may be the answer!data to be distributed across hundreds or Hadoop is an open source project of the Apachethousands of machines forming shared nothing Foundation.clusters, and the execution of Map/Reduce It is a framework written in Java originallyroutines to run on the data in that cluster. Hadoop developed by Doug Cutting who named it after hishas its own filesystem which replicates data to sons toy elephant.multiple nodes to ensure if one node holding data Hadoop uses Google’s MapReduce and Google Filegoes down, there are at least 2 other nodes from System technologies as its foundation.which to retrieve that piece of information. This It is optimized to handle massive quantities of dataprotects the data availability from node failure, which could be structured, unstructured orsomething which is critical when there are many semi-structured, using commodity hardware, thatnodes in a cluster (aka RAID at a server level). is, relatively inexpensive computers. This massive parallel processing is done with greatWhat is Hadoop? performance. However, it is a batch operation handling massive quantities of data, so theThe data are stored in a relational database in your response time is not immediate.desktop computer and this desktop computer As of Hadoop version 0.20.2, updates are nothas no problem handling this load. possible, but appends will be possible starting inThen your company starts growing very quickly, version 0.21.and that data grows to 10GB. Hadoop replicates its data across differentAnd then 100GB. computers, so that if one goes down, the data areAnd you start to reach the limits of your current processed on one of the replicated computers.desktop computer. Hadoop is not suitable for OnLine Transaction So you scale-up by investing in a larger computer, Processing workloads where data are randomly and you are then OK for a few more months. accessed on structured data like a relational When your data grows to 10TB, and then 100TB. database.Hadoop is not suitable for OnLineAnd you are fast approaching the limits of that Analytical Processing or Decision Support Systemcomputer. workloads where data are sequentially accessed onMoreover, you are now asked to feed your structured data like a relational database, to application with unstructured data coming from generate reports that provide business sources intelligence. Hadoop is used for Big Data. It complements OnLine Transaction Processing and OnLine Analytical Pro
2. ABSTRACT
Hadoop is a framework for running applications on large clusters built of
commodity hardware. The Hadoop framework transparently provides applications
both reliability and data motion. Hadoop implements a computational paradigm
named Map/Reduce, where the application is divided into many small fragments
of work, each of which may be executed or reexecuted on any node in the cluster.
In addition, it provides a distributed file system (HDFS) that stores data on the
compute nodes, providing very high aggregate bandwidth across the cluster. Both
Map/Reduce and the distributed file system are designed so that node failures are
automatically handled by the framework. 2
3. Problem Statement:
The amount total digital data in the world has exploded in recent years.
This has happened primarily due to information (or data) generated by various
enterprises all over the globe. In 2006, the universal data was estimated to be 0.18
zettabytes in 2006, and is forecasting a tenfold growth by 2011 to 1.8 zettabytes.
The problem is that while the storage capacities of hard drives have
increased massively over the years, access speeds—the rate at which data can be
read from drives have not kept up. One typical drive from 1990 could store 1370
MB of data and had a transfer speed of 4.4 MB/s, so we could read all the data
from a full drive in around 300 seconds. In 2010, 1 Tb drives are the standard
hard disk size, but the transfer speed is around 100 MB/s, so it takes more than
two and a half hours to read all the data off the disk.
3
4. Solution Proposed:
Parallelisation:
A very obvious solution to solving this problem is parallelisation. The
input data is usually large and the computations have to be distributed across
hundreds or thousands of machines in order to finish in a reasonable amount of
time. Reading 1 Tb from a single hard drive may take a long time, but on
parallelizing this over 100 different machines can solve the problem in 2 minutes.
Apache Hadoop is a framework for running applications on large cluster built of
commodity hardware. The Hadoop framework transparently provides applications
both reliability and data motion.
It solves the problem of Hardware Failure through replication.
Redundant copies of the data are kept by the system so that in the event of failure,
there is another copy available. (Hadoop Distributed File System)
The second problem is solved by a simple programming model-
MapReduce. This programming paradigm abstracts the problem from data
read/write to computation over a series of keys. Even though HDFS and
MapReduce are the most significant features of Hadoop. 4
84. Conclusion
Hadoop (MapReduce) is one of the very powerful frameworks that enable easy
development on data-intensive application. It objective is help building a
supplication with high scalability with thousands of machines. We can see
Hadoop is very suitable to data-intensive background application and perfect fit to
our project‟s requirements. Apart from running application in parallel, Hadoop
provides some job monitoring features similar to Azure. If any machine crash,
the data could be recovered by other machines, and it will take up the jobs
automatically. When we put Hadoop into cloud, we also see the convenience in
setting up Hadoop. With a few command lines, we can allocate any number of
clusters to run Hadoop, this may save lot of time and effort. We found the
combination of cloud and Hadoop is surely a common way to setup large scale
application with lower cost, but higher elastic property. 84