I have collected information for beginners to provide an overview of big data and Hadoop, which will help them understand the basics and give them a head start.
Big Data and Hadoop Overview
1. Big Data and Hadoop Overview
What is Big Data?
Big data is a term that describes the large volume of data – both structured and unstructured – that
inundates a business on a day-to-day basis. In short, big data is so large and complex that none of the
traditional data management tools can store or process it efficiently.
Who Generates Big Data?
More and more data are being produced by an increasing number of electronic devices surrounding us
and on the internet. The amount of data and the frequency at which it is produced are so vast that
it is referred to as “Big Data”.
Why Is Big Data Important?
The importance of big data doesn’t revolve around how much data you have, but what you do with it.
You can take data from any source and analyze it to find answers that enable 1) cost reductions, 2) time
reductions, 3) new product development and optimized offerings, and 4) smart decision making. When
you combine big data with high-powered analytics, you can accomplish business-related tasks such as:
- Determining root causes of failures, issues and defects in near-real time.
- Generating coupons at the point of sale based on the customer’s buying habits.
- Recalculating entire risk portfolios in minutes.
- Detecting fraudulent behaviour before it affects your organization.
2. Brief History of Big Data
While the term “big data” is relatively new, the act of gathering and storing large amounts of
information for eventual analysis is ages old. The concept gained momentum in the early 2000s when
industry analyst Doug Laney articulated the now-mainstream definition of big data as the three Vs:
Volume – Organizations collect data from a variety of sources, including business transactions, social
media and information from sensor or machine-to-machine data. In the past, storing it would’ve been a
problem – but new technologies have eased the burden.
Velocity – Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID
tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.
Variety – Data comes in all types of formats – from structured, numeric data in traditional databases to
unstructured text documents, email, video, audio, stock ticker data and financial transactions.
More recently, two more Vs have been added:
Variability – In addition to the increasing velocities and varieties of data, data flows can be highly
inconsistent, with periodic peaks. Is something trending in social media? Daily, seasonal and event-
triggered peak data loads can be challenging to manage.
Complexity – Today’s data comes from multiple sources, which makes it difficult to link, match, cleanse
and transform data across systems. However, it’s necessary to connect and correlate relationships,
hierarchies and multiple data linkages, or your data can quickly spiral out of control.
Categories of 'Big Data'
Big data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
Structured: Data stored in a relational database management system is an example of structured data.
Unstructured: The output returned by a Google Search is an example of unstructured data.
Semi-structured: Personal data stored in an XML file is an example of semi-structured data (see the
fragment below).
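To make the last category concrete, here is a small, purely hypothetical XML fragment of the
semi-structured kind: the tags give the record some organization, but there is no rigid schema, and
optional fields may appear in one record and not another.

    <person>
      <name>Jane Doe</name>
      <email>jane@example.com</email>
      <!-- optional field: present in some records, absent in others -->
      <phone type="mobile">555-0100</phone>
    </person>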
3. Evolution of Hadoop
As the World Wide Web grew in the late 1990s and early 2000s, search engines and indexes were
created to help locate relevant information amid the text-based content. In the early years, search
results were returned by humans. But as the web grew from dozens to millions of pages, automation
was needed. Web crawlers were created, many as university-led research projects, and search engine
start-ups took off (Yahoo, AltaVista, etc.).
One such project was an open-source web search engine called Nutch – the brainchild of Doug Cutting
and Mike Cafarella. They wanted to return web search results faster by distributing data and
calculations across different computers so multiple tasks could be accomplished simultaneously. During
this time, another search engine project called Google was in progress. It was based on the same
concept – storing and processing data in a distributed, automated way so that relevant web search
results could be returned faster.
In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s
early work with automating distributed data storage and processing. The Nutch project was divided –
the web crawler portion remained as Nutch and the distributed computing and processing portion
became Hadoop. In 2008, Yahoo released Hadoop as an open-source project. Today, Hadoop’s
framework and ecosystem of technologies are managed and maintained by the non-profit Apache
Software Foundation (ASF), a global community of software developers and contributors.
Fun fact: “Hadoop” was the name of a yellow toy elephant owned by the son of one of its inventors.
4. Why is Hadoop important?
- Ability to store and process huge amounts of any kind of data, quickly. With data volumes and
varieties constantly increasing, especially from social media and the Internet of Things (IoT),
that's a key consideration.
- Computing power. Hadoop's distributed computing model processes big data fast. The more
computing nodes you use, the more processing power you have.
- Fault tolerance. Data and application processing are protected against hardware failure. If a
node goes down, jobs are automatically redirected to other nodes to make sure the distributed
computing does not fail. Multiple copies of all data are stored automatically.
- Flexibility. Unlike traditional relational databases, you don’t have to pre-process data before
storing it. You can store as much data as you want and decide how to use it later. That includes
unstructured data like text, images and videos.
- Low cost. The open-source framework is free and uses commodity hardware to store large
quantities of data.
- Scalability. You can easily grow your system to handle more data simply by adding nodes. Little
administration is required.
What are the key components of Hadoop?
The three core components of the Hadoop framework are:
- MapReduce – A software programming model for processing large sets of data in parallel (see the
sketch after this list).
- HDFS – The Java-based distributed file system that can store all kinds of data without prior
organization.
- YARN – A resource management framework for scheduling and handling resource requests
from distributed applications.
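To make the MapReduce model concrete, here is a minimal sketch of the classic WordCount job, written
against Hadoop’s Java MapReduce API (the org.apache.hadoop.mapreduce package). The mapper emits a
(word, 1) pair for every word it sees and the reducer sums those pairs per word; the input and output
paths are hypothetical command-line arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: for each input line, emit (word, 1) for every token.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sum all the counts emitted for a given word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Note that the reducer doubles as a combiner, so partial sums are computed on each node before data is
shuffled across the network – a common MapReduce optimization.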
5. Types of Hadoop installation
There are various ways in which Hadoop can be run. Here are the scenarios in which Hadoop
can be downloaded, installed and run.
Standalone mode
Though Hadoop is a distributed platform for working with big data, we can even install Hadoop on a
single node as a standalone instance. In this mode the entire Hadoop platform runs as a single Java
process against the local filesystem. It is mostly used for debugging: it lets you test your
MapReduce applications on a single node before running them on a huge Hadoop cluster.
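As a sketch, assuming a Hadoop distribution has been unpacked into the current directory (the exact
paths and the version in the examples jar name vary by release), a standalone run of one of the
bundled example jobs looks like this:

    $ mkdir input
    $ cp etc/hadoop/*.xml input
    $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        grep input output 'dfs[a-z.]+'
    $ cat output/*

Because everything runs in a single Java process on the local filesystem, no daemons need to be
started, and the results appear as ordinary files in the output directory.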
Fully Distributed mode
This is the distributed mode, in which several nodes of commodity hardware are connected to form the
Hadoop cluster. In such a setup the NameNode, JobTracker and Secondary NameNode run on the master
node, while the DataNode and TaskTracker daemons run on the slave nodes.
Pseudo distributed mode
This is, in effect, a single-node setup that simulates an entire Hadoop cluster. The various daemons –
NameNode, DataNode, TaskTracker and JobTracker – all run on a single machine, each as its own Java
process, to form a distributed Hadoop cluster on one node.
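As a sketch of what this looks like in practice (property names follow recent Hadoop releases and may
differ in older versions), pseudo-distributed mode is typically enabled by pointing the default
filesystem at a local HDFS instance and, since there is only one DataNode, setting the replication
factor to 1:

    <!-- etc/hadoop/core-site.xml -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- etc/hadoop/hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

After formatting the NameNode (bin/hdfs namenode -format) and starting HDFS (sbin/start-dfs.sh), files
can be copied in and out with the hdfs dfs shell, for example bin/hdfs dfs -put input /user/<username>/input.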
6. Hadoop ecosystem
- HBase – A scalable, distributed database that supports structured data storage for large tables.
- Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout – A scalable machine learning and data mining library.
- Pig – A high-level data-flow language and execution framework for parallel computation.
- Flume – A distributed, reliable and available service for efficiently collecting, aggregating and
moving large amounts of log data.
- Oozie – A workflow scheduler system to manage Apache Hadoop jobs.
- Sqoop – A tool designed for efficiently transferring bulk data between Apache Hadoop and
structured data stores such as relational databases.
- ZooKeeper – An effort to develop and maintain an open-source server which enables highly
reliable distributed coordination.
- The Oracle R Connector for Hadoop (ORCH) – Provides access to a Hadoop cluster from R,
enabling manipulation of HDFS-resident data and the execution of MapReduce jobs.