2. COMPANY OVERVIEW:
● Company name : LinuxWorld Informatics Pvt Ltd.
● LinuxWorld Informatics Pvt Ltd - a RedHat Awarded Partner, Cisco Learning
Partner and an ISO 9001:2008 Certified Company - is dedicated to offering a
comprehensive set of the most useful Open Source and commercial training
programmes that today’s industry demands.
● The organization specializes in providing training to students of B.Tech,
M.Tech., MCA, BCA and other computer-related courses.
3. COMPANY OVERVIEW Continued...
● Core divisions of the organisation :
Training & Development Services
Technical Support Services
Research & Development Centre
● Courses provided by the organisation :
RedHat Linux
Cloud Computing
BigData Hadoop
DevOps
7. What Is Big Data ?
● Big data is a term for data sets that are so large or complex that
traditional data processing application software is inadequate to deal
with them.
● Generally speaking, big data refers to:
● Large datasets
● The category of computing strategies and technologies used
to handle large datasets.
● A "large dataset" here means a dataset too large to reasonably process or
store with traditional tooling or on a single computer.
10. ● Social Media Data:
Social networking sites such as Facebook and Twitter contain the information and the
views posted by millions of people across the globe.
● Black Box Data:
This is data captured by aircraft flight recorders, which store large amounts of information,
including the conversations between crew members and any other communications (alert
messages or orders) from the technical ground duty staff.
● Search Engine Data:
Search engines retrieve large amounts of data from many different databases.
11. ● Stock Exchange Data:
It holds the complete details of buy and sell transactions, i.e. the ‘buyer’ and
‘seller’ decisions made by customers on the shares of different companies.
● Power Grid Data:
The power grid data mainly holds information about the power consumed by a particular
node with respect to a base station.
● Transport Data:
It includes data from various transport sectors, such as the model, capacity, distance
and availability of a vehicle.
14. VOLUME
● The main characteristic that makes data “big” is the
sheer volume.
● Volume refers to the huge amount of data that is
produced each day by companies.
● The generation of data is so large and complex that
it can no longer be saved or analyzed using
conventional data processing methods.
15. VARIETY
● Variety refers to the diversity of data types and data
sources.
● Types of data :
Structured
Semi-structured
Unstructured
16. VARIETY Continued..
Structured Data :
● Structured data is the most familiar and conventional kind.
● Structured data refers to any data that resides in a fixed
field within a record or file.
● It covers all data that can be stored in SQL databases, in
tables with rows and columns, and in spreadsheets.
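As an illustrative sketch (the in-memory SQLite table and its column names below are invented for the example, not taken from the slides), structured data fits a fixed row-and-column schema that SQL can query directly:

```python
import sqlite3

# Structured data: every record has the same fixed fields (columns),
# so it fits directly into a SQL table. Table and values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "laptop", 999.0), (2, "mouse", 25.0)],
)

# Because the schema is fixed, aggregation is a one-line query.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 1024.0
```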
17. VARIETY Continued..
Unstructured Data :
● Unstructured data represent around 80% of data.
● It is all those things that can't be so readily classified and fit
into a neat box
● It often include text and multimedia content.
● Examples include e-mail messages, word processing
documents, videos, photos, audio files, presentations,
webpages and many other kinds of business documents.
18. VARIETY Continued..
Semi-structured Data :
● Semi-structured data is information that doesn’t reside in a
relational database but does have some organizational
properties that make it easier to analyze.
● Examples of semi-structured data :
CSV, XML and JSON documents; NoSQL databases are also
considered semi-structured.
● Note : Structured and semi-structured data represent only a small
share of all data (5 to 10%), so the dominant type is
unstructured data.
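To illustrate, a small JSON document (the record below is invented for the example) is semi-structured: it has no fixed table schema, but its nested keys still give it enough organization to parse and analyze:

```python
import json

# Semi-structured data: no fixed schema, but the key/value nesting
# still carries structure a program can navigate. Record is made up.
record = json.loads("""
{
  "user": "alice",
  "posts": [
    {"text": "hello", "likes": 3},
    {"text": "big data", "likes": 7}
  ]
}
""")

# Unlike a fixed SQL row, fields may be nested or missing entirely,
# but known keys can still be aggregated.
total_likes = sum(post["likes"] for post in record["posts"])
print(record["user"], total_likes)  # alice 10
```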
19. VELOCITY
● Velocity is the frequency of incoming data that needs to be
generated, analyzed and processed.
● Today this is mostly possible within a fraction of a second, known as
real time.
● Think about how many SMS messages, Facebook status updates, or
credit card swipes are being generated every minute of every day, and
you’ll have a good appreciation of velocity.
● Streaming applications, such as those built on Amazon Web Services, are
examples of applications that handle the velocity of data.
20. VERACITY
● Veracity == Quality
● A lot of data and a big variety of data with fast access are
not enough. The data must have quality and produce
credible results that enable the right action in
decision making.
● Veracity refers to the biases, noise and abnormality in data,
and also to the trustworthiness of the data.
23. Traditional Enterprise Approach
● In this approach, an enterprise uses a single computer to store and process big data.
● For storage, the data is kept in a database from the vendor of their choice, such as
Oracle, IBM, etc.
● The user interacts with the application, which handles data storage and
analysis.
24. LIMITATION
● This approach works well for applications that
require limited storage, processing and database capabilities,
but when it comes to dealing with large amounts of
scalable data, it imposes a bottleneck.
25. SOLUTION
● Google solved this problem
using an algorithm based on
MapReduce.
● This algorithm divides the task
into small parts or units and
assigns them to multiple
computers; the intermediate
results are then combined to
produce the desired final result.
27. HADOOP
● Apache Hadoop is the most important framework for working
with Big Data.
● Hadoop is an open-source framework written in Java.
● It efficiently processes large volumes of data on a cluster of
commodity hardware.
● Hadoop can be set up on a single machine, but the real power of
Hadoop comes with a cluster of machines.
● It can be scaled from a single machine to thousands of nodes.
28. HADOOP Continued...
● Hadoop’s biggest strength is scalability.
● It scales from working on a single node to thousands of
nodes in a seamless manner, without any issue.
● It is designed to scale from a single server to thousands
of machines, each offering local computation and storage.
● It supports the large collection of data set in a distributed
computing environment.
33. HDFS(Hadoop Distributed File System)
● The Hadoop Distributed File System provides high-throughput access
to application data.
● A scalable, fault-tolerant, high-performance distributed file system.
● The Namenode holds the filesystem metadata.
● Files are broken up into blocks and spread over datanodes.
● Data is divided into 64 MB (older default) or 128 MB blocks, and each block
is replicated 3 times (default).
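A rough back-of-the-envelope sketch of the block splitting described above (the 128 MB block size and 3x replication mirror common HDFS defaults; the helper function itself is invented for illustration and is not a real HDFS API):

```python
# Sketch of how HDFS splits a file into fixed-size blocks.
# BLOCK_SIZE and REPLICATION mirror common HDFS defaults;
# split_into_blocks() is an illustrative helper, not an HDFS call.
BLOCK_SIZE = 128 * 1024 * 1024
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes becomes."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes

# A 300 MB file becomes three blocks: 128 MB + 128 MB + 44 MB,
# and with 3x replication the cluster stores 9 block copies in total.
sizes = split_into_blocks(300 * 1024 * 1024)
print(len(sizes), len(sizes) * REPLICATION)  # 3 9
```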
36. MAPREDUCE
● MapReduce is a programming model for processing and generating big data sets with a
parallel, distributed algorithm on a cluster.
● “Map” Step : Each worker node applies the "map()" function to its local data and writes the
output to temporary storage. A master node ensures that only one copy of redundant
input data is processed.
● “Shuffle” Step : Worker nodes redistribute data based on the output keys (produced by the
"map()" function), such that all data belonging to one key is located on the same worker
node.
● “Reduce” Step : Worker nodes then process each group of output data, per key, in parallel.
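The three steps above can be simulated in a single process. A minimal word-count sketch (the function names and sample documents are invented for illustration; real Hadoop runs each step across many nodes):

```python
from collections import defaultdict

def map_step(document):
    """Map: emit (word, 1) pairs for every word in one input split."""
    return [(word, 1) for word in document.lower().split()]

def shuffle_step(mapped_pairs):
    """Shuffle: group all values for the same key together."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_step(groups):
    """Reduce: aggregate each key's list of values independently."""
    return {key: sum(values) for key, values in groups.items()}

# Two tiny "input splits"; on a cluster each would sit on its own node.
documents = ["big data needs big tools", "hadoop processes big data"]
mapped = [pair for doc in documents for pair in map_step(doc)]
counts = reduce_step(shuffle_step(mapped))
print(counts["big"], counts["data"])  # 3 2
```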
40. DOCKER
● Docker is the world’s leading software container platform.
● What is a container ?
Containers are a way to package software in a format that can run isolated on a
shared operating system. Unlike VMs, containers do not bundle a full operating
system - only libraries and settings required to make the software work are
needed. This makes for efficient, lightweight, self-contained systems and
guarantees that software will always run the same, regardless of where it’s
deployed.
41. WHY USE DOCKER ?
Docker automates the repetitive tasks of setting up and
configuring development environments so that developers
can focus on what matters: building great software.