Summer Training Seminar
BigData Hadoop
                               
By: KUMARI SURABHI
COMPANY OVERVIEW:
● Company name: LinuxWorld Informatics Pvt Ltd.
● LinuxWorld Informatics Pvt Ltd is a RedHat Awarded Partner, a Cisco Learning Partner and an ISO 9001:2008 certified company, dedicated to offering a comprehensive set of the most useful Open Source and commercial training programmes that today's industry demands.
● The organization specializes in training students of B.Tech, M.Tech, MCA, BCA and other computer-related courses.
COMPANY OVERVIEW Continued...
● Core divisions of the organisation:
Training & Development Services
Technical Support Services
Research & Development Centre
● Courses provided by the organisation:
RedHat Linux
Cloud Computing
BigData Hadoop
DevOps
WHAT I LEARNED?
1. Course: BigData Hadoop
2. Technologies learned:
● Hadoop
● MapReduce
● Single-node & multi-node clusters
● Docker
● Ansible
● Python
What Is Big Data?
● Big data is a term for data sets so large or complex that traditional data processing software is inadequate to deal with them.
● Generally speaking, "big data" refers to:
● Large datasets
● The category of computing strategies and technologies used to handle large datasets.
● A "large dataset" is one too large to reasonably process or store with traditional tooling or on a single computer.
Categories Of BigData
● Social Media Data:
Social networking sites such as Facebook and Twitter contain the information and views posted by millions of people across the globe.
● Black Box Data:
This is data recorded by aircraft black boxes, which store a large amount of information, including conversations between crew members and any other communications (alert messages or orders passed) with the technical ground staff.
● Search Engine Data:
Search engines retrieve large amounts of data from many different databases.
● Stock Exchange Data:
This holds information (complete details of business transactions) about the 'buy' and 'sell' decisions customers make on the shares of different companies.
● Power Grid Data:
Power grid data mainly holds information on the power consumed by a particular node with respect to a base station.
● Transport Data:
This includes data from various transport sectors, such as the model, capacity, distance and availability of a vehicle.
BigData Challenges & Issues
4 V’s of BigData : 
● Volume
● Variety
● Velocity
● Veracity
VOLUME
● The main characteristic that makes data "big" is its sheer volume.
● Volume describes the huge amount of data produced each day by companies.
● Data generation is now so large and complex that the data can no longer be stored or analyzed using conventional data processing methods.
VARIETY
● Variety refers to the diversity of data types and data sources.
● Types of data:
Structured
Semi-structured
Unstructured
VARIETY Continued..
Structured Data:
● Structured data is the most commonplace kind.
● Structured data refers to any data that resides in a fixed field within a record or file.
● It covers all data that can be stored in SQL databases, in tables with rows and columns, and in spreadsheets.
VARIETY Continued..
Unstructured Data:
● Unstructured data represents around 80% of all data.
● It is everything that can't be readily classified and fitted into a neat box.
● It often includes text and multimedia content.
● Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents.
VARIETY Continued..
Semi-structured Data:
● Semi-structured data is information that doesn't reside in a relational database but does have some organizational properties that make it easier to analyze.
● Examples of semi-structured data:
CSV, XML and JSON documents are semi-structured, and data in NoSQL databases is also considered semi-structured.
● Note: structured and semi-structured data represent only a small share of all data (roughly 5 to 10%), so the dominant data type is unstructured data.
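To make the semi-structured idea concrete, the sketch below shows the same record in two semi-structured formats: a flat CSV row and a nested, self-describing JSON document. The field names (`id`, `name`, `city`) are made up for illustration; only standard-library parsers are used.

```python
import csv
import io
import json

# The same record as CSV (flat, with a header row giving it some structure)
record_csv = "id,name,city\n1,Asha,Jaipur\n"
rows = list(csv.DictReader(io.StringIO(record_csv)))

# ...and as JSON (nested keys make the structure self-describing)
record_json = '{"id": 1, "name": "Asha", "address": {"city": "Jaipur"}}'
doc = json.loads(record_json)

print(rows[0]["city"])         # Jaipur
print(doc["address"]["city"])  # Jaipur
```

Neither format enforces a fixed schema the way a relational table does, yet both carry enough organizational hints (headers, keys) to be parsed and analyzed easily.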
VELOCITY
● Velocity is the frequency at which incoming data needs to be generated, analyzed and processed.
● Today this is often possible within a fraction of a second, known as real time.
● Think about how many SMS messages, Facebook status updates, or credit card swipes are sent over a particular telecom carrier every minute of every day, and you'll have a good appreciation of velocity.
● Streaming services such as those offered by Amazon Web Services are examples of applications that handle the velocity of data.
VERACITY
● Veracity == Quality
● A lot of data and a big variety of data with fast access are not enough. The data must have quality and produce credible results that enable the right action when it comes to decision making.
● Veracity refers to the biases, noise and abnormality in data, and also to the trustworthiness of the data.
BIGDATA SOLUTIONS
Traditional Enterprise Approach
● In this approach, an enterprise uses a single computer to store and process big data.
● For storage, a database from the vendor of their choice, such as Oracle or IBM, is used.
● The user interacts with the application, which in turn handles data storage and analysis.
LIMITATION
● This approach works well for applications that require only modest storage, processing and database capabilities, but when it comes to dealing with large amounts of scalable data, it becomes a bottleneck.
SOLUTION
● Google solved this problem using an algorithm called MapReduce.
● This algorithm divides the task into small parts and assigns them to many computers; the intermediate results are then integrated to produce the desired result.
Hadoop As A Rescue
HADOOP
● Apache Hadoop is the most important framework for working with Big Data.
● Hadoop is an open-source framework written in Java.
● It efficiently processes large volumes of data on a cluster of commodity hardware.
● Hadoop can be set up on a single machine, but its real power comes with a cluster of machines.
● It can be scaled from a single machine to thousands of nodes.
HADOOP Continued...
● Hadoop's biggest strength is scalability.
● It scales seamlessly from a single node to thousands of nodes without any issue.
● It is designed to run on anything from a single server to thousands of machines, each offering local computation and storage.
● It supports large data sets in a distributed computing environment.
Hadoop Framework Architecture
Hadoop High-Level Architecture
The Hadoop architecture is based on two main components, MapReduce and HDFS:
HDFS & MapReduce
HDFS (Hadoop Distributed File System)
● The Hadoop Distributed File System provides high-throughput access to application data.
● It is a scalable, fault-tolerant, high-performance distributed file system.
● The namenode holds the filesystem metadata.
● Files are broken up into blocks and spread over datanodes.
● Data is divided into blocks of 64 MB (the older default) or 128 MB, and each block is replicated 3 times (the default).
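The block-and-replica arithmetic above can be sketched with a small helper. This is an illustrative calculation only (the function name and the simplified MB-based math are mine, not part of HDFS), but it shows how block size and replication factor determine block count and raw storage cost:

```python
import math

def hdfs_block_stats(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how HDFS would split and replicate a file (illustrative only)."""
    # A file is split into fixed-size blocks; the last block may be partial.
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # Each block is stored `replication` times across the datanodes.
    total_block_copies = num_blocks * replication
    # Approximate raw disk usage across the whole cluster.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, total_block_copies, raw_storage_mb

# A 500 MB file with 128 MB blocks and replication factor 3:
print(hdfs_block_stats(500))  # (4, 12, 1500)
```

So a 500 MB file becomes 4 blocks, stored as 12 block copies, consuming roughly 1500 MB of raw cluster storage.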
  ARCHITECTURE OF HDFS
       WORKING OF HDFS
MAPREDUCE
● MapReduce is a programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
● "Map" Step: Each worker node applies the "map()" function to the local data and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed.
● "Shuffle" Step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
● "Reduce" Step: Worker nodes then process each group of output data, per key, in parallel.
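The three steps above can be sketched in plain Python with the classic word-count example. This is a single-process analogy, not real Hadoop code: the function names are mine, and the "shuffle" here is just an in-memory grouping that stands in for the network redistribution between worker nodes.

```python
from collections import defaultdict

def map_step(document):
    # "Map": emit a (key, value) pair for every word in the local data.
    return [(word.lower(), 1) for word in document.split()]

def shuffle_step(mapped_pairs):
    # "Shuffle": group all values belonging to one key together, as if
    # they had been redistributed to the same worker node.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_step(groups):
    # "Reduce": process each group of values, per key.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data is big", "hadoop processes big data"]
mapped = [pair for doc in documents for pair in map_step(doc)]
counts = reduce_step(shuffle_step(mapped))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In a real cluster, each document would live on a different datanode, the map step would run locally on each node, and only the shuffled (key, values) groups would travel over the network.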
     MAPREDUCE  PROCESS
The world's leading software container platform
VMs vs CONTAINERS
DOCKER
● Docker is the world's leading software container platform.
● What is a container?
Containers are a way to package software in a format that can run isolated on a shared operating system. Unlike VMs, containers do not bundle a full operating system; only the libraries and settings required to make the software work are included. This makes for efficient, lightweight, self-contained systems and guarantees that software will always run the same, regardless of where it's deployed.
WHY USE DOCKER?
Docker automates the repetitive tasks of setting up and configuring development environments so that developers can focus on what matters: building great software.
ANY QUERIES?

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Big Data Technologies & Applications
Introduction to Big Data Technologies & ApplicationsIntroduction to Big Data Technologies & Applications
Introduction to Big Data Technologies & ApplicationsNguyen Cao
 
Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real WorldMark Kromer
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!DataWorks Summit
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An OverviewC. Scyphers
 
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dataconomy Media
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouseStephen Alex
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Overview of big data in cloud computing
Overview of big data in cloud computingOverview of big data in cloud computing
Overview of big data in cloud computingViet-Trung TRAN
 
Building big data solutions on azure
Building big data solutions on azureBuilding big data solutions on azure
Building big data solutions on azureEyal Ben Ivri
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time ApplicationsDataWorks Summit
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
 
Pentaho Analytics on MongoDB
Pentaho Analytics on MongoDBPentaho Analytics on MongoDB
Pentaho Analytics on MongoDBMark Kromer
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsKamalika Dutta
 

Was ist angesagt? (20)

Introduction to Big Data Technologies & Applications
Introduction to Big Data Technologies & ApplicationsIntroduction to Big Data Technologies & Applications
Introduction to Big Data Technologies & Applications
 
Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real World
 
The Ecosystem is too damn big
The Ecosystem is too damn big The Ecosystem is too damn big
The Ecosystem is too damn big
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An Overview
 
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Big Data Telecom
Big Data TelecomBig Data Telecom
Big Data Telecom
 
Overview of big data in cloud computing
Overview of big data in cloud computingOverview of big data in cloud computing
Overview of big data in cloud computing
 
Big Data Tech Stack
Big Data Tech StackBig Data Tech Stack
Big Data Tech Stack
 
Building big data solutions on azure
Building big data solutions on azureBuilding big data solutions on azure
Building big data solutions on azure
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
 
Data lake
Data lakeData lake
Data lake
 
Pentaho Analytics on MongoDB
Pentaho Analytics on MongoDBPentaho Analytics on MongoDB
Pentaho Analytics on MongoDB
 
Data lake ppt
Data lake pptData lake ppt
Data lake ppt
 
BigData
BigDataBigData
BigData
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 

Ähnlich wie BigData Hadoop

Hadoop Training Tutorial for Freshers
Hadoop Training Tutorial for FreshersHadoop Training Tutorial for Freshers
Hadoop Training Tutorial for Freshersrajkamaltibacademy
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysDemi Ben-Ari
 
Simply Business' Data Platform
Simply Business' Data PlatformSimply Business' Data Platform
Simply Business' Data PlatformDani Solà Lagares
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 
Data Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxData Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxPriyadarshini648418
 
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)Bridging the Last Mile: Getting Data to the People Who Need It (APAC)
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)Denodo
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoopAnusha sweety
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training reportSarvesh Meena
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 
Overcoming Today's Data Challenges with MongoDB
Overcoming Today's Data Challenges with MongoDBOvercoming Today's Data Challenges with MongoDB
Overcoming Today's Data Challenges with MongoDBMongoDB
 
Traditional data word
Traditional data wordTraditional data word
Traditional data wordorcoxsm
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation Shivanee garg
 
Overcoming Today's Data Challenges with MongoDB
Overcoming Today's Data Challenges with MongoDBOvercoming Today's Data Challenges with MongoDB
Overcoming Today's Data Challenges with MongoDBMongoDB
 

Ähnlich wie BigData Hadoop (20)

Hadoop Training Tutorial for Freshers
Hadoop Training Tutorial for FreshersHadoop Training Tutorial for Freshers
Hadoop Training Tutorial for Freshers
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
Big Data
Big DataBig Data
Big Data
 
Simply Business' Data Platform
Simply Business' Data PlatformSimply Business' Data Platform
Simply Business' Data Platform
 
BigData Analysis
BigData AnalysisBigData Analysis
BigData Analysis
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
Data Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxData Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptx
 
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)Bridging the Last Mile: Getting Data to the People Who Need It (APAC)
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Overcoming Today's Data Challenges with MongoDB
Overcoming Today's Data Challenges with MongoDBOvercoming Today's Data Challenges with MongoDB
Overcoming Today's Data Challenges with MongoDB
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
Dataweek-Talk-2014
Dataweek-Talk-2014Dataweek-Talk-2014
Dataweek-Talk-2014
 
Traditional data word
Traditional data wordTraditional data word
Traditional data word
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Overcoming Today's Data Challenges with MongoDB
Overcoming Today's Data Challenges with MongoDBOvercoming Today's Data Challenges with MongoDB
Overcoming Today's Data Challenges with MongoDB
 

Kürzlich hochgeladen

Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction managementMariconPadriquez1
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 

Kürzlich hochgeladen (20)

Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction management
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 

BigData Hadoop

  • 2. COMPANY OVERVIEW:  ● Company name : LinuxWorld Informatics Pvt Ltd. ● LinuxWorld Informatics Pvt Ltd - RedHat Awarded Partner, Cisco Learning Partner and An ISO 9001:2008 Certified Company is dedicated to offering a comprehensive set of most useful Open Source and Commercial training programmes today’s industry demands. ● This organization is specialized in providing training to the students of B.Tech, M. Tech., MCA, BCA and other students who are pursuing course in computer related technologies.
  • 3. COMPANY OVERVIEW Continued...  ● Core Division of the organisation : Training & Development Services Technical Support Services Research & Development Centre ● Courses provided by the organisation : RedHat Linux Cloud Computing BigData Hadoop DevOps
  • 6.
  • 7. What Is Big Data ? ● Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. ● Generally speaking, big data is: ● Large Datasets ● The category of computing strategies and technologies that are used to handle large datasets. ● "Large dataset" means a dataset too large to reasonably process or store with traditional tooling or on a single computer
  • 8.
  • 10. ● Social Media Data: Social networking sites such as Face book andTwitter contains the information and the views posted by millions of people across the globe. ● Black Box Data: It is an incorporated by flight crafts, which stores a large sum of information, which includes the conversation between crew members and any other communications (alert messages or any order passed)by the technical grounds duty staff. ● Search Engine Data: Search engines retrieve a large amount of data from different sources of database.
  • 11. ● Stock Exchange Data: It holds information (complete details of in and out of business transactions) about the ‘buyer’ and ‘seller’ decisions in terms of share between different companies made by the customers. ● Power Grid Data: The power grid data mainly holds the information consumed by a particular node in terms of base station. ● Transport Data: It includes the data’s from various transport sectors such as model, capacity, distance and availability of a vehicle.
  • 14. VOLUME    ● The main characteristic that makes data “big” is the sheer volume. ● Volume defines the huge amount of data that is produced each day by companies. ● The generation of data is so large and complex that it can no longer be saved or analyzed using conventional data processing methods.
  • 15. VARIETY    ● Variety refers to the diversity of data types and data sources. ● Types of data : Structured, Semi-structured, Unstructured
  • 16.   VARIETY Continued..          Structured Data :    ● Structured data is the most familiar kind. ● It refers to any data that resides in a fixed field within a record or file. ● It covers all data that can be stored in a SQL database, in tables with rows and columns, or in spreadsheets.
  • 17.   VARIETY Continued..          Unstructured Data :    ● Unstructured data represents around 80% of all data. ● It is everything that cannot be readily classified into a neat, fixed schema. ● It often includes text and multimedia content. ● Examples include e-mail messages, word-processing documents, videos, photos, audio files, presentations, web pages and many other kinds of business documents.
  • 18.   VARIETY Continued..      Semi-structured Data :    ● Semi-structured data is information that doesn't reside in a relational database but does have some organizational properties that make it easier to analyze. ● Examples include CSV, XML and JSON documents; NoSQL databases are also commonly considered semi-structured. ● Note : structured and semi-structured data together represent only a small share of all data (roughly 5 to 10%), so the dominant type is unstructured data.
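The structured/semi-structured distinction above can be sketched in a few lines of Python (a minimal illustration with made-up records, not data from the seminar): a CSV table has the same fixed columns in every row, while JSON records may each carry a different set of fields.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
csv_text = "id,name,city\n1,Asha,Jaipur\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON records carry their own structure,
# and fields may differ from record to record.
json_text = '[{"id": 1, "name": "Asha", "tags": ["admin"]}, {"id": 2, "name": "Ravi"}]'
records = json.loads(json_text)

print(rows[0]["city"])        # every CSV row has a "city" field
print(records[1].get("tags")) # this JSON record has no "tags" -> None
```

Because the JSON records describe their own structure, a parser can still navigate them without a predefined schema, which is exactly what makes this data "semi" rather than fully structured.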
  • 19.          VELOCITY    ● Velocity is the frequency at which incoming data is generated, analyzed and processed. ● Today this is often possible within a fraction of a second, i.e. in real time. ● Think about how many SMS messages, Facebook status updates, or credit card swipes flow through a particular telecom carrier every minute of every day, and you'll have a good appreciation of velocity. ● Streaming services such as those offered by Amazon Web Services are examples of platforms built to handle the velocity of data.
  • 20.          VERACITY    ● Veracity == Quality ● A lot of data, a big variety of data and fast access are not enough: the data must also be of good quality, producing credible results that support the right decisions. ● Veracity refers to the biases, noise and abnormality in data, and also to its trustworthiness.
  • 23. Traditional Enterprise Approach ● In this approach, an enterprise uses a single computer to store and process big data. ● For storage, the enterprise picks a database vendor of its choice, such as Oracle or IBM. ● The user interacts with the application, which handles data storage and analysis.
  • 24.          LIMITATION    ● This approach works well for applications that require only modest storage, processing and database capabilities, but when it comes to dealing with large amounts of scalable data it becomes a bottleneck.
  • 25.            SOLUTION    ● Google solved this problem with an algorithm called MapReduce. ● The algorithm divides the task into small parts, assigns them to multiple computers, and then integrates the intermediate results from each machine into the final desired result.
  • 27.              HADOOP ● Apache Hadoop is the most widely used framework for working with Big Data. ● Hadoop is an open-source framework written in Java. ● It efficiently processes large volumes of data on a cluster of commodity hardware. ● Hadoop can be set up on a single machine, but its real power comes with a cluster of machines. ● It can be scaled from a single machine to thousands of nodes.
  • 28.            HADOOP Continued... ● Hadoop's biggest strength is scalability. ● It grows from a single node to thousands of nodes in a seamless manner, without any issue. ● It is designed to run on anything from a single server to thousands of machines, each offering local computation and storage. ● It supports large collections of datasets in a distributed computing environment.
  • 33.   HDFS (Hadoop Distributed File System) ● The Hadoop Distributed File System provides high-throughput access to application data. ● It is a scalable, fault-tolerant, high-performance distributed file system. ● The namenode holds the filesystem metadata. ● Files are broken up into blocks and spread over datanodes. ● Data is divided into blocks of 64 MB (the older default) or 128 MB, and each block is replicated 3 times by default.
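The block arithmetic above can be illustrated with a short Python sketch (assuming the common 128 MB block size and replication factor 3; real clusters may be configured differently):

```python
import math

BLOCK_SIZE_MB = 128   # assumed HDFS block size
REPLICATION = 3       # assumed replication factor

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw storage used in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The last block may be smaller than BLOCK_SIZE_MB; HDFS does not
    # pad it, so raw storage is the file size times the replication factor.
    return blocks, file_size_mb * REPLICATION

blocks, storage = hdfs_footprint(500)
print(blocks, storage)   # 4 blocks, 1500 MB of raw storage
```

So a 500 MB file is split into 4 blocks, and with 3-way replication it occupies 1500 MB of raw cluster storage, spread across datanodes.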
  • 36. MAPREDUCE ● MapReduce is a programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster. ● "Map" step : each worker node applies the map() function to its local data and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed. ● "Shuffle" step : worker nodes redistribute data based on the output keys (produced by the map() function), so that all data belonging to one key ends up on the same worker node. ● "Reduce" step : worker nodes then process each group of output data, per key, in parallel.
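The three steps above can be imitated in a few lines of plain Python (a single-machine sketch of the programming model, not the Hadoop framework itself), using the classic word-count example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # "Map" step: emit a (word, 1) pair for every word in the local data.
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    # "Shuffle" step: group all values for the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # "Reduce" step: aggregate each key's group independently.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big cluster", "data cluster data"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle_phase(mapped))
print(counts)   # {'big': 2, 'data': 3, 'cluster': 2}
```

In Hadoop the map calls run on different nodes, the shuffle moves data across the network, and the reduces run in parallel per key; the sketch only shows the data flow of the model.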
  • 40. DOCKER ● Docker is the world's leading software container platform. ● What is a container ? Containers are a way to package software in a format that can run isolated on a shared operating system. Unlike VMs, containers do not bundle a full operating system: only the libraries and settings required to make the software work are included. This makes for efficient, lightweight, self-contained systems and guarantees that software will always run the same, regardless of where it is deployed.
  • 41. WHY USE DOCKER ? Docker automates the repetitive tasks of setting up and configuring development environments so that developers can focus on what matters: building great software.