Hadoop and Big Data

Harshdeep Kaur, Associate Software Consultant Developer (Java) at Seasia Infotech
HADOOP
Presentation by: Harshdeep Kaur
Roll no: 7704
Submitted to: Mrs Anu Singla
Things included:
 History of data
 What is Big Data?
 Big Data challenges
 Big Data solutions
 Hadoop
 Hadoop architecture
 HDFS
 MapReduce
 How does Hadoop work?
 Environment setup
 Who uses Hadoop?
HISTORY OF DATA
• Today we all generate data
• Data volumes are now measured in terabytes (TB), even petabytes (PB)
• Anything we want to know, we simply look up on the internet
• Even a kindergarten child's rhyme is on the internet
• Earlier we needed floppy disks to save our data; now we have moved to the cloud
• An estimated 90% of the data in the world today has been created in the last
two years alone
Organization   Est. data stored   Est. data processed per day
eBay           200 PB             100 PB
Google         1500 PB            100 PB
Facebook       300 PB             600 TB
Twitter        200 PB             100 TB

A flood of data is coming from many sources.
What is Big Data?
Big data is a collection of large datasets that cannot be processed using
traditional computing techniques. It is not a single technique or tool; rather,
it spans many areas of business and technology.
According to IBM, 80% of the data captured today is unstructured.
It is gathered from sources such as:
 Posts to social media sites
 Digital pictures and videos
 Purchase transaction records
 Cell phone GPS signals
 Sensors used to gather climate information
 Black box data
 Power grid data
 Search engine data
Big Data Challenges
The major challenges associated with big data are as follows:
 Capturing data
 Curation
 Storage
 Searching
 Sharing
 Transfer
 Analysis
 Presentation
To meet these challenges, organizations normally rely on enterprise servers.
BIG DATA SOLUTIONS
Traditional Enterprise Approach
In this approach, an enterprise has a central computer to store and process big
data, backed by a database from the vendor of its choice, such as Oracle or IBM.
Limitation
When it comes to huge amounts of ever-growing data, pushing everything through
a single database becomes a bottleneck.
Google’s Solution
Google solved this problem with an algorithm called MapReduce. It divides the
task into small parts and assigns them to many computers: commodity hardware
that may be single-CPU machines or servers with higher capacity.
Hadoop
• Doug Cutting and Mike Cafarella developed an open source project called
Hadoop
• Hadoop is an Apache open source framework written in Java
• It allows distributed processing of large datasets across clusters of
computers using simple programming models
• It is designed to scale up from a single server to thousands of machines,
each offering local computation and storage
• Hadoop runs applications using the MapReduce algorithm
Hadoop Architecture
Hadoop is designed and built on two independent frameworks:
Hadoop = HDFS + MapReduce
Hadoop has a master/slave architecture for both storage and processing.
The Hadoop Distributed File System (HDFS) follows a distributed file system
design and runs on commodity hardware. Unlike some other distributed systems,
HDFS is highly fault-tolerant despite being designed for low-cost hardware.
MapReduce is a parallel programming model for writing distributed applications
that efficiently process large amounts of data on large clusters (thousands of
nodes) of commodity hardware.
COMPONENTS OF HDFS
Namenode
 The namenode runs on a system with the GNU/Linux
operating system and the namenode software.
 The system hosting the namenode acts as the
master server and performs the following tasks:
 Manages the file system namespace.
 Regulates clients’ access to files.
 Executes file system operations such
as renaming, closing, and opening files
and directories.
COMPONENTS OF HDFS
Datanode
 Each datanode runs on a system with the GNU/Linux
operating system and the datanode software.
 For every node in a cluster, there is a
datanode. These nodes manage the data
storage of their system.
 Datanodes perform read-write operations on the
file system, as per client requests.
 They also perform operations such as block
creation, deletion, and replication according to
the instructions of the namenode.
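The division of labour above can be sketched as a toy model in plain Java. The class and field names here are hypothetical, not the HDFS API: the point is only that the namenode holds metadata (which blocks make up a file, and which datanodes hold a replica of each block), while the datanodes hold the actual bytes.

```java
import java.util.*;

// Toy model of the namenode's bookkeeping (illustrative names, not HDFS code).
public class NamenodeSketch {
    static final int REPLICATION = 3; // HDFS's default replication factor

    // file name -> ordered list of block IDs
    final Map<String, List<String>> fileToBlocks = new HashMap<>();
    // block ID -> datanodes holding a replica of that block
    final Map<String, List<String>> blockToDatanodes = new HashMap<>();

    // Record a new file split into numBlocks blocks, placing each block's
    // replicas on round-robin-chosen datanodes.
    void addFile(String name, int numBlocks, List<String> datanodes) {
        List<String> blocks = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) {
            String blockId = name + "#blk" + b;
            blocks.add(blockId);
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add(datanodes.get((b + r) % datanodes.size()));
            }
            blockToDatanodes.put(blockId, replicas);
        }
        fileToBlocks.put(name, blocks);
    }

    public static void main(String[] args) {
        NamenodeSketch nn = new NamenodeSketch();
        nn.addFile("/logs/day1", 2, List.of("dn1", "dn2", "dn3", "dn4"));
        System.out.println(nn.fileToBlocks.get("/logs/day1"));
        System.out.println(nn.blockToDatanodes.get("/logs/day1#blk0")); // [dn1, dn2, dn3]
    }
}
```

If a datanode fails, a real namenode notices the missing heartbeats and instructs other datanodes to re-replicate the affected blocks until the replication factor is restored.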
MAPREDUCE
• MapReduce is a parallel programming model devised for writing distributed
applications that process large amounts of data efficiently
• It works in a reliable, fault-tolerant manner; MapReduce programs run on
Hadoop
• It consists of a JobTracker (the master, which schedules jobs) and
TaskTrackers (slaves, which execute the tasks on the cluster nodes)
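The model above can be sketched in a few lines of plain Java. This is a toy, single-JVM illustration of the map, shuffle, and reduce phases using the classic word-count example; a real Hadoop job would instead extend the `Mapper` and `Reducer` classes from the `org.apache.hadoop.mapreduce` package and run across a cluster.

```java
import java.util.*;

// Single-JVM sketch of the MapReduce word-count flow (illustrative only).
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Reduce phase: sum all the counts grouped under one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static Map<String, Integer> run(List<String> lines) {
        // Shuffle: group the mapped pairs by key, as the framework would
        // before handing each key's values to a reducer.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, v) -> result.put(k, reduce(k, v)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data big clusters", "big data")));
        // prints {big=3, clusters=1, data=2}
    }
}
```

The appeal of the model is that map calls are independent of each other, so the framework can run them on whichever nodes hold the input blocks, and rerun any call that fails.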
How Does Hadoop Work?
 Hadoop runs code across a cluster of computers.
 Data is initially organized into directories and files. Files are
divided into uniformly sized blocks of 128 MB or 64 MB
(preferably 128 MB).
 These blocks are then distributed across the cluster
nodes for further processing.
 HDFS, sitting on top of the local file system, supervises
the processing.
 Blocks are replicated to handle hardware failure.
 Hadoop checks that the code was executed successfully.
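The block arithmetic behind the splitting step is simple ceiling division. A minimal sketch, assuming the default 128 MB block size:

```java
// How many HDFS blocks does a file occupy? (Block size is configurable;
// 128 MB is the common default, 64 MB the older one.)
public class BlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB in bytes

    static long blocksFor(long fileSizeBytes) {
        // Ceiling division: a 300 MB file needs 3 blocks (128 + 128 + 44 MB).
        // The final, partial block only occupies the space it actually needs.
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(blocksFor(300L * 1024 * 1024)); // prints 3
    }
}
```

With the default replication factor of 3, that 300 MB file would consume nine block replicas spread across the cluster's datanodes.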
ENVIRONMENT SETUP
• Hadoop is supported on the Linux platform and its flavors
• If you have an OS other than Linux, you can run Linux inside VirtualBox
Pre-installation setup: we need to set up Linux with SSH (Secure Shell), then:
 Create a user
 Install Java
 Download Hadoop
Installing Hadoop
Hadoop Operation Modes: once you have downloaded Hadoop, you can operate
your Hadoop cluster in one of the three supported modes:
• Local/Standalone Mode: after downloading, Hadoop is configured in
standalone mode by default and runs as a single Java process.
• Pseudo-Distributed Mode: a distributed simulation on a single machine.
Each Hadoop daemon (HDFS, MapReduce/YARN, etc.) runs as a separate Java
process. This mode is useful for development.
• Fully Distributed Mode: fully distributed, with a cluster of at least two
machines.
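For pseudo-distributed mode, the two central configuration files typically look like the fragments below. `fs.defaultFS` and `dfs.replication` are standard Hadoop properties; `localhost:9000` is the conventional address for a single-machine namenode, and replication is set to 1 because there is only one datanode.

```xml
<!-- etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```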
Installing Hadoop in standalone mode
• There are no daemons running and everything runs in a single JVM.
• Standalone mode is suitable for running MapReduce programs during
development, since it is easy to test and debug them.
Installing Hadoop in pseudo-distributed mode
Step 1: Setting up Hadoop
Step 2: Hadoop configuration
Verifying the Hadoop installation:
• Step 1: Name node setup
• Step 2: Verifying HDFS
• Step 3: Verifying the YARN script
• Step 4: Accessing Hadoop in the browser
• Step 5: Verifying all applications for the cluster
Hadoop browser
Who uses Hadoop?
Amazon
Facebook
Last.fm
New York Times
Google
IBM
Yahoo
Twitter
LinkedIn
...and the list goes on
Queries
Thank you