Introduction to Hadoop
By Laxmi Edi, M.Tech (Ph.D.)
Topics
What is Big Data?
Limitations of the existing solutions
Solving the problem with Hadoop
Introduction to Hadoop
Hadoop Eco-System
Hadoop Core Components
HDFS Architecture
MapReduce job execution
Anatomy of a File Write and Read
What Is Big Data?
• Lots of data (terabytes or petabytes)
• Big data is a term applied to data sets whose size is beyond the ability
of commonly used software tools to capture, manage, and process the
data within a tolerable elapsed time.
• Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand database
management tools or traditional data processing applications. The
challenges include capture, curation, storage, search, sharing, transfer,
analysis, and visualization.
• Systems and enterprises generate huge amounts of data, from terabytes
up to petabytes of information.
NYSE generates about one terabyte of new trade data per day, which is used to
perform stock trading analytics and determine trends for optimal trades.
Where does Big Data come from?
The next question is where this Big Data originates: what makes up Big Data?
Essentially, the data comes from everywhere, for example:
• sensors used to gather climate information
• posts to social media sites
• digital pictures and videos
• software logs, cameras
• microphones
• scans of government documents
• GPS trails
• purchase transaction records
• cell phone GPS signals
• traffic
• and many more.
All these together constitute Big Data.
Exploding Unstructured Data
Big Data Characteristics
1. Volume: Big Data is defined partly by how gigantic it is. It can amount
to hundreds of terabytes or even petabytes of information. For instance,
15 terabytes of Facebook posts or 400 billion annual medical records
could mean Big Data!
2. Velocity: Velocity is the rate at which data flows into an organization.
Big Data requires fast processing, and the time factor plays a crucial role
in many organizations. For instance, processing 2 million records on the
share market, or evaluating the results of hundreds of thousands of students who applied for
competitive exams, could mean Big Data!
3. Variety: Big Data may not belong to a specific format. It can come in any form,
such as structured or unstructured data, text, images, audio, video, log files, emails,
simulations, 3D models, etc. Research shows that a substantial amount of an
organization’s data is not numeric, yet such data is equally important for the
decision-making process. So organizations need to think beyond stock records,
documents, personnel files, finances, etc.
Question
Map the following to the corresponding data type:
-XML Files
-Word Docs, PDF files, Text files
-E-Mail body
-Data from Enterprise systems (ERP, CRM etc.)
Answer
XML Files -> Semi-structured data
Word Docs, PDF files, Text files -> Unstructured Data
E-Mail body -> Unstructured Data
Data from Enterprise systems (ERP, CRM etc.) -> Structured Data
Big Data Customer Scenarios
Web and e-tailing
Recommendation Engines
Ad Targeting
Search Quality
Abuse and Click Fraud Detection
Telecommunications
Customer Churn Prevention
Network Performance Optimization
Call Data Record (CDR) Analysis
Analyzing Network to Predict Failure
Big Data Customer Scenarios
Fraud Detection And Cyber Security
Welfare schemes
Justice
Healthcare & Life Sciences
Health information exchange
Gene sequencing
Serialization
Healthcare service quality improvements
Drug Safety
Big Data Customer Scenarios
Banks and Financial services
Modeling True Risk
Threat Analysis
Fraud Detection
Trade Surveillance
Credit Scoring And Analysis
Retail
Point-of-Sale Transaction Analysis
Customer Churn Analysis
Sentiment Analysis
Why DFS
What Is Hadoop
Apache Hadoop is a framework that allows for the distributed processing of
large data sets across clusters of commodity computers using a simple
programming model.
It is an open-source data management framework with scale-out storage and
distributed processing.
Why Hadoop?
Key features – Why Hadoop?
1. Flexible
2. Scalable
3. Builds a more efficient data economy
4. Robust ecosystem
5. Hadoop is getting more “real-time”!
6. Cost effective
7. Upcoming technologies are using Hadoop
8. Hadoop is getting cloudy!
Question
Hadoop is a framework that allows for the distributed processing of:
-Small Data Sets
-Large Data Sets
Answer
Large Data Sets. Hadoop can also process small data sets; however, to experience its
true power you need data in the terabyte range, where an RDBMS takes hours or fails
outright while Hadoop does the same job in a couple of minutes.
Hadoop Eco-System
Machine Learning with Mahout
• Mahout is a data mining library.
• It takes the most popular data mining algorithms for clustering, regression and
statistical modeling and implements them using the MapReduce model.
Hadoop Core Components
Hadoop is a system for large-scale data processing.
It has two main components:
HDFS – Hadoop Distributed File System (storage)
Distributed across “nodes”
Natively redundant
NameNode tracks block locations
MapReduce (processing)
Splits a task across processors “near” the data and assembles the results
Self-healing, high bandwidth
Clustered storage
JobTracker manages the TaskTrackers
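To make the MapReduce “simple programming model” concrete, here is a minimal word-count job in Java, a sketch along the lines of the standard Hadoop example; the class names and the input/output paths taken from the command line are illustrative. The mapper runs near the data and emits (word, 1) pairs, and the reducer assembles the partial results into per-word totals.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on each input split "near" the data and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: assembles the partial results and sums the counts per word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}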
Hadoop Core Components (Contd.)
HDFS Architecture
HDFS has a master/slave architecture.
An HDFS cluster consists of a single master node, the NameNode: a master server that manages
the file system namespace and regulates access to files by clients.
In addition, there are a number of slave nodes, the DataNodes, usually one per node in the
cluster, which manage the storage attached to the nodes that they run on.
Main Components Of HDFS
NameNode:
the master of the system
maintains and manages the blocks that are present on the DataNodes
DataNodes:
slaves, deployed on each machine, that provide the actual storage
responsible for serving read and write requests from the clients
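As a small illustration of the NameNode's role as keeper of block metadata, the sketch below (class name and the command-line path are illustrative) asks where each block of an HDFS file is stored; only metadata is returned, no block data moves.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    // args[0] is an HDFS path, e.g. /data/trades.csv (illustrative)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);

    // The NameNode answers this metadata query; the DataNodes are not contacted.
    FileStatus status = fs.getFileStatus(path);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
  }
}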
HDFS - Read Anatomy
1. The client requests the file.
2. The NameNode (NN) checks the permissions and sends back the list of blocks and, for each
block, the list of DataNodes to read from (including the port number to talk to).
3-6. The "DFSClient" class on the client side picks up the first block and requests it from the
first DataNode on the list. It tries twice; if there is no response, it adds that DataNode to a
"dead nodes" list and requests the block from the next DataNode on the list.
7-8. After a successful read of all the blocks, the DFSClient sends the dead-nodes list back to
the NN so it can take action.
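A minimal client-side read, with an illustrative path, looks like the sketch below: FileSystem.open() triggers the block lookup at the NameNode, and the returned stream then pulls each block from a DataNode as described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    // args[0] is an HDFS path, e.g. /user/demo/input.txt (illustrative)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // open() asks the NameNode for the block list; the returned stream
    // (backed by DFSClient) then reads each block from a DataNode.
    try (FSDataInputStream in = fs.open(new Path(args[0]))) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}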
HDFS - Write Anatomy
The client writes directly to the DataNodes. However, each DataNode has to acknowledge receipt
of each block back to the client and the NameNode. Each DataNode also passes the block on to
the next DataNode in the pipeline, which means the client only has to transmit a block to the
first DataNode; the rest of the block movement is handled inside the cluster. Here is the flow
of a file create and write on HDFS.
Create and write of an HDFS file
• Creating and writing a file is more complicated than reading an HDFS file.
• Here too, the NameNode (NN) never writes any data directly to the DataNodes (DN). As per its
role, it only manages the namespace and inodes.
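A minimal client-side write, again with an illustrative path, looks like the sketch below: create() registers the new file with the NameNode, while the bytes written to the stream flow to the first DataNode and on down the replication pipeline.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    // args[0] is the HDFS path to create, e.g. /user/demo/hello.txt (illustrative)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // create() registers the new file with the NameNode (namespace only);
    // the data goes to the first DataNode and is forwarded along the
    // replication pipeline; the NameNode never touches the bytes.
    try (FSDataOutputStream out = fs.create(new Path(args[0]))) {
      out.write("hello from the HDFS client\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}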
THANK YOU
for attending the demo of
www.kerneltraining.com/course-cat/big-data