BalajiRajan
Meetup.com/DevOps-Bangalore

str.balaji@gmail.com / balajirajan.com
Some Numbers......
# 1k => 1000 bytes             # 1kb => 1024 bytes
# 1m => 1000000 bytes          # 1mb => 1024*1024 bytes
# 1g => 1000000000 bytes       # 1gb => 1024*1024*1024 bytes
# 1T => 1000000000000 bytes    # 1Tb => 1024*1024*1024*1024 bytes
# ...and on to petabytes, exabytes, zettabytes, etc.

Max data in memory (RAM): 64GB
Max data per computer (disk): 24TB
Data processed by Google every month: 400PB… in 2007
Average job size: 180GB
Time to read 180GB of data sequentially off a single disk drive: approximately 45
minutes

Data Access Speed is the Bottleneck
We can process data very quickly, but we can only read/write
it very slowly
Solution: parallel reads
– 1 HDD = 75MB/sec
– 1,000 HDDs = 75GB/sec
– Far more acceptable
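A quick sanity check of those figures (a rough sketch, assuming a sustained 75MB/sec per drive and ignoring seek time):
180GB ≈ 184,320MB
184,320MB / 75MB/sec ≈ 2,458 seconds ≈ 41 minutes on one drive
184,320MB / 75,000MB/sec ≈ 2.5 seconds across 1,000 drives in parallel
– Consistent with the "approximately 45 minutes" figure quoted above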

Moving to a Cluster of Machines
* In the late 1990s, Google decided to design its architecture using
clusters of low-cost machines
– Rather than fewer, more powerful machines
* Creating an architecture around low-cost, unreliable hardware
presents a number of challenges

System Requirements
* System should support partial failure
* System should support data recoverability
* System should be consistent
* System should be scalable

Hadoop's Origins
Google created an architecture which answers these (and other)
requirements
Released two White Papers
1. 2003: Description of the Google File System (GFS)
– A method for storing data in a distributed, reliable fashion
2. 2004: Description of distributed MapReduce
– A method for processing data in a parallel fashion

So

Hadoop was based on these White Papers

Hadoop Cluster

HDFS Features
* Operates ‘on top of’ an existing filesystem

* Files are stored as ‘blocks’
– Much larger than for most filesystems
– Default is 64MB
* Provides reliability through replication
– Each block is replicated across multiple DataNodes
– Default replication factor is 3
* Single NameNode daemon stores metadata and co-ordinates access
– Provides simple, centralized management
* Blocks are stored on slave nodes
– Running the DataNode daemon
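To see these defaults on a live cluster, two standard HDFS commands are useful (a sketch; /user/training is just a placeholder path):
hadoop fsck /user/training -files -blocks -locations
– Lists each file, its blocks, and the DataNodes holding every replica
hadoop fs -setrep -w 3 /user/training/data.log
– Sets the replication factor of an existing file to 3 and waits until it is done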

HDFS: Block Diagram
The NameNode
The NameNode stores all metadata
– Information about file locations in HDFS
– Information about file ownership and permissions
– Names of the individual blocks
– Locations of the blocks
Metadata is stored on disk and read when the NameNode
daemon starts up
– Filename is fsimage
When changes to the metadata are required, these are made in
RAM
– Changes are also written to a log file on disk called edits
– Full details later
The NameNode: Memory Allocation
When the NameNode is running, all metadata is held in RAM
for fast response
Each ‘item’ consumes 150-200 bytes of RAM
Items:
– Filename, permissions, etc.
– Block information for each block
The NameNode: Memory Allocation
Why HDFS prefers fewer, larger files:
– Consider 1GB of data, HDFS block size 128MB
– Stored as 1 x 1GB file:
  – Name: 1 item
  – Blocks: 8 x 3 = 24 items
  – Total items: 25
– Stored as 1000 x 1MB files:
  – Names: 1000 items
  – Blocks: 1000 x 3 = 3000 items
  – Total items: 4000
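A rough sketch of what that difference costs in NameNode RAM (assuming ~175 bytes per item, the middle of the range above):
1 x 1GB file: 25 items x ~175 bytes ≈ 4.4KB
1000 x 1MB files: 4,000 items x ~175 bytes ≈ 700KB
– Scaled up, tens of millions of small files can consume many gigabytes of NameNode heap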
The Slave Nodes
Actual contents of the files are stored as blocks on the slave nodes
 Blocks are simply files on the slave nodes’ underlying filesystem
– Named blk_xxxxxxx
– Nothing on the slave node provides information about what
underlying file the block is a part of
– That information is only stored in the NameNode’s metadata
 Each block is stored on multiple different nodes for redundancy
– Default is three replicas
 Each slave node runs a DataNode daemon
– Controls access to the blocks
– Communicates with the NameNode
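On a slave node the block files can be inspected directly (a sketch; the directory depends on dfs.data.dir, /data/1/dfs/dn is only an example):
find /data/1/dfs/dn -name 'blk_*' | head
– Shows the raw block files (blk_<id>) and their checksum files (blk_<id>_<genstamp>.meta)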

The Secondary NameNode
The Secondary NameNode is not a failover NameNode!
– It performs memory-intensive administrative functions for the
NameNode
– NameNode keeps information about files and blocks (the
metadata) in memory
– NameNode writes metadata changes to an editlog
– Secondary NameNode periodically combines a prior
filesystem snapshot and editlog into a new snapshot
– New snapshot is transmitted back to the NameNode
Secondary NameNode should run on a separate machine in a
large installation
– It requires as much RAM as the NameNode
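How often the Secondary NameNode checkpoints is configurable; a sketch of the usual Hadoop 1.x properties in core-site.xml (values shown are only illustrative):
<property>
<name>fs.checkpoint.period</name>
<value>3600</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>/data/1/dfs/snn</value>
</property>
– fs.checkpoint.period is the number of seconds between checkpoints; fs.checkpoint.dir is where the Secondary NameNode keeps its working copy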
Writing Files to HDFS
Anatomy of a File Write
1. Client connects to the NameNode
2. NameNode places an entry for the file in its metadata, returns the block
name and list of DataNodes to the client
3. Client connects to the first DataNode and starts sending data
4. As data is received by the first DataNode, it connects to the second and
starts sending data
5. Second DataNode similarly connects to the third
6. ack packets from the pipeline are sent back to the client
7. Client reports to the NameNode when the block is written
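From a client, this whole pipeline is hidden behind a single command (a sketch; paths are placeholders):
hadoop fs -mkdir /user/training
hadoop fs -put access.log /user/training/
– The client splits access.log into blocks and streams each block through the replication pipeline described above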
Reading Files from HDFS
Anatomy of a File Read
1. Client connects to the NameNode
2. NameNode returns the name and locations of the first few blocks of
the file
– Block locations are returned closest-first
3. Client connects to the first of the DataNodes, and reads the block
4. If the DataNode fails during the read, the client will seamlessly
connect to the next one in the list to read the block
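Again, the mechanics are hidden behind ordinary commands (a sketch; paths are placeholders):
hadoop fs -cat /user/training/access.log | head
hadoop fs -get /user/training/access.log ./access.log
– The NameNode is contacted only for block locations; the data itself is streamed directly from the DataNodes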
The NameNode Is Not A Bottleneck
Note: the data never travels via the NameNode
– For writes
– For reads
– During re-replication
Dealing With Data Corruption
As the DataNode is reading the block, it also calculates the checksum
The ‘live’ checksum is compared to the checksum created when the block
was stored
If they differ, the client reads from the next DataNode in the list
– The NameNode is informed that a corrupted version of the block
has been found
– The NameNode will then re-replicate that block elsewhere
The DataNode verifies the checksums for blocks on a regular basis to
avoid ‘bit rot’
– Default is every three weeks after the block was created
Data Reliability and Recovery
DataNodes send heartbeats to the NameNode
– Every three seconds
After a period without any heartbeats, a DataNode is assumed
to be lost
– NameNode determines which blocks were on the lost node
– NameNode finds other DataNodes with copies of these blocks
– These DataNodes are instructed to copy the
blocks to other nodes
– Three-fold replication is actively maintained
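The NameNode's view of live and dead DataNodes can be checked at any time (a sketch):
hadoop dfsadmin -report
– Shows configured and used capacity, plus the list of live and dead DataNodes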
Hadoop is ‘Rack-aware’
Hadoop understands the concept of ‘rack awareness’
– The idea of where nodes are located, relative to one another
– Helps the JobTracker assign tasks to nodes closest to the data
– Helps the NameNode determine the ‘closest’ replica of a block for a client during reads
– In reality, this should perhaps be described as being ‘switch-aware’
HDFS replicates data blocks on nodes on different racks
– Provides extra data security in case of catastrophic hardware
failure
Rack-awareness is determined by a user-defined script
Rack-aware Script
<property>
<name>topology.script.file.name</name>
<value>/etc/hadoop/topology.sh</value>
</property>
The script maps each server to a rack, typically by looking it up in a data file like this:
============
10.0.0.11 /rack1
10.0.0.12 /rack1
10.0.0.13 /rack1
10.0.0.15 /rack2
10.0.0.16 /rack2
10.0.0.17 /rack2
10.0.0.19 /rack3
10.0.0.20 /rack3
10.0.0.21 /rack3
=============
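A minimal sketch of such a script, assuming the server-to-rack mappings above are kept in /etc/hadoop/topology.data (both file names are examples; use whatever topology.script.file.name points at):
#!/bin/bash
# /etc/hadoop/topology.sh - print one rack name per IP/hostname argument
DATAFILE=/etc/hadoop/topology.data    # lines of "<ip> /rackN", as shown above
DEFAULT_RACK=/default-rack            # returned for hosts not in the file
for host in "$@"; do
  rack=$(awk -v h="$host" '$1 == h { print $2 }' "$DATAFILE")
  echo "${rack:-$DEFAULT_RACK}"
done
– Hadoop calls the script with one or more addresses and expects one rack name per line in reply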
Datacenter
HDFS File Permissions
Files in HDFS have an owner, a group, and permissions
– Very similar to Unix file permissions
HDFS permissions are designed to stop good people doing
foolish things
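The familiar Unix-style commands work against HDFS as well (a sketch; names are placeholders):
hadoop fs -ls /user/training
hadoop fs -chown training:hadoop /user/training/report.txt
hadoop fs -chmod 640 /user/training/report.txt
– Same owner/group/mode model as Unix, but without strong authentication the client-supplied username is simply trusted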
What Is MapReduce?
MapReduce is a method for distributing a task across multiple nodes
Each node processes data stored on that node
Consists of two developer-created phases
– Map
– Reduce
In between Map and Reduce is the shuffle and sort
– Sends data from the Mappers to the Reducers
MapReduce: Basic Concepts
Each Mapper processes a single input split from HDFS
Hadoop passes the developer’s Map code one record at a time
Each record has a key and a value
Intermediate data is written by the Mapper to local disk
During the shuffle and sort phase, all the values associated
with the same intermediate key are transferred to the same
Reducer
– The developer specifies the number of Reducers
Reducer is passed each key and a list of all its values
– Keys are passed in sorted order
Output from the Reducers is written to HDFS
MapReduce: A Simple Example
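The worked example on the original slides is word count, shown there as a diagram. A minimal way to run the same example from the command line (a sketch; the examples jar path varies by distribution):
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount \
    /user/training/shakespeare /user/training/wordcount-out
hadoop fs -cat /user/training/wordcount-out/part-* | head
– Each Mapper emits (word, 1) pairs; the Reducers sum the values for each word and write the totals to HDFS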
Some MapReduce Terminology
* A user runs a client program on a client computer
* The client program submits a job to Hadoop
– The job consists of a mapper, a reducer, and a list of inputs
* The job is sent to the JobTracker
* Each Slave Node runs a process called the TaskTracker
* The JobTracker instructs TaskTrackers to run and monitor tasks
– A Map or Reduce over a piece of data is a single task
* A task attempt is an instance of a task running on a slave node
– Task attempts can fail, in which case they will be restarted (more later)
– There will be at least as many task attempts as there are tasks which need to
be performed
Aside: The Job Submission Process
When a job is submitted, the following happens:
– The client requests and receives a new unique Job ID from the JobTracker (includes
JobTracker start time and a sequence number)
– The client calculates the input splits for the job
– How the input data will be split up between Mappers
– The client turns the job configuration information into an XML file
– The client places the XML file and the job jar into a temporary
directory in HDFS (the Job ID is included in the path)
– The client contacts the JobTracker with the location of the XML
and jar files, and the list of input splits
– The JobTracker takes over the job from this point on

MapReduce: High Level

MapReduce Failure Recovery
Task processes send heartbeats to the TaskTracker
TaskTrackers send heartbeats to the JobTracker
Any task that fails to report in 10 minutes is assumed to have failed
– Its JVM is killed by the TaskTracker
Any task that throws an exception is said to have failed
Failed tasks are reported to the JobTracker by the TaskTracker
The JobTracker reschedules any failed tasks
– It tries to avoid rescheduling the task on the same TaskTracker where it previously
failed
If a task fails four times, the whole job fails

MapReduce Failure Recovery
Any TaskTracker that fails to report in 10 minutes is assumed to have crashed
– All tasks on the node are restarted elsewhere
– Any TaskTracker reporting a high number of failed tasks is
blacklisted, to prevent the node from blocking the entire job
– There is also a ‘global blacklist’, for TaskTrackers which fail on
multiple jobs.

The JobTracker manages the state of each job
– Partial results of failed tasks are ignored

The Apache Hadoop Project
Hadoop is a ‘top-level’ Apache project
– Created and managed under the auspices of the Apache Software Foundation
Several other projects exist that rely on some or all of Hadoop
– Typically either both HDFS and MapReduce, or just HDFS
Ecosystem projects are often also top-level Apache projects
– Some are ‘Apache incubator’ projects
– Some are not managed by the Apache Software Foundation
Ecosystem projects include Hive, Pig, Sqoop, Flume, HBase, Oozie, …

Hive
Hive is a high-level abstraction on top of MapReduce
– Initially created by a team at Facebook
– Avoids having to write Java MapReduce code
– Data in HDFS is queried using a language very similar to SQL
– Known as HiveQL
HiveQL queries are turned into MapReduce jobs by the Hive
interpreter
– ‘Tables’ are just directories of files stored in HDFS
– A Hive Metastore contains information on how to map a file to a table
structure
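A quick sketch of what that looks like in practice (assuming a table named logs already exists in the Metastore):
hive -e "SELECT status, COUNT(*) FROM logs GROUP BY status"
– The Hive interpreter compiles the query into one or more MapReduce jobs and runs them on the cluster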
Planning Your Hadoop Cluster
* What issues to consider when planning your Hadoop cluster
1. What types of hardware are typically used for Hadoop
nodes
2. How to optimally configure your network topology
3. How to select the right operating system and Hadoop
distribution
Cluster Growth Based on Storage Capacity
• Basing your cluster growth on storage capacity is often a
good method to use
• Example:
– Data grows by approximately 1TB per week
– HDFS set up to replicate each block three times
– Therefore, 3TB of extra storage space required per week
– Plus some overhead – say, 30%
– Assuming machines with 4 x 1TB hard drives, this equates to a
new machine required each week
– Alternatively: Two years of data – 100TB – will require
approximately 100 machines
Classifying Nodes
• Nodes can be classified as either ‘slave nodes’ or ‘master
nodes’
• Slave node runs DataNode plus TaskTracker daemons
• Master node runs either a NameNode daemon, a Secondary
NameNode Daemon, or a JobTracker daemon
– On smaller clusters, NameNode and JobTracker are often run
on the same machine
– Sometimes even Secondary NameNode is on the same
machine as the NameNode and JobTracker
– Important that at least one copy of the NameNode’s metadata is
stored on a separate machine (see later)
Slave Nodes: Recommended Configuration
• Typical ‘base’ configuration for a slave Node

– 4 x 1TB or 2TB hard drives, in a JBOD* configuration
– Do not use RAID! (See later)
– 2 x Quad-core CPUs
– 24-32GB RAM
– Gigabit Ethernet
• Multiples of (1 hard drive + 2 cores + 6-8GB RAM) tend to
work well for many types of applications
– Especially those that are I/O bound
Slave Nodes: More Details (CPU)
• Quad-core CPUs are now standard
• Hex-core CPUs are becoming more prevalent

– But are more expensive
• Hyper-threading should be enabled
• Hadoop nodes are seldom CPU-bound
– They are typically disk- and network-I/O bound
– Therefore, top-of-the-range CPUs are usually not necessary
Slave Nodes: More Details (RAM)
• Slave node configuration specifies the maximum number of Map
and Reduce tasks that can run simultaneously on that node
• Each Map or Reduce task will take 1GB to 2GB of RAM
• Slave nodes should not be using virtual memory
• Ensure you have enough RAM to run all tasks, plus overhead
for the DataNode and TaskTracker daemons, plus the operating
system
• Rule of thumb:

Total number of tasks = 1.5 x number of processor cores
-- This is a starting point, and should not be taken as a definitive
setting for all clusters
Slave Nodes: More Details (Disk)
• In general, more spindles (disks) is better
• In practice, we see anywhere from four to 12 disks per node
• Use 3.5" disks

– Faster, cheaper, higher capacity than 2.5" disks
• 7,200 RPM SATA drives are fine
– No need to buy 15,000 RPM drives
• 8 x 1.5TB drives is likely to be better than 6 x 2TB drives
– Different tasks are more likely to be accessing different disks
• A good practical maximum is 24TB per slave node
– More than that will result in massive network traffic if a node dies
and block re-replication must take place
Slave Nodes: Why Not RAID?
• Slave Nodes do not benefit from using RAID* storage

– HDFS provides built-in redundancy by replicating blocks across
multiple nodes
– RAID striping (RAID 0) is actually slower than the JBOD
configuration used by HDFS
– RAID 0 read and write operations are limited by the speed of
the slowest disk in the RAID array
– Disk operations on JBOD are independent, so the average
speed is greater than that of the slowest disk
– One test by Yahoo showed JBOD performing between 10%
and 30% faster than RAID 0, depending on the operations being
performed
What About Virtualization?
• Virtualization is usually not worth considering

– Running multiple virtual nodes per machine hurts performance
– Hadoop runs optimally when it can use all the disks at once

What About Blade Servers?
Blade servers are not recommended
– Failure of a blade chassis results in many nodes being
unavailable
– Individual blades usually have very limited hard disk capacity
– Network interconnection between the chassis and top-of-rack
switch can become a bottleneck
Master Nodes: Single Points of Failure
• Slave nodes are expected to fail at some point

– This is an assumption built into Hadoop
– NameNode will automatically re-replicate blocks that were on
the failed node to other nodes in the cluster, retaining the 3x
replication requirement
– JobTracker will automatically re-assign tasks that were running
on failed nodes
• Master nodes are single points of failure
– If the NameNode goes down, the cluster is inaccessible
– If the JobTracker goes down, no jobs can run on the cluster
– All currently running jobs will fail
• Spend more money on your master nodes!
Master Node Hardware Recommendations
• Carrier-class hardware

– Not commodity hardware
• Dual power supplies
• Dual Ethernet cards
– Bonded to provide failover
• RAIDed hard drives
• At least 32GB of RAM
General Network Considerations
• Hadoop is very bandwidth-intensive!

– Often, all nodes are communicating with each other at the same
time
• Use dedicated switches for your Hadoop cluster
• Nodes are connected to a top-of-rack switch
• Nodes should be connected at a minimum speed of 1Gb/sec
• For clusters where large amounts of intermediate data are
generated, consider 10Gb/sec connections
– Expensive
– Alternative: bond two 1Gb/sec connections to each node
General Network Considerations (cont’d)
• Racks are interconnected via core switches
• Core switches should connect to top-of-rack switches at
10Gb/sec or faster
• Beware of over-subscription in top-of-rack and core
switches
• Consider bonded Ethernet to mitigate against failure
• Consider redundant top-of-rack and core switches
Operating System Recommendations
• Choose an OS you’re comfortable administering
• CentOS: geared towards servers rather than individual
workstations
– Conservative about package versions
– Very widely used in production
• RedHat Enterprise Linux (RHEL): RedHat-supported analog
to CentOS
– Includes support contracts, for a price
• In production, we often see a mixture of RHEL and CentOS
machines
– Often RHEL on master nodes, CentOS on slaves
Configuring The System
• Do not use Linux’s LVM (Logical Volume Manager) to make
all your disks appear as a single volume
– As with RAID 0, this limits speed to that of the slowest disk
• Check the machines’ BIOS* settings
– BIOS settings may not be configured for optimal performance
– For example, if you have SATA drives make sure IDE emulation
is not enabled
• Test disk I/O speed with hdparm -t
– Example:
hdparm -t /dev/sda1
– You should see speeds of 70MB/sec or more
– Anything less is an indication of possible problems
Configuring The System
Hadoop has no specific disk partitioning requirements
– Use whatever partitioning system makes sense to you
Mount disks with the noatime option
Common directory structure for data mount points:
/data/<n>/dfs/nn
/data/<n>/dfs/dn
/data/<n>/dfs/snn
/data/<n>/mapred/local
Reduce the swappiness of the system
– Set vm.swappiness to 0 or 5 in /etc/sysctl.conf
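A sketch of those two settings on a node (device names and mount points are examples only):
# /etc/fstab entry for one data disk, mounted with noatime
/dev/sdb1   /data/1   ext3   defaults,noatime   0 0
# reduce swappiness and apply without a reboot
echo 'vm.swappiness=5' >> /etc/sysctl.conf
sysctl -p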
Filesystem Considerations
• Cloudera recommends the ext3 and ext4 filesystems

– ext4 is now becoming more commonly used
• XFS provides some performance benefit during kickstart
– It formats in 0 seconds, vs several minutes for each disk with
ext3
• XFS has some performance issues
– Slow deletes in some versions
– Some performance improvements are available; see e.g.,
http://everything2.com/index.pl?node_id=1479435
– Some versions had problems when a machine runs out of
memory
Operating System Parameters
• Increase the nofile ulimit for the mapred and hdfs users to
at least 32K
– Setting is in /etc/security/limits.conf
• Disable IPv6
• Disable SELinux
• Install and configure the ntp daemon
– Ensures the time on all nodes is synchronized
– Important for HBase
– Useful when using logs to debug problems
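A sketch of the corresponding settings on a RHEL/CentOS machine (values can be tuned per site):
# /etc/security/limits.conf
hdfs    -    nofile    32768
mapred  -    nofile    32768
# time synchronization
yum install -y ntp
chkconfig ntpd on
service ntpd start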
Java Virtual Machine (JVM) Recommendations
• Always use the official Oracle JDK (http://java.com/)

– Hadoop is complex software, and often exposes bugs in other
JDK implementations
• Version 1.6 is required
– Avoid 1.6.0u18
– This version had significant bugs
• Hadoop is not yet production-tested with Java 7 (1.7)
• Recommendation: don’t upgrade to a new version as soon
as it is released
– Wait until it has been tested for some time
Cloudera Manager
For easy installation
Cloudera has released Cloudera Manager (CM), a tool for easy
deployment and configuration of Hadoop clusters
The free version, Cloudera Manager Free Edition, can manage
up to 50 nodes
– The version supplied with Cloudera Enterprise supports an unlimited
number of nodes
Using Cloudera Manager Free Edition
Typical Configuration Parameters
Hadoop's Configuration Files
Each machine in the Hadoop cluster has its own set of
configuration files
Configuration files all reside in Hadoop’s conf directory
– Typically /etc/hadoop/conf
Primary configuration files are written in XML
Sample Configuration File
Sample configuration file (mapred-site.xml)
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>
Core-site.xml
hdfs-site.xml
The single most important configuration value on your entire cluster, set on
the NameNode:
* Loss of the NameNode’s metadata will result in the effective loss of all the
data on the cluster
– Although the blocks will remain, there is no way of reconstructing the
original files without the metadata
* This must be at least two disks (or a RAID volume) on the NameNode,
plus an NFS mount elsewhere on the network
– Failure to set this correctly will result in eventual loss of your cluster’s
data
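The slide is almost certainly referring to dfs.name.dir, the list of directories where the NameNode writes its fsimage and edits. A sketch with two local disks plus an NFS mount (paths are examples only):
<property>
<name>dfs.name.dir</name>
<value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>
– The NameNode writes its metadata to every directory in the list, so losing one copy is survivable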
Mapred-site.xml
Additional Configuration Files
There are several more configuration files in /etc/hadoop/conf
– hadoop-env.sh: environment variables for Hadoop daemons
– HDFS and MapReduce include/exclude files
* Controls who can connect to the NameNode and JobTracker
– masters, slaves: hostname lists for ssh control
– hadoop-policy.xml: Access control policies
– log4j.properties: logging (covered later in the course)
– fair-scheduler.xml: Scheduler (covered later in the course)
– hadoop-metrics.properties: Monitoring (covered later in
the course)

Environment Setup: hadoop-env.sh
HADOOP_HEAPSIZE
– Controls the heap size for Hadoop daemons
– Default 1GB
– Comment this out, and set the heap for individual daemons
HADOOP_NAMENODE_OPTS
– Java options for the NameNode
– At least 4GB: -Xmx4g
HADOOP_JOBTRACKER_OPTS
– Java options for the JobTracker
– At least 4GB: -Xmx4g
HADOOP_DATANODE_OPTS, HADOOP_TASKTRACKER_OPTS
– Set to 1GB each: -Xmx1g
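A sketch of the corresponding lines in hadoop-env.sh, using the heap sizes recommended above:
# comment out the blanket setting and size each daemon individually
# export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_OPTS="-Xmx4g"
export HADOOP_JOBTRACKER_OPTS="-Xmx4g"
export HADOOP_DATANODE_OPTS="-Xmx1g"
export HADOOP_TASKTRACKER_OPTS="-Xmx1g"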

Host 'include' and 'exclude' Files
Optionally, specify dfs.hosts in hdfs-site.xml to point to a file
listing hosts which are allowed to connect to the NameNode
and act as DataNodes
– Similarly, mapred.hosts points to a file which lists hosts allowed to
connect as TaskTrackers
Both files are optional
– If omitted, any host may connect and act as a DataNode/
TaskTracker
– This is a possible security/data integrity issue
NameNode can be forced to reread the dfs.hosts file with
hadoop dfsadmin -refreshNodes
– There is no such command for the JobTracker, which has to be restarted
to re-read the mapred.hosts file, so many System Administrators only
create a dfs.hosts file
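A sketch of the include-file setup in hdfs-site.xml (the file path is an example):
<property>
<name>dfs.hosts</name>
<value>/etc/hadoop/conf/allowed-datanodes.txt</value>
</property>
After editing the host list, tell the NameNode to re-read it:
hadoop dfsadmin -refreshNodes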
Managing and Scheduling Jobs
Displaying Running Jobs
• To view all jobs running on the cluster, use

# hadoop job -list
Displaying All Jobs
• To display all jobs including completed jobs, use

# hadoop job -list all
Killing a Job
• It is important to note that once a user has submitted a job,
they cannot stop it just by hitting CTRL-C on their terminal
– CTRL-C only stops job output appearing on the user’s console
– The job is still running on the cluster!
Killing a Job
To kill a job use hadoop job -kill <job_id>

Demo!!!
Reference:
1. Cloudera.com
2. Bradhedlund.com

More Related Content

What's hot

Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityEdureka!
 
Apache kafka configuration-guide
Apache kafka configuration-guideApache kafka configuration-guide
Apache kafka configuration-guideChetan Khatri
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answersKalyan Hadoop
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsshrey mehrotra
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoopShashwat Shriparv
 
Secure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With KerberosSecure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With KerberosEdureka!
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterEdureka!
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Edureka!
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduceUday Vakalapudi
 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationAdam Kawa
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxAlex Moundalexis
 

What's hot (20)

Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Hadoop administration
Hadoop administrationHadoop administration
Hadoop administration
 
Hadoop architecture by ajay
Hadoop architecture by ajayHadoop architecture by ajay
Hadoop architecture by ajay
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High Availability
 
Apache kafka configuration-guide
Apache kafka configuration-guideApache kafka configuration-guide
Apache kafka configuration-guide
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoop
 
Secure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With KerberosSecure Hadoop Cluster With Kerberos
Secure Hadoop Cluster With Kerberos
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node Cluster
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
 
Next generation technology
Next generation technologyNext generation technology
Next generation technology
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS Federation
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via Linux
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 

Viewers also liked

Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdfEdureka!
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopCloudera, Inc.
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop AdministrationEdureka!
 
Word count program execution steps in hadoop
Word count program execution steps in hadoopWord count program execution steps in hadoop
Word count program execution steps in hadoopjijukjoseph
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Rohit Agrawal
 
Hadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateHadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateSteve Loughran
 
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 editionHadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 editionSteve Loughran
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Hadoop security @ Philly Hadoop Meetup May 2015
Hadoop security @ Philly Hadoop Meetup May 2015Hadoop security @ Philly Hadoop Meetup May 2015
Hadoop security @ Philly Hadoop Meetup May 2015Shravan (Sean) Pabba
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemMahabubur Rahaman
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReducefvanvollenhoven
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinerySteve Loughran
 
Hadoop Operations: How to Secure and Control Cluster Access
Hadoop Operations: How to Secure and Control Cluster AccessHadoop Operations: How to Secure and Control Cluster Access
Hadoop Operations: How to Secure and Control Cluster AccessCloudera, Inc.
 
Hadoop administration using cloudera student lab guidebook
Hadoop administration using cloudera   student lab guidebookHadoop administration using cloudera   student lab guidebook
Hadoop administration using cloudera student lab guidebookNiranjan Pandey
 

Viewers also liked (19)

Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdf
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache Hadoop
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Word count program execution steps in hadoop
Word count program execution steps in hadoopWord count program execution steps in hadoop
Word count program execution steps in hadoop
 
Taller hadoop
Taller hadoopTaller hadoop
Taller hadoop
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
Amazon Elastic Computing 2
Amazon Elastic Computing 2Amazon Elastic Computing 2
Amazon Elastic Computing 2
 
Hadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateHadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the Gate
 
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 editionHadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hadoop Trends
Hadoop TrendsHadoop Trends
Hadoop Trends
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
Hadoop security @ Philly Hadoop Meetup May 2015
Hadoop security @ Philly Hadoop Meetup May 2015Hadoop security @ Philly Hadoop Meetup May 2015
Hadoop security @ Philly Hadoop Meetup May 2015
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinery
 
Hadoop Operations: How to Secure and Control Cluster Access
Hadoop Operations: How to Secure and Control Cluster AccessHadoop Operations: How to Secure and Control Cluster Access
Hadoop Operations: How to Secure and Control Cluster Access
 
Hadoop administration using cloudera student lab guidebook
Hadoop administration using cloudera   student lab guidebookHadoop administration using cloudera   student lab guidebook
Hadoop administration using cloudera student lab guidebook
 

Similar to Hadoop admin

HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User ReferenceBiju Nair
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
Introduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxIntroduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxsunithachphd
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreducesenthil0809
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hivesrikanthhadoop
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Simplilearn
 
Zing Database – Distributed Key-Value Database
Zing Database – Distributed Key-Value DatabaseZing Database – Distributed Key-Value Database
Zing Database – Distributed Key-Value Databasezingopen
 
Zing Database
Zing Database Zing Database
Zing Database Long Dao
 
Hadoop Distributed File System for Big Data Analytics
Hadoop Distributed File System for Big Data AnalyticsHadoop Distributed File System for Big Data Analytics
Hadoop Distributed File System for Big Data AnalyticsDrPDShebaKeziaMalarc
 
Introduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptxIntroduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptxSakthiVinoth78
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 

Similar to Hadoop admin (20)

Hadoop
HadoopHadoop
Hadoop
 
HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
 
HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User Reference
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Introduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxIntroduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptx
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreduce
 
Hadoop -HDFS.ppt
Hadoop -HDFS.pptHadoop -HDFS.ppt
Hadoop -HDFS.ppt
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hive
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
 
Zing Database – Distributed Key-Value Database
Zing Database – Distributed Key-Value DatabaseZing Database – Distributed Key-Value Database
Zing Database – Distributed Key-Value Database
 
Zing Database
Zing Database Zing Database
Zing Database
 
Hadoop Distributed File System for Big Data Analytics
Hadoop Distributed File System for Big Data AnalyticsHadoop Distributed File System for Big Data Analytics
Hadoop Distributed File System for Big Data Analytics
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Introduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptxIntroduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptx
 
Hdfs
HdfsHdfs
Hdfs
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 

Recently uploaded

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 

Recently uploaded (20)

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 

Hadoop admin

  • 2. Some Numbers...... # 1k => 1000 bytes # 1kb => 1024 bytes # 1m => 1000000 bytes # 1mb => 1024*1024 bytes # 1g => 1000000000 bytes # 1gb => 1024*1024*1024 bytes # 1T => 1000000000000 bytes #Tb =>1024*1024*1024*1024 bytes # 1Petabytes , Exabytes, zettabytes... etc Max data in memory (RAM): 64GB Max data per computer (disk): 24TB Data processed by Google every month: 400PB… in 2007 Average job size: 180GB Time: 180GB of data would take to read sequentially off a single disk drive: approximately 45 minutes 15/12/13
  • 3. Data Access Speed is the Bottleneck We can process data very quickly, but we can only read/write it very slowly Solution: parallel reads – 1 HDD = 75MB/sec – 1,000 HDDs = 75GB/sec – Far more acceptable 15/12/13
  • 4. Moving to a Cluster of Machines * In the late 1990s, Google decided to design its architecture using clusters of low-cost machines – Rather than fewer, more powerful machines * Creating an architecture around low-cost, unreliable hardware presents a number of challenges 15/12/13
  • 5. System Requirements * System should support partial failure * System should support data recoverability * System should be consistent * System should be scalable 15/12/13
  • 6. Hadoop's Origins Google created an architecture which answers these (and other) requirements Released two White Papers 1. 2003: Description of the Google File System (GFS) – A method for storing data in a distributed, reliable fashion 2. 2004: Description of distributed MapReduce – A method for processing data in a parallel fashion 15/12/13
  • 7. So Hadoop was based on these White Papers 15/12/13
  • 9. HDFS Features * Operates ‘on top of’ an existing filesystem * Files are stored as ‘blocks’ – Much larger than for most filesystems – Default is 64MB * Provides reliability through replication – Each block is replicated across multiple DataNodes – Default replication factor is 3 * Single NameNode daemon stores metadata and co-ordinates access – Provides simple, centralized management * Blocks are stored on slave nodes – Running the DataNode daemon 15/12/13
  • 12. The NameNode    The NameNode stores all metadata – Information about file locations in HDFS – Information about file ownership and permissions – Names of the individual blocks – Locations of the blocks Metadata is stored on disk and read when the NameNode daemon starts up – Filename is fsimage When changes to the metadata are required, these are made in RAM – Changes are also written to a log file on disk called edits – Full details later
  • 13. The NameNode: Memory Allocation  When the NameNode is running, all meta data is held in RAM for fast response  Each ‘item’ consumes 150-200 bytes of RAM  Items: – Filename, permissions, etc. – Block information for each block
  • 14. The NameNode: Memory Allocation  Why HDFS prefers fewer, larger files: – Consider 1GB of data, HDFS block size 128MB – Stored as 1 x 1GB file – Name: 1 item – Blocks: 8 x 3 = 24 items – Total items: 25 – Stored as 1000 x 1MB files – Names: 1000 items – Blocks: 1000 x 3 = 3000 items – Total items: 4000
  • 15. The Slave Nodes Actual contents of the files are stored as blocks on the slave nodes  Blocks are simply files on the slave nodes’ underlying filesystem – Named blk_xxxxxxx – Nothing on the slave node provides information about what underlying file the block is a part of – That information is only stored in the NameNode’s metadata  Each block is stored on multiple different nodes for redundancy – Default is three replicas  Each slave node runs a DataNode daemon – Controls access to the blocks – Communicates with the NameNode 
  • 17. The Secondary NameNode:   The Secondary NameNode is not a failover NameNode! – It performs memory-intensive administrative functions for the NameNode – NameNode keeps information about files and blocks (the metadata) in memory – NameNode writes metadata changes to an editlog – Secondary NameNode periodically combines a prior filesystem snapshot and editlog into a new snapshot – New snapshot is transmitted back to the NameNode Secondary NameNode should run on a separate machine in a large installation – It requires as much RAM as the NameNode
  • 19. Anatomy of a File Write 1. Client connects to the NameNode 2. NameNode places an entry for the file in its metadata, returns the block name and list of DataNodes to the client 3. Client connects to the first DataNode and starts sending data 4. As data is received by the first DataNode, it connects to the second and starts sending data 5. Second DataNode similarly connects to the third 6. ack packets from the pipeline are sent back to the client 7. Client reports to the NameNode when the block is written
  • 21. Anatomy of a File Read     Client connects to the NameNode NameNode returns the name and locations of the first few blocks of the file – Block locations are returned closest-first. Client connects to the first of the DataNodes, and reads the block If the DataNode fails during the read, the client will seamlessly connect to the next one in the list to read the block
  • 22. The NameNode Is Not A Bottleneck  Note: the data never travels via the NameNode – For writes – For reads – During re-replication
  • 23. Dealing With Data Corruption  As the DataNode is reading the block, it also calculates the checksum.  ‘Live’ checksum is compared to the checksum created when the block was stored.  If they differ, the client reads from the next DataNode in the list – The NameNode is informed that a corrupted version of the block has been found. – The NameNode will then re-replicate that block elsewhere.  The DataNode verifies the checksums for blocks on a regular basis to avoid ‘bit rot’ – Default is every three weeks after the block was created
  • 24. Data Reliability and Recovery  DataNodes send heartbeats to the NameNode – Every three seconds  After a period without any heartbeats, a DataNode is assumed to be lost – NameNode determines which blocks were on the lost node. – NameNode finds other DataNodes with copies of these blocks. – These DataNodes are instructed to copy the blocks to other nodes. – Three-fold replication is actively maintained.
  • 26. Hadoop is ‘Rack-aware’  Hadoop understands the concept of ‘rack awareness’ – The idea of where nodes are located, relative to one another – Helps the JT to assign tasks to nodes closest to the data – Helps the NN determine the ‘closest’ block to a client during reads – In reality, this should perhaps be described as being ‘switchaware’  HDFS replicates data blocks on nodes on different racks – Provides extra data security in case of catastrophic hardware failure  Rack-awareness is determined by a user-defined script
  • 27. Rack-aware Script <property> <name>topology.script.file.name</name> <value>/etc/hadoop/topology.sh</value> </property> Script create a file which contains a server and rack informaton: ============ 10.0.0.11 /rack1 10.0.0.12 /rack1 10.0.0.13 /rack1 10.0.0.15 /rack2 10.0.0.16 /rack2 10.0.0.17 /rack2 10.0.0.19 /rack3 10.0.0.20 /rack3 10.0.0.21 /rack3 =============
  • 29. HDFS File Permissions   Files in HDFS have an owner, a group, and permissions – Very similar to Unix file permissions HDFS permissions are designed to stop good people doing foolish things
  • 30. What Is MapReduce?     MapReduce is a method for distributing a task across multiple nodes Each node processes data stored on that node Consists of two developer-created phases – Map – Reduce In between Map and Reduce is the shuffle and sort – Sends data from the Mappers to the Reducers
  • 32. MapReduce: Basic Concepts        Each Mapper processes a single input split from HDFS Hadoop passes the developer’s Map code one record at a time Each record has a key and a value Intermediate data is written by the Mapper to local disk During the shuffle and sort phase, all the values associated with the same intermediate key are transferred to the same Reducer – The developer specifies the number of Reducers Reducer is passed each key and a list of all its values – Keys are passed in sorted order Output from the Reducers is written to HDFS
  • 34. MapReduce: A Simple Example 15/12/13
  • 35. MapReduce: A Simple Example 15/12/13
  • 36. Some MapReduce Terminology * A user runs a client program on a client computer * The client program submits a job to Hadoop – The job consists of a mapper, a reducer, and a list of inputs * The job is sent to the JobTracker * Each Slave Node runs a process called the TaskTracker * The JobTracker instructs TaskTrackers to run and monitor tasks – A Map or Reduce over a piece of data is a single task * A task attempt is an instance of a task running on a slave node – Task attempts can fail, in which case they will be restarted (more later) – There will be at least as many task attempts as there are tasks which need be performed 15/12/13 to
• 37. Aside: The Job Submission Process
When a job is submitted, the following happens:
– The client requests and receives a new, unique Job ID from the JobTracker (made up of the JobTracker's start time and a sequence number)
– The client calculates the input splits for the job
– How the input data will be divided among the Mappers
– The client turns the job configuration information into an XML file
– The client places the XML file and the job jar into a temporary directory in HDFS (the Job ID is included in the path)
– The client contacts the JobTracker with the location of the XML and jar files, and the list of input splits
– The JobTracker takes over the job from this point on
15/12/13
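From the user's point of view, all of the above happens behind a single command; a sketch, in which the jar name, driver class and HDFS paths are hypothetical:

# Submit a job from a client machine; the client library handles splits, the XML
# config file, staging to HDFS, and contacting the JobTracker
hadoop jar analytics.jar com.example.LogSummary /data/input/logs /data/output/summary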
• 39. MapReduce Failure Recovery
* Task processes send heartbeats to the TaskTracker
* TaskTrackers send heartbeats to the JobTracker
* Any task that fails to report for 10 minutes is assumed to have failed
– Its JVM is killed by the TaskTracker
* Any task that throws an exception is said to have failed
* Failed tasks are reported to the JobTracker by the TaskTracker
* The JobTracker reschedules any failed tasks
– It tries to avoid rescheduling the task on the same TaskTracker where it previously failed
* If a task fails four times, the whole job fails
15/12/13
• 40. MapReduce Failure Recovery
* Any TaskTracker that fails to report for 10 minutes is assumed to have crashed
– All task attempts on the node are restarted elsewhere
– Any TaskTracker reporting a high number of failed tasks is blacklisted, to prevent the node from blocking the entire job
– There is also a 'global blacklist' for TaskTrackers which fail on multiple jobs
* The JobTracker manages the state of each job
– Partial results of failed tasks are ignored
15/12/13
• 41. The Apache Hadoop Project
* Hadoop is a 'top-level' Apache project
– Created and managed under the auspices of the Apache Software Foundation
* Several other projects exist that rely on some or all of Hadoop
– Typically either both HDFS and MapReduce, or just HDFS
* Ecosystem projects are often also top-level Apache projects
– Some are 'Apache Incubator' projects
– Some are not managed by the Apache Software Foundation
* Ecosystem projects include Hive, Pig, Sqoop, Flume, HBase, Oozie, ...
15/12/13
• 42. Hive
* Hive is a high-level abstraction on top of MapReduce
– Initially created by a team at Facebook
– Avoids having to write Java MapReduce code
– Data in HDFS is queried using a language very similar to SQL, known as HiveQL
* HiveQL queries are turned into MapReduce jobs by the Hive interpreter
– 'Tables' are just directories of files stored in HDFS
– A Hive Metastore contains information on how to map a file to a table structure
15/12/13
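A hedged illustration of running HiveQL from the shell; the weblogs table and its columns are hypothetical. The Hive interpreter turns the query into one or more MapReduce jobs.

# Run a HiveQL query non-interactively from the command line
hive -e "SELECT status_code, COUNT(*) AS hits
         FROM weblogs
         GROUP BY status_code
         ORDER BY hits DESC;"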
• 43. Planning Your Hadoop Cluster
* Issues to consider when planning your Hadoop cluster:
1. What types of hardware are typically used for Hadoop nodes
2. How to optimally configure your network topology
3. How to select the right operating system and Hadoop distribution
• 44. Cluster Growth Based on Storage Capacity
* Basing your cluster growth on storage capacity is often a good method to use
* Example:
– Data grows by approximately 1TB per week
– HDFS is set up to replicate each block three times
– Therefore, 3TB of extra storage space is required per week
– Plus some overhead – say, 30%
– Assuming machines with 4 x 1TB hard drives, this equates to a new machine being required each week
– Alternatively: two years of data – 100TB – will require approximately 100 machines
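The same back-of-the-envelope arithmetic with the slide's assumptions made explicit (1TB/week of new data, 3x replication, roughly 30% overhead, 4TB of disk per node); the numbers are placeholders to adapt to your own growth rate.

raw_tb_per_week=1
replication=3
overhead_percent=30
node_capacity_tb=4

# ~1 * 3 * 1.3 = ~3.9TB of raw storage needed per week
needed_tb=$(awk -v r="$raw_tb_per_week" -v rep="$replication" -v o="$overhead_percent" \
            'BEGIN { printf "%.1f", r * rep * (1 + o/100) }')
echo "Approx. ${needed_tb} TB of new storage per week -> about one ${node_capacity_tb} TB node per week"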
• 45. Classifying Nodes
* Nodes can be classified as either 'slave nodes' or 'master nodes'
* A slave node runs the DataNode plus TaskTracker daemons
* A master node runs either a NameNode daemon, a Secondary NameNode daemon, or a JobTracker daemon
– On smaller clusters, the NameNode and JobTracker are often run on the same machine
– Sometimes even the Secondary NameNode is on the same machine as the NameNode and JobTracker
– It is important that at least one copy of the NameNode's metadata is stored on a separate machine (see later)
• 46. Slave Nodes: Recommended Configuration
* Typical 'base' configuration for a slave node
– 4 x 1TB or 2TB hard drives, in a JBOD ('Just a Bunch Of Disks') configuration
– Do not use RAID! (See later)
– 2 x quad-core CPUs
– 24-32GB RAM
– Gigabit Ethernet
* Multiples of (1 hard drive + 2 cores + 6-8GB RAM) tend to work well for many types of application
– Especially those that are I/O-bound
• 47. Slave Nodes: More Details (CPU)
* Quad-core CPUs are now standard
* Hex-core CPUs are becoming more prevalent
– But are more expensive
* Hyper-threading should be enabled
* Hadoop nodes are seldom CPU-bound
– They are typically disk- and network-I/O bound
– Therefore, top-of-the-range CPUs are usually not necessary
• 48. Slave Nodes: More Details (RAM)
* The slave node configuration specifies the maximum number of Map and Reduce tasks that can run simultaneously on that node
* Each Map or Reduce task will take 1GB to 2GB of RAM
* Slave nodes should not be using virtual memory
* Ensure you have enough RAM to run all tasks, plus overhead for the DataNode and TaskTracker daemons, plus the operating system
* Rule of thumb: total number of tasks = 1.5 x number of processor cores
– This is a starting point, and should not be taken as a definitive setting for all clusters
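A configuration sketch, assuming an 8-core slave node and the 1.5x rule of thumb (about 12 slots, split here as 8 map + 4 reduce); the property names are the MRv1 ones used elsewhere in this deck, and the snippet printed below belongs inside <configuration> in mapred-site.xml.

# Print the snippet to paste into mapred-site.xml (values are illustrative)
cat <<'EOF'
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
EOF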
• 49. Slave Nodes: More Details (Disk)
* In general, more spindles (disks) is better
* In practice, we see anywhere from four to 12 disks per node
* Use 3.5" disks
– Faster, cheaper, higher capacity than 2.5" disks
* 7,200 RPM SATA drives are fine
– No need to buy 15,000 RPM drives
* 8 x 1.5TB drives is likely to be better than 6 x 2TB drives
– Different tasks are more likely to be accessing different disks
* A good practical maximum is 24TB per slave node
– More than that will result in massive network traffic if a node dies and block re-replication must take place
• 50. Slave Nodes: Why Not RAID?
* Slave nodes do not benefit from using RAID storage
– HDFS provides built-in redundancy by replicating blocks across multiple nodes
– RAID striping (RAID 0) is actually slower than the JBOD configuration used by HDFS
– RAID 0 read and write operations are limited by the speed of the slowest disk in the array
– Disk operations on JBOD are independent, so the average speed is greater than that of the slowest disk
– One test by Yahoo showed JBOD performing between 10% and 30% faster than RAID 0, depending on the operations being performed
• 51. What About Virtualization?
* Virtualization is usually not worth considering
– Multiple virtual nodes per machine hurt performance
– Hadoop runs optimally when it can use all the disks at once
What About Blade Servers?
* Blade servers are not recommended
– Failure of a blade chassis results in many nodes being unavailable
– Individual blades usually have very limited hard disk capacity
– The network interconnect between the chassis and the top-of-rack switch can become a bottleneck
• 52. Master Nodes: Single Points of Failure
* Slave nodes are expected to fail at some point
– This is an assumption built into Hadoop
– The NameNode will automatically re-replicate blocks that were on the failed node to other nodes in the cluster, maintaining the 3x replication requirement
– The JobTracker will automatically re-assign tasks that were running on failed nodes
* Master nodes are single points of failure
– If the NameNode goes down, the cluster is inaccessible
– If the JobTracker goes down, no jobs can run on the cluster, and all currently running jobs will fail
* Spend more money on your master nodes!
• 53. Master Node Hardware Recommendations
* Carrier-class hardware
– Not commodity hardware
* Dual power supplies
* Dual Ethernet cards
– Bonded to provide failover
* RAIDed hard drives
* At least 32GB of RAM
• 54. General Network Considerations
* Hadoop is very bandwidth-intensive!
– Often, all nodes are communicating with each other at the same time
* Use dedicated switches for your Hadoop cluster
* Nodes are connected to a top-of-rack switch
* Nodes should be connected at a minimum speed of 1Gb/sec
* For clusters where large amounts of intermediate data are generated, consider 10Gb/sec connections
– Expensive
– Alternative: bond two 1Gb/sec connections to each node
• 55. General Network Considerations (cont’d)
* Racks are interconnected via core switches
* Core switches should connect to top-of-rack switches at 10Gb/sec or faster
* Beware of over-subscription in top-of-rack and core switches
* Consider bonded Ethernet to mitigate against failure
* Consider redundant top-of-rack and core switches
• 56. Operating System Recommendations
* Choose an OS you're comfortable administering
* CentOS: geared towards servers rather than individual workstations
– Conservative about package versions
– Very widely used in production
* Red Hat Enterprise Linux (RHEL): the Red Hat-supported analog to CentOS
– Includes support contracts, for a price
* In production, we often see a mixture of RHEL and CentOS machines
– Often RHEL on master nodes, CentOS on slaves
• 57. Configuring The System
* Do not use Linux's LVM (Logical Volume Manager) to make all your disks appear as a single volume
– As with RAID 0, this limits speed to that of the slowest disk
* Check the machines' BIOS settings
– BIOS settings may not be configured for optimal performance
– For example, if you have SATA drives, make sure IDE emulation is not enabled
* Test disk I/O speed with hdparm -t
– Example: hdparm -t /dev/sda1
– You should see speeds of 70MB/sec or more; anything less is an indication of possible problems
• 58. Configuring The System
* Hadoop has no specific disk partitioning requirements
– Use whatever partitioning scheme makes sense to you
* Mount disks with the noatime option
* Common directory structure for data mount points:
/data/<n>/dfs/nn
/data/<n>/dfs/dn
/data/<n>/dfs/snn
/data/<n>/mapred/local
* Reduce the swappiness of the system
– Set vm.swappiness to 0 or 5 in /etc/sysctl.conf
15/12/13
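A sketch of those two settings; the device name and mount point are examples only.

# /etc/fstab entry mounting a data disk with noatime (illustrative):
#   /dev/sdb1   /data/1   ext4   defaults,noatime   0 0

# Reduce swappiness immediately, and make the change persistent across reboots:
sysctl -w vm.swappiness=0
echo "vm.swappiness = 0" >> /etc/sysctl.conf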
• 59. Filesystem Considerations
* Cloudera recommends the ext3 and ext4 filesystems
– ext4 is now becoming more commonly used
* XFS provides some performance benefit during kickstart
– It formats in 0 seconds, versus several minutes for each disk with ext3
* XFS has some performance issues
– Slow deletes in some versions
– Some performance improvements are available; see e.g. http://everything2.com/index.pl?node_id=1479435
– Some versions had problems when a machine runs out of memory
• 60. Operating System Parameters
* Increase the nofile ulimit for the mapred and hdfs users to at least 32K
– The setting is in /etc/security/limits.conf
* Disable IPv6
* Disable SELinux
* Install and configure the ntp daemon
– Ensures the time on all nodes is synchronized
– Important for HBase
– Useful when using logs to debug problems
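A sketch of these settings on a CentOS/RHEL machine of that era; adjust user names and values to your environment.

# /etc/security/limits.conf -- raise the open-file limit for the Hadoop users:
#   hdfs    -  nofile  32768
#   mapred  -  nofile  32768

# Disable SELinux now and at the next boot:
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

# Install and enable ntp so clocks stay synchronized:
yum install -y ntp
chkconfig ntpd on
service ntpd start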
• 61. Java Virtual Machine (JVM) Recommendations
* Always use the official Oracle JDK (http://java.com/)
– Hadoop is complex software, and often exposes bugs in other JDK implementations
* Version 1.6 is required
– Avoid 1.6.0u18, which had significant bugs
* Hadoop is not yet production-tested with Java 7 (1.7)
* Recommendation: don't upgrade to a new version as soon as it is released
– Wait until it has been tested for some time
• 63. For easy installation
* Cloudera has released Cloudera Manager (CM), a tool for easy deployment and configuration of Hadoop clusters
* The free version, Cloudera Manager Free Edition, can manage up to 50 nodes
– The version supplied with Cloudera Enterprise supports an unlimited number of nodes
  • 64. Using Cloudera Manager Free Edition
• 66. Hadoop's Configuration Files
* Each machine in the Hadoop cluster has its own set of configuration files
* Configuration files all reside in Hadoop's conf directory
– Typically /etc/hadoop/conf
* The primary configuration files are written in XML
• 67. Sample Configuration File
Sample configuration file (mapred-site.xml):
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
• 69. hdfs-site.xml
* The single most important configuration value on your entire cluster is set on the NameNode: where the NameNode's metadata is written (a hedged example is sketched below)
* Loss of the NameNode's metadata will result in the effective loss of all the data on the cluster
– Although the blocks will remain, there is no way of reconstructing the original files without the metadata
* The metadata must be written to at least two disks (or a RAID volume) on the NameNode, plus an NFS mount elsewhere on the network
– Failure to set this correctly will result in the eventual loss of your cluster's data
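The property the slide refers to is most likely dfs.name.dir (dfs.namenode.name.dir in later Hadoop releases); the paths below are illustrative only. The value is a comma-separated list covering at least two local disks plus an NFS mount.

# Print the snippet to place inside <configuration> in hdfs-site.xml on the NameNode
cat <<'EOF'
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>
EOF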
• 71. Additional Configuration Files
There are several more configuration files in /etc/hadoop/conf:
– hadoop-env.sh: environment variables for the Hadoop daemons
– HDFS and MapReduce include/exclude files: control which hosts may connect to the NameNode and JobTracker
– masters, slaves: hostname lists used for ssh control
– hadoop-policy.xml: access control policies
– log4j.properties: logging (covered later in the course)
– fair-scheduler.xml: scheduler configuration (covered later in the course)
– hadoop-metrics.properties: monitoring (covered later in the course)
• 72. Environment Setup: hadoop-env.sh
* HADOOP_HEAPSIZE
– Controls the heap size for the Hadoop daemons; default 1GB
– Comment this out, and set the heap for individual daemons instead
* HADOOP_NAMENODE_OPTS
– Java options for the NameNode
– At least 4GB: -Xmx4g
* HADOOP_JOBTRACKER_OPTS
– Java options for the JobTracker
– At least 4GB: -Xmx4g
* HADOOP_DATANODE_OPTS, HADOOP_TASKTRACKER_OPTS
– Set to 1GB each: -Xmx1g
15/12/13
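A sketch of the corresponding lines in /etc/hadoop/conf/hadoop-env.sh, using the heap sizes recommended above; tune these for your own cluster.

# Leave HADOOP_HEAPSIZE commented out and size each daemon individually
# export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_OPTS="-Xmx4g"
export HADOOP_JOBTRACKER_OPTS="-Xmx4g"
export HADOOP_DATANODE_OPTS="-Xmx1g"
export HADOOP_TASKTRACKER_OPTS="-Xmx1g"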
• 73. Host 'include' and 'exclude' Files
* Optionally, specify dfs.hosts in hdfs-site.xml to point to a file listing hosts which are allowed to connect to the NameNode and act as DataNodes
– Similarly, mapred.hosts points to a file which lists hosts allowed to connect as TaskTrackers
* Both files are optional
– If omitted, any host may connect and act as a DataNode/TaskTracker
– This is a possible security/data integrity issue
* The NameNode can be forced to re-read the dfs.hosts file with hadoop dfsadmin -refreshNodes
– There is no such command for the JobTracker, which has to be restarted to re-read the mapred.hosts file, so many system administrators only create a dfs.hosts file
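A sketch of restricting which hosts may act as DataNodes; the file location and hostname are examples only.

# In hdfs-site.xml on the NameNode (placed inside <configuration>):
#   <property>
#     <name>dfs.hosts</name>
#     <value>/etc/hadoop/conf/allowed_hosts</value>
#   </property>

echo "datanode01.example.com" >> /etc/hadoop/conf/allowed_hosts
hadoop dfsadmin -refreshNodes    # force the NameNode to re-read the file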
• 75. Displaying Running Jobs
* To view all jobs currently running on the cluster, use:
# hadoop job -list
  • 76. Displaying All Jobs • To display all jobs including completed jobs, use # hadoop job -list all
• 77. Killing a Job
* It is important to note that once a user has submitted a job, they cannot stop it just by hitting Ctrl-C on their terminal
– That only stops job output from appearing on the user's console
– The job is still running on the cluster!
• 78. Killing a Job
* To kill a job, use:
hadoop job -kill <job_id>
15/12/13
  • 81. ???