1. Search for a CentOS 6.4 x86_64 with updates instance in the AWS Marketplace:
https://aws.amazon.com/marketplace (the same search from the AWS Instances link doesn't return the
same results).
2.–5. (AWS console steps for launching the instance; screenshots not included.)
6. Login as root:
[dc@localhost Downloads]$ ssh -i bigtop2.pem root@ec2-50-19-133-93.compute-1.amazonaws.com
The authenticity of host 'ec2-50-19-133-93.compute-1.amazonaws.com (50.19.133.93)' can't be
established.
RSA key fingerprint is ee:90:19:6d:67:44:1e:a8:85:0d:7f:03:35:21:42:8c.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-50-19-133-93.compute-1.amazonaws.com,50.19.133.93' (RSA) to
the list of known hosts.
[root@ip-10-147-220-207 ~]#
Create a user and password:
[root@ip-10-147-220-207 ~]# useradd dc
[root@ip-10-147-220-207 ~]# passwd dc
Changing password for user dc.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
Install an editor as root
>yum install nano
Edit sudoers file:
## Allow root to run any commands anywhere
root ALL=(ALL) ALL
dc ALL=(ALL) ALL
Add the last line shown above, substituting your own user name for dc.
Change to the user and go to the user home directory.
[root@ip-10-147-220-207 ~]# su dc
[dc@ip-10-147-220-207 root]$ cd
[dc@ip-10-147-220-207 ~]$ pwd
/home/dc
[dc@ip-10-147-220-207 ~]$
7. Apache Bigtop is the project where all the Hadoop components are combined and packaged together;
Hortonworks and Cloudera use it as a starting point to build their distributions.
Follow the instructions on the Apache Bigtop install page here:
https://cwiki.apache.org/confluence/display/BIGTOP/How+to+install+Hadoop+distribution+from+Bigtop+0.6.0
Replace 0.6.0 with 0.7.0 and select the CentOS 6 repo in the bigtop repo install command:
sudo wget -O /etc/yum.repos.d/bigtop.repo http://www.apache.org/dist/bigtop/bigtop-0.7.0/repos/centos6/bigtop.repo
This copies the bigtop.repo file into /etc/yum.repos.d/bigtop.repo. Then run:
>sudo yum update
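For reference, a yum repo file of this kind has the following general shape. This is an illustrative sketch only — the actual field values come from the bigtop.repo file downloaded from apache.org above:

```ini
# Illustrative shape of a yum .repo file (not the literal Bigtop file)
[bigtop]
name=Bigtop
enabled=1
gpgcheck=1
baseurl=http://www.apache.org/dist/bigtop/bigtop-0.7.0/repos/centos6/
```

`yum update` then refreshes the package metadata so the Bigtop packages become installable.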
[dc@ip-10-147-220-207 ~]$ sudo yum install hadoop*
Install a JDK and set JAVA_HOME. Old versions of Hadoop required JDK 1.6.x; there are patches
you can apply to upgrade to JDK 1.7.x.
>sudo wget https://s3.amazonaws.com/victormongo/jdk-6u25-linux-x64.bin.1
[dc@ip-10-147-220-207 ~]$ ls
jdk-6u25-linux-x64.bin.1
[dc@ip-10-147-220-207 ~]$ sudo chmod 777 jdk-6u25-linux-x64.bin.1
[dc@ip-10-147-220-207 ~]$ ls
jdk-6u25-linux-x64.bin.1
[dc@ip-10-147-220-207 ~]$
run the executable: ./jdk-6u25-linux-x64.bin.1
[dc@ip-10-147-220-207 ~]$ sudo mkdir /usr/java
[dc@ip-10-147-220-207 ~]$ sudo mv jdk1.6.0_25/ /usr/java
[sudo] password for dc:
[dc@ip-10-147-220-207 ~]$ cd /usr/java
[dc@ip-10-147-220-207 java]$ ls
jdk1.6.0_25
[dc@ip-10-147-220-207 java]$ sudo ln -s /usr/java/jdk1.6.0_25/ /usr/java/latest
[dc@ip-10-147-220-207 java]$ ls -al
total 12
drwxr-xr-x. 3 root root 4096 Mar 18 19:04 .
drwxr-xr-x. 14 root root 4096 Mar 18 18:58 ..
drwxr-xr-x. 10 dc dc 4096 Mar 18 19:03 jdk1.6.0_25
lrwxrwxrwx. 1 root root 22 Mar 18 19:04 latest -> /usr/java/jdk1.6.0_25/
[dc@ip-10-147-220-207 java]$
[dc@ip-10-147-220-207 java]$ sudo ln -s /usr/java/latest /usr/java/default
[dc@ip-10-147-220-207 java]$ ls -al
total 12
drwxr-xr-x. 3 root root 4096 Mar 18 19:05 .
drwxr-xr-x. 14 root root 4096 Mar 18 18:58 ..
lrwxrwxrwx. 1 root root 16 Mar 18 19:05 default -> /usr/java/latest
drwxr-xr-x. 10 dc dc 4096 Mar 18 19:03 jdk1.6.0_25
lrwxrwxrwx. 1 root root 22 Mar 18 19:04 latest -> /usr/java/jdk1.6.0_25/
[dc@ip-10-147-220-207 java]$
Edit .bashrc and add JAVA_HOME:
GNU nano 2.0.9 File: .bashrc Modified
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# User specific aliases and functions
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
[dc@ip-10-147-220-207 ~]$ source .bashrc
Verify Java is installed:
[dc@ip-10-147-220-207 ~]$ java -version
java version "1.6.0_25"
Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)
[dc@ip-10-147-220-207 ~]$
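A small sanity check can confirm the JAVA_HOME symlink chain actually ends at an executable java binary. This is a sketch, not part of the Bigtop instructions; the default path matches the /usr/java/default symlink created above:

```shell
# Check that a JDK home contains an executable java binary.
check_java_home() {
  if [ -x "$1/bin/java" ]; then
    echo "OK"
  else
    echo "MISSING"
  fi
}

# On the instance built above this should print OK.
check_java_home "${JAVA_HOME:-/usr/java/default}"
```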
Install Hadoop:
>sudo yum install hadoop*
Format the namenode:
sudo /etc/init.d/hadoop-hdfs-namenode init
Start the daemons, the namenode and datanodes in pseudo-distributed mode.
[dc@ip-10-147-220-207 ~]$ for i in hadoop-hdfs-namenode hadoop-hdfs-datanode ; do sudo service $i start ; done
Starting Hadoop namenode: [ OK ]
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-ip-10-147-220-207.out
Starting Hadoop datanode: [ OK ]
starting datanode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-147-220-207.out
Now create the file directories for the Hadoop components.
Execute the following commands:
hadoop fs -ls -R /
sudo -u hdfs hadoop fs -mkdir -p /user/$USER
[dc@ip-10-147-220-207 ~]$ hadoop fs -ls -R /
drwxr-xr-x - hdfs supergroup 0 2014-03-18 19:55 /user
drwxr-xr-x - hdfs supergroup 0 2014-03-18 19:55 /user/dc
sudo -u hdfs hadoop fs -chown $USER:$USER /user/$USER
[dc@ip-10-147-220-207 ~]$ sudo -u hdfs hadoop fs -chown $USER:$USER /user/$USER
[dc@ip-10-147-220-207 ~]$ hadoop fs -ls -R /
drwxr-xr-x - hdfs supergroup 0 2014-03-18 19:55 /user
drwxr-xr-x - dc dc 0 2014-03-18 19:55 /user/dc
[dc@ip-10-147-220-207 ~]$
sudo -u hdfs hadoop fs -chmod 770 /user/$USER
[dc@ip-10-147-220-207 ~]$ hadoop fs -ls -R /
drwxr-xr-x - hdfs supergroup 0 2014-03-18 19:55 /user
drwxrwx--- - dc dc 0 2014-03-18 19:55 /user/dc
[dc@ip-10-147-220-207 ~]$
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -mkdir -p /user/history
sudo -u hdfs hadoop fs -chown mapred:mapred /user/history
sudo -u hdfs hadoop fs -chmod 770 /user/history
sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging
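The per-directory commands above follow a mkdir/chown/chmod pattern, so they can be collapsed into a small helper. This is a sketch, not part of the Bigtop instructions; it dry-runs by default (printing the hadoop fs commands), and on the cluster you would first set HDFS="sudo -u hdfs hadoop fs":

```shell
# Dry-run by default: with HDFS unset, the hadoop fs commands are
# printed rather than executed. On the cluster, first run:
#   HDFS="sudo -u hdfs hadoop fs"
HDFS="${HDFS:-echo hadoop fs}"

# hdfs_dir PATH OWNER MODE  (pass "" to skip a field).
# $HDFS is deliberately unquoted so it word-splits into a command.
# chmod is applied with -R; on freshly created directories this is
# equivalent to the non-recursive commands above.
hdfs_dir() {
  $HDFS -mkdir -p "$1"
  if [ -n "$2" ]; then $HDFS -chown "$2" "$1"; fi
  if [ -n "$3" ]; then $HDFS -chmod -R "$3" "$1"; fi
}

hdfs_dir /tmp                     ""            1777
hdfs_dir /var/log/hadoop-yarn     yarn:mapred   ""
hdfs_dir /user/history            mapred:mapred 770
hdfs_dir /tmp/hadoop-yarn/staging ""            1777
hdfs_dir /tmp/hadoop-yarn/staging/history/done_intermediate "" 1777
$HDFS -chown -R mapred:mapred /tmp/hadoop-yarn/staging
```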
There are more subdirectories to create for the Hadoop components. Run the init-hdfs.sh script:
sudo /usr/lib/hadoop/libexec/init-hdfs.sh
Start the YARN daemons
[dc@ip-10-147-220-207 ~]$ sudo service hadoop-yarn-resourcemanager start
Starting Hadoop resourcemanager: [ OK ]
starting resourcemanager, logging to /var/log/hadoop-yarn/yarn-yarn-resourcemanager-ip-10-147-220-207.out
[dc@ip-10-147-220-207 ~]$ sudo service hadoop-yarn-nodemanager start
Starting Hadoop nodemanager: [ OK ]
starting nodemanager, logging to /var/log/hadoop-yarn/yarn-yarn-nodemanager-ip-10-147-220-207.out
[dc@ip-10-147-220-207 ~]$
Run the example pi job to verify MapReduce works:
[dc@ip-10-147-220-207 ~]$ sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 1000
Number of Maps = 10
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
14/03/18 20:03:04 INFO service.AbstractService:
Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/03/18 20:03:04 INFO service.AbstractService:
Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
14/03/18 20:03:04 INFO input.FileInputFormat: Total input paths to process : 10
14/03/18 20:03:04 INFO mapreduce.JobSubmitter: number of splits:10
14/03/18 20:03:04 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/03/18 20:03:04 WARN conf.Configuration: mapred.map.tasks.speculative.execution is deprecated.
Instead, use mapreduce.map.speculative
14/03/18 20:03:04 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use
mapreduce.job.reduces
14/03/18 20:03:04 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use
mapreduce.job.output.value.class
14/03/18 20:03:04 WARN conf.Configuration: mapred.reduce.tasks.speculative.execution is
deprecated. Instead, use mapreduce.reduce.speculative
14/03/18 20:03:04 WARN conf.Configuration: mapreduce.map.class is deprecated. Instead, use
mapreduce.job.map.class
14/03/18 20:03:04 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use
mapreduce.job.name
14/03/18 20:03:04 WARN conf.Configuration: mapreduce.reduce.class is deprecated. Instead, use
mapreduce.job.reduce.class
14/03/18 20:03:04 WARN conf.Configuration: mapreduce.inputformat.class is deprecated. Instead, use
mapreduce.job.inputformat.class
14/03/18 20:03:04 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use
mapreduce.input.fileinputformat.inputdir
14/03/18 20:03:04 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use
mapreduce.output.fileoutputformat.outputdir
14/03/18 20:03:04 WARN conf.Configuration: mapreduce.outputformat.class is deprecated. Instead,
use mapreduce.job.outputformat.class
14/03/18 20:03:04 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use
mapreduce.job.maps
14/03/18 20:03:04 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use
mapreduce.job.output.key.class
14/03/18 20:03:04 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use
mapreduce.job.working.dir
14/03/18 20:03:05 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1395172880059_0001
14/03/18 20:03:05 INFO client.YarnClientImpl: Submitted application
application_1395172880059_0001 to ResourceManager at /0.0.0.0:8032
14/03/18 20:03:05 INFO mapreduce.Job: The url to track the job: http://ip-10-147-220-207:8088/proxy/application_1395172880059_0001/
14/03/18 20:03:05 INFO mapreduce.Job: Running job: job_1395172880059_0001
14/03/18 20:04:41 INFO mapreduce.Job: Job job_1395172880059_0001 completed successfully
14/03/18 20:04:41 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=226
FILE: Number of bytes written=814354
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2610
HDFS: Number of bytes written=215
HDFS: Number of read operations=43
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=10
Launched reduce tasks=1
Rack-local map tasks=10
Total time spent by all maps in occupied slots (ms)=393538
Total time spent by all reduces in occupied slots (ms)=36734
Map-Reduce Framework
Map input records=10
Map output records=20
Map output bytes=180
Map output materialized bytes=280
Input split bytes=1430
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=280
Reduce input records=20
Reduce output records=0
Spilled Records=40
Shuffled Maps =10
Failed Shuffles=0
Merged Map outputs=10
GC time elapsed (ms)=6221
CPU time spent (ms)=7220
Physical memory (bytes) snapshot=2499035136
Virtual memory (bytes) snapshot=6820196352
Total committed heap usage (bytes)=1643577344
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1180
File Output Format Counters
Bytes Written=97
Job Finished in 97.444 seconds
Estimated value of Pi is 3.14080000000000000000
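For intuition, the pi example is a Monte Carlo estimate: each mapper samples points in the unit square and counts how many land inside the quarter circle, so the in-circle fraction approaches pi/4. (Hadoop's PiEstimator uses a quasi-random Halton sequence rather than plain rand(); the awk sketch below shows the plain-random version of the same idea and is not the Hadoop code.)

```shell
# Plain Monte Carlo estimate of pi: 4 * (points inside the quarter
# circle) / (total points). Accuracy improves slowly with n.
awk 'BEGIN {
  srand(1); n = 100000; inside = 0
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x * x + y * y <= 1) inside++
  }
  printf "%.2f\n", 4 * inside / n
}'
```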
The single node is now running both HDFS and MR/YARN. Start 2 other nodes and verify they work
in pseudo-distributed mode. Make sure the 2 instances launch in the same region/availability zone.
Create users as before (do not install as root), and install Hadoop using the yum repo procedure listed above.
Stopping and starting AWS instances:
Start/stop the instances from the AWS console.
After a restart: ssh in, su $USERNAME, cd.
To view the Namenode UI, disable SELinux and turn off iptables.
1) Change SELINUX=enforcing to SELINUX=disabled in /etc/selinux/config and reboot.
2) Turn off iptables using >sudo service iptables stop
To prevent the iptables service from starting at boot use >sudo chkconfig iptables off
Verify by using >sudo chkconfig --list iptables
[dc@ip-10-85-91-254 ~]$ sudo chkconfig --list iptables
iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
Normally the iptables runlevels look like:
[root@ip-10-85-31-193 ~]# sudo chkconfig --list iptables
iptables 0:off 1:off 2:on 3:on 4:on 5:on 6:off
[root@ip-10-85-31-193 ~]#
Converting to Distributed mode from Pseudo-distributed
Set up networking and ssh. Make sure you can ping each server and ssh from each node into the others.
Label the instances nn, dn1, dn2, dn3. We are going to set up a cluster with a namenode (nn) and 3 datanodes.
Note where each service is started so you know which server is running which service.
ifconfig lists the private IP addresses (which match the ip-… private DNS names):
[dc@ip-10-85-91-254 ~]$ ifconfig
eth0 Link encap:Ethernet HWaddr 12:31:3D:15:4C:10
inet addr:10.85.91.254 Bcast:10.85.91.255 Mask:255.255.254.0
inet6 addr: fe80::1031:3dff:fe15:4c10/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2897 errors:0 dropped:0 overruns:0 frame:0
TX packets:2377 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:245482 (239.7 KiB) TX bytes:260672 (254.5 KiB)
Interrupt:247
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:48416 errors:0 dropped:0 overruns:0 frame:0
TX packets:48416 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:6738928 (6.4 MiB) TX bytes:6738928 (6.4 MiB)
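When building /etc/hosts across several nodes, the address can be pulled out of the ifconfig output mechanically. A sketch (the sed pattern assumes the net-tools "inet addr:" format shown above; ip_of is a hypothetical helper, not a system command):

```shell
# Extract the IPv4 address from a net-tools style "inet addr:" line.
ip_of() {
  echo "$1" | sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p'
}

sample="          inet addr:10.85.91.254  Bcast:10.85.91.255  Mask:255.255.254.0"
ip_of "$sample"    # on the node itself: ip_of "$(ifconfig eth0)"
```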
Add host names that correspond to the labels just created.
The /etc/hosts files on all 3 instances should be the same:
[dc@ip-10-62-122-8 ~]$ cat /etc/hosts
10.85.91.254 dn1
10.85.31.193 nn
10.62.122.8 dn3
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
Ping the other nodes from each node to verify the setup is correct:
ping nn; ping dn1; ping dn3
Set the replication factor to 3.
Start the datanode service on each datanode.
Start the namenode service on the namenode.
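The replication factor is the dfs.replication property in hdfs-site.xml on each node. A minimal sketch of the fragment, assuming the Bigtop packages put the active configuration under /etc/hadoop/conf:

```xml
<!-- /etc/hadoop/conf/hdfs-site.xml (fragment) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```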
Verify 3 nodes are up using the Namenode UI (http://nn:50070).
Verify blocks replicated:
[dc@ip-10-85-31-193 hadoop-hdfs]$ sudo -u hdfs hadoop fsck /testdir/testfile -files -blocks
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Connecting to namenode via http://nn:50070
FSCK started by hdfs (auth:SIMPLE) from /10.85.31.193 for path /testdir/testfile at Wed Mar 19
20:48:46 UTC 2014
/testdir/testfile 126 bytes, 1 block(s): OK
0. BP-1370670758-10.85.31.193-1395261360244:blk_2735064026059661308_1002 len=126 repl=3
Status: HEALTHY
Total size: 126 B
Total dirs: 0
Total files: 1
Total blocks (validated): 1 (avg. block size 126 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Wed Mar 19 20:48:46 UTC 2014 in 1 milliseconds
The filesystem under path '/testdir/testfile' is HEALTHY
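The repl=3 field on the block line above is what confirms three replicas. Scripting that check is straightforward; a sketch matching the fsck output format shown (repl_count is a hypothetical helper, not an hdfs command):

```shell
# Pull the replica count out of an fsck block line of the form
# "... blk_NNN len=126 repl=3".
repl_count() {
  echo "$1" | sed -n 's/.*repl=\([0-9]*\).*/\1/p'
}

line="0. BP-1370670758-10.85.31.193-1395261360244:blk_2735064026059661308_1002 len=126 repl=3"
repl_count "$line"
```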
[dc@ip-10-85-31-193 hadoop-hdfs]$ ls
hadoop-hdfs-datanode-ip-10-85-31-193.log
hadoop-hdfs-datanode-ip-10-85-31-193.out
hadoop-hdfs-namenode-ip-10-85-31-193.log
hadoop-hdfs-namenode-ip-10-85-31-193.out