Big Data Hadoop Hands On Workshop on Amazon EMR

Big Data using Hadoop
On Amazon Elastic MapReduce
Hands On Workshop
Dr.Thanachart Numnonda
thanachart@imcinstitute.com
Danairat T.
Certified Java Programmer, TOGAF – Silver
danairat@gmail.com, +66-81-559-1446

Danairat T., danairat@gmail.com; Thanachart Numnonda, thanachart@imcinstitute.com. Aug 2013. Big Data Hadoop on Amazon EMR – Hands On Workshop
Lecture: Big Data Development Process

Big Data Development Process Guideline
Architecture Planning
• Targeted Users
• Target Opportunities
• Data Scientist
• Data Source/Type
• Data Capturing Approach
• Data Processing and Visualize Planning
• Technology Architecture
• Big Data Ecosystem (Hadoop Ecosystem)
• Sizing
• Integration
• Security
• Administration and Operation Planning
Big Data Development
• Develop Use Cases
• Set up Big Data Pseudo-distribution Mode
• Set up HDFS
• Develop Data Capturing System
• Develop Data Analytic
• Map Reduce
• Hive
• R
• Etc.
• Integrate result to Enterprise Analytic System
• Set up Big Data Cluster Mode
Operation and Support
• Monitor HDFS utilization and capacity planning
• Monitor Job Tracker availability
• Monitor Data Capturing System
• Upgrade or Patch Big Data Hadoop ecosystem
• System admin. Training
• Helpdesk Training
• End-User Training (Analytic Results)
System Evaluation
• Adoption Rates for each analytics results
• No. of Missing Analytic Results
• No. of Missing Data
• Lost hours per month
• Avg. of each Analytic Result Response Time
• No. of Technology System Failure per month

Hands-On: Running Hadoop on Local Mode

Hadoop Installation
Hadoop provides three installation choices:
● Local mode: an unzip-and-run mode that gets you started right away; all parts of Hadoop run within the same JVM.
● Pseudo-distributed mode: the different parts of Hadoop run as separate Java processes, but within a single machine.
● Distributed mode: the real setup, which spans multiple machines.

Installing Hadoop and Ecosystem
1. Installing VirtualBox or VMware Player
2. Running Image File
3. Start Hadoop
4. Hadoop Web Console
5. Stop Hadoop
Notes:
Hadoop and IPv6: Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Source: http://wiki.apache.org/hadoop/HadoopIPv6

Hadoop's Ecosystem in the VM
[Figure: the Hadoop components installed in the VM: Pig, Sqoop, HBase and Hive, together with MapReduce (Job Scheduling/Execution System) and HDFS (Hadoop Distributed File System)]

Starting Hadoop
[hdadmin@localhost hadoop]$ /usr/local/hadoop/bin/start-all.sh
This starts a NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker on your machine.
[hdadmin@localhost hadoop]$ /usr/lib/jvm/jdk1.6.0_39/bin/jps
11567 Jps
10766 NameNode
11099 JobTracker
11221 TaskTracker
10899 DataNode
11018 SecondaryNameNode
[hdadmin@localhost hadoop]$
Check the Java processes with jps; you are now running Hadoop in pseudo-distributed mode.

Hadoop is up!

Stopping Hadoop
[hdadmin@localhost hadoop]$ /usr/local/hadoop/bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Hands-On: Importing Data to HDFS using Hadoop Command Line

Importing Data to Hadoop
Creating new file in /tmp
$ vi /tmp/input_test.txt
GNOME Terminal is a terminal emulation application that you can use to
perform the following tasks:
Access a UNIX shell in the GNOME environment
A shell is a program that interprets and executes the commands that you
type at a command line prompt. When you start GNOME Terminal, the
application starts the default shell that is specified in your system
account. You can switch to a different shell at any time.
The text above is only a sample; type your own data into the file.
$ hadoop dfs -mkdir /input
$ hadoop dfs -mkdir /output
$ hadoop dfs -copyFromLocal /tmp/input_test.txt /input

Hands-On: Reviewing, Retrieving, Deleting Data from HDFS

Review file in Hadoop HDFS
List HDFS File
[hdadmin@localhost bin]$ hadoop dfs -ls /input
Found 1 items
-rw-r--r-- 1 hdadmin supergroup 1016 2013-03-13 20:11 /input/input_test.txt
Read HDFS File
[hdadmin@localhost bin]$ hadoop dfs -cat /input/input_test.txt
Retrieve HDFS File to Local File System
[hdadmin@localhost bin]$ hadoop dfs -copyToLocal /input/input_test.txt /tmp/file.txt
Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html

Review file in Hadoop HDFS using WebUI
Open http://localhost:50070/ in a browser and scroll down.

Hadoop Port Numbers
       Daemon              Default Port   Configuration Parameter in conf/*-site.xml
HDFS   Namenode            50070          dfs.http.address
       Datanodes           50075          dfs.datanode.http.address
       Secondarynamenode   50090          dfs.secondary.http.address
MR     JobTracker          50030          mapred.job.tracker.http.address
       Tasktrackers        50060          mapred.task.tracker.http.address

Review Content from System shell
[hdadmin@localhost current]$ cd /app/hadoop/tmp/dfs/data/current
[hdadmin@localhost current]$ ls -l
total 24
-rw-r--r--. 1 hdadmin hadoop 1016 Mar 13 20:11 blk_1997667773574667398
-rw-r--r--. 1 hdadmin hadoop 15 Mar 13 20:11 blk_1997667773574667398_1005.meta
-rw-r--r--. 1 hdadmin hadoop 4 Mar 13 20:04 blk_-6735227193197163844
-rw-r--r--. 1 hdadmin hadoop 11 Mar 13 20:04 blk_-6735227193197163844_1004.meta
-rw-r--r--. 1 hdadmin hadoop 482 Mar 13 20:18 dncp_block_verification.log.curr
-rw-r--r--. 1 hdadmin hadoop 154 Mar 13 20:03 VERSION
[hdadmin@localhost current]$ more blk_1997667773574667398
GNOME Terminal is a terminal emulation application that you can use to perform the following tasks:
Access a UNIX shell in the GNOME environment
A shell is a program that interprets and executes the commands that you type at a command line prompt. When you start GNOME Terminal, the application starts the default shell that is specified in your system account. You can switch to a different shell at any time.
[hdadmin@localhost current]$

Removing data from HDFS using Shell Command
[hdadmin@localhost detach]$ hadoop dfs -rm /input/input_test.txt
Deleted hdfs://localhost:54310/input/input_test.txt
[hdadmin@localhost detach]$

Hands-On: Running Hadoop on Amazon Elastic MapReduce

Architecture Overview of Amazon EMR

Creating an AWS account

Signing up for the necessary services
● Simple Storage Service (S3)
● Elastic Compute Cloud (EC2)
● Elastic MapReduce (EMR)
Caution! This costs real money!

Creating Amazon S3 bucket

Create access key using Security Credentials in the AWS Management Console

Creating a new Job Flow in EMR

View Result from the S3 bucket

Lecture: Understanding Map Reduce Processing
[Figure: a Client, the Name Node / Job Tracker, and three worker nodes that each run a Data Node and a Task Tracker for Map Reduce]

MapReduce Framework
map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)

MapReduce Processing – The Data Flow
1. InputFormat, InputSplits, RecordReader
2. Mapper - your focus is here
3. Partition, Shuffle & Sort
4. Reducer - your focus is here
5. OutputFormat, RecordWriter
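The five stages above can be sketched in plain Java without Hadoop. This is only a teaching sketch: the class name DataFlowSketch and its methods are made up for illustration, the whole data set lives in memory, and a TreeMap stands in for the shuffle and sort that Hadoop performs across machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical teaching class: simulates the MapReduce data flow in memory, without Hadoop.
final class DataFlowSketch {

    // Stage 2 (Mapper): one input line -> list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<String, Integer>(tokens.nextToken(), 1));
        }
        return out;
    }

    // Stage 3 (Shuffle & Sort): group values by key; the TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<Integer>()).add(pair.getValue());
        }
        return grouped;
    }

    // Stage 4 (Reducer): (word, [1, 1, ...]) -> sum of the values.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stages 1 and 5 are reduced to "a list of lines in, a sorted map out".
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Deer Bear River", "Car Car River", "Deer Car Bear");
        System.out.println(run(lines)); // prints {Bear=2, Car=3, Deer=2, River=2}
    }
}
```

Only map() and reduce() correspond to code you write in a real job; everything else is done by the framework.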

How does the MapReduce work?
[Figure: InputSplit -> RecordReader -> map, with output in a list of (Key, Value) -> combining, sorting and partitioning -> output in a list of (Key, List of Values) in the intermediate file -> reduce -> RecordWriter. Example values shown: combined map output (Car, 2); reducer input Bear, {1,1}; Car, {2,1}; River, {1,1}; Deer, {1,1}]
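The combining step in the figure can be sketched as local pre-aggregation on the map side. The class and method names below are hypothetical, and the input string "Car Car River" is an assumed example split chosen to reproduce the (Car, 2) value in the figure.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical sketch: map plus combine for a single input split, without Hadoop.
final class CombinerSketch {

    // Without a combiner the mapper would emit (Car, 1), (Car, 1), (River, 1).
    // With a combiner the pairs are summed locally before they leave the node.
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> local = new LinkedHashMap<>();
        StringTokenizer tokens = new StringTokenizer(split);
        while (tokens.hasMoreTokens()) {
            local.merge(tokens.nextToken(), 1, Integer::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        System.out.println(mapAndCombine("Car Car River")); // prints {Car=2, River=1}
    }
}
```

In the old org.apache.hadoop.mapred API a combiner is registered on the JobConf with setCombinerClass. For word count the reducer class can serve as the combiner, because summing gives the same result whether it is done in one step or in two.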

InputFormat
InputFormat               Description                                        Key                                        Value
TextInputFormat           Default format; reads lines of text files          The byte offset of the line                The line contents
KeyValueTextInputFormat   Parses lines into key, val pairs                   Everything up to the first tab character   The remainder of the line
SequenceFileInputFormat   A Hadoop-specific high-performance binary format   user-defined                               user-defined

InputSplit
An InputSplit describes a unit of work that comprises a single map task.
InputSplit presents a byte-oriented view of the input.
You can control the split size by setting the mapred.min.split.size parameter in core-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job.
RecordReader
RecordReader reads <key, value> pairs from an InputSplit. Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, into a record-oriented view and presents it to the Mapper.
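A minimal sketch of what "byte-oriented in, record-oriented out" means for plain text input. The names are hypothetical and this is not Hadoop's RecordReader code; it assumes the split starts and ends on line boundaries, and that a line without a tab becomes a key with an empty value.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of record reading over an in-memory split.
final class LineRecordSketch {

    // Text-style records: key = byte offset of the line, value = the line contents.
    static Map<Long, String> readRecords(byte[] split) {
        Map<Long, String> records = new LinkedHashMap<>();
        int start = 0;
        for (int i = 0; i <= split.length; i++) {
            boolean atEnd = (i == split.length);
            if (atEnd || split[i] == '\n') {
                // Skip only the empty remainder after a trailing newline.
                if (i > start || !atEnd) {
                    records.put((long) start, new String(split, start, i - start, StandardCharsets.UTF_8));
                }
                start = i + 1;
            }
        }
        return records;
    }

    // Key/value-style records: key = text before the first tab, value = the remainder.
    static String[] splitAtFirstTab(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] {line, ""}; // assumption: no tab means an empty value
        }
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        byte[] split = "ab\ncd".getBytes(StandardCharsets.UTF_8);
        System.out.println(readRecords(split)); // prints {0=ab, 3=cd}
        String[] kv = splitAtFirstTab("66\tThailand");
        System.out.println(kv[0] + " -> " + kv[1]); // prints 66 -> Thailand
    }
}
```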

Mapper
The Mapper applies user-defined logic to each input (key, value) pair and emits (key, value) pair(s), which are forwarded to the Reducers.
Partition, Shuffle & Sort
After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
The Partitioner controls which reduce task each map output is assigned to. The total number of partitions is the same as the number of reduce tasks for the job.
The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
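A sketch of how a key can be mapped to a partition. It uses the same arithmetic as Hadoop's default HashPartitioner (non-negative hash code modulo the number of reduce tasks), but on java.lang.String keys; real jobs hash Writable keys such as Text, so the partition numbers printed here are illustrative only. The class name is hypothetical.

```java
// Hypothetical sketch of hash partitioning on String keys.
final class PartitionSketch {

    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 2;
        for (String key : new String[] {"Bear", "Car", "Deer", "River"}) {
            System.out.println(key + " -> reducer " + partitionFor(key, reducers));
        }
    }
}
```

Because the partition depends only on the key, every (key, value) pair with the same key reaches the same reducer, no matter which map task produced it.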

Reducer
The Reducer is an instance of user-provided code that reads each key, together with an iterator over its values, from the partition assigned to it. The OutputCollector object in the Reducer phase has a method named collect() which collects a (key, value) output.
OutputFormat, RecordWriter
OutputFormat governs the format of the output collected by the OutputCollector, and RecordWriter writes the output into HDFS.
OutputFormat               Description
TextOutputFormat           Default; writes lines in "key \t value" form
SequenceFileOutputFormat   Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat           Generates no output files

Hands-On: Writing Your Own MapReduce Program

Wordcount (Hello World in Hadoop)
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Hands-On: Packaging Map Reduce and Deploying to Hadoop Runtime Environment

Packaging Map Reduce Program
Usage
Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version
installed, compile WordCount.java and create a jar:
$ mkdir /home/hduser/wordcount_classes
$ cd /home/hduser
$ javac -classpath /usr/local/hadoop/hadoop-core-0.20.205.0.jar -d wordcount_classes WordCount.java
$ jar -cvf ./wordcount.jar -C wordcount_classes/ .
$ hadoop jar ./wordcount.jar org.myorg.WordCount /input/* /output/wordcount_output_dir
Output:
…….
$ hadoop dfs -cat /output/wordcount_output_dir/part-00000

Reviewing MapReduce Output Result
Scroll down the web page

Hands-On: Running WordCount.jar on Amazon EMR

Upload .jar file and input file to Amazon S3
1. Select <yourbucket> in Amazon S3 service
2. Create folder : applications
3. Upload wordcount.jar to the applications folder
4. Create another folder: input
5. Upload input_test.txt to the input folder

Create a new Job Flow in EMR

Input JAR Location and Arguments

View the Result

Lecture: Understanding Hive

Introduction
A Petabyte Scale Data Warehouse Using Hadoop
Hive was originally developed by Facebook. It is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL.

What Hive is NOT
Hive is not designed for online transaction processing and
does not offer real-time queries and row level updates. It is
best used for batch jobs over large sets of immutable data
(like web logs, etc.).

System Architecture and Components
• Metastore: stores the metadata.
• Query compiler and execution engine: converts SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop.
• SerDe and ObjectInspectors: programmable interfaces and implementations of common data formats and types.
  A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record and translates it into a Java object that Hive can manipulate. The Serializer, conversely, takes a Java object that Hive has been working with and turns it into something that Hive can write to HDFS or another supported system.
• UDF and UDAF: programmable interfaces and implementations for user-defined functions (scalar and aggregate functions).
• Clients: a command line client similar to the MySQL command line.
hive.apache.org
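The two directions of a SerDe can be sketched for a comma-delimited row with two columns (id, country). This is not Hive's SerDe interface: the class, the Row type and both method names are hypothetical, and the sketch assumes well-formed rows that contain at least one comma.

```java
// Hypothetical sketch, not Hive's SerDe interface: one (id, country) row of a comma-delimited table.
final class DelimitedSerDeSketch {

    // Stands in for the "Java object that Hive can manipulate".
    static final class Row {
        final int id;
        final String country;

        Row(int id, String country) {
            this.id = id;
            this.country = country;
        }
    }

    // Deserializer direction: stored text -> Java object.
    static Row deserialize(String line) {
        String[] fields = line.split(",", 2); // split at the first comma only
        return new Row(Integer.parseInt(fields[0].trim()), fields[1]);
    }

    // Serializer direction: Java object -> text that can be written back to storage.
    static String serialize(Row row) {
        return row.id + "," + row.country;
    }

    public static void main(String[] args) {
        Row row = deserialize("66,Thailand");
        System.out.println(row.id + " / " + row.country); // prints 66 / Thailand
        System.out.println(serialize(row));               // prints 66,Thailand
    }
}
```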

Architecture Overview
[Figure: Hive architecture. Components shown: Hive CLI, Mgmt. WebUI and Thrift API (queries, browsing, DDL); Hive QL parser, planner and execution; MetaStore; SerDe (Thrift, Jute, JSON, ...); Map Reduce and HDFS]

Sample HiveQL
The query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following queries:
SELECT * FROM t WHERE t.c = 'xyz'
SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)
SELECT t1.c1, count(1) FROM t1 GROUP BY t1.c1

Running Hive
Hive Shell
● Interactive
  hive
● Script
  hive -f myscript
● Inline
  hive -e 'SELECT * FROM mytable'

Hands-On: Creating Table and Retrieving Data using Hive

Hive Hands-On Labs
1. Creating Hive Table
2. Reviewing Hive Table in HDFS
3. Alter and Drop Hive Table
4. Loading Data to Hive Table
5. Querying Data from Hive Table
6. Reviewing Hive Table Content from HDFS Command and WebUI
7. Insert Overwriting the Hive Table

Starting Hive
Start the Hive CLI
$ hive
Logging initialized using configuration in file:/usr/local/hive-0.9.0-bin/conf/hive-log4j.properties
Hive history file=/tmp/hdadmin/hive_job_log_hdadmin_201303171635_1944738265.txt
hive>
Quit from Hive
hive> quit;

1. Creating Hive Table
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 4.069 seconds
hive (default)> show tables;
OK
test_tbl
hive (default)> describe test_tbl;
OK
id int
country string
hive (default)>
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html

2. Reviewing Hive Table in HDFS
[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x - hdadmin supergroup 0 2013-03-17 17:51 /user/hive/warehouse/test_tbl
[hdadmin@localhost hdadmin]$
Review Hive Table from
HDFS WebUI

3. Alter and Drop Hive Table
hive (default)> alter table test_tbl add columns (remarks STRING);
hive (default)> describe test_tbl;
OK
id int
country string
remarks string
hive (default)> drop table test_tbl;
OK
See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html

CREATE EXTERNAL TABLE weblog_entries (
ip STRING, dash1 STRING, dash2 STRING,
date STRING,status1 STRING, getstr STRING,
link STRING,http STRING,
Status STRING,
size INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
LOCATION '/data/';
Save the statement as weblog_create_external_table.hql and run it:
hive -f weblog_create_external_table.hql
See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html

4. Loading Data to Hive Table
Creating Hive table
$ hive
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Loading data to Hive table
hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data.csv' INTO TABLE test_tbl;
Copying data from file:/tmp/test_tbl_data.csv
Copying file: file:/tmp/test_tbl_data.csv
Loading data to table default.test_tbl
OK
hive (default)>

5. Querying Data from Hive Table
hive (default)> select * from test_tbl;
OK
1 USA
62 Indonesia
63 Philippines
65 Singapore
66 Thailand
hive (default)>

hive (default)> select country from test_tbl;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201303171733_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201303171733_0001
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201303171733_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-03-17 18:13:19,097 Stage-1 map = 0%, reduce = 0%
2013-03-17 18:13:25,151 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec
MapReduce Total cumulative CPU time: 250 msec
Ended Job = job_201303171733_0001
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 0.25 sec HDFS Read: 282 HDFS Write: 45 SUCCESS
Total MapReduce CPU Time Spent: 250 msec
OK
USA
Indonesia
Philippines
Singapore
Thailand

6. Reviewing Hive Table Content from HDFS Command and WebUI
[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl
Found 1 items
-rw-r--r-- 1 hdadmin supergroup 59 2013-03-17 18:08 /user/hive/warehouse/test_tbl/test_tbl_data.csv
[hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data.csv
1,USA
62,Indonesia
63,Philippines
65,Singapore
66,Thailand

7. Insert Overwriting the Hive Table
hive (default)> LOAD DATA LOCAL INPATH
'/tmp/test_tbl_data_updated.csv' overwrite INTO TABLE test_tbl;
Copying data from file:/tmp/test_tbl_data_updated.csv
Copying file: file:/tmp/test_tbl_data_updated.csv
Loading data to table default.test_tbl
Deleted hdfs://localhost:54310/user/hive/warehouse/test_tbl
OK
hive (default)>

Review Hive Table Created in HDFS and WebUI
[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl
Found 1 items
-rw-r--r-- 1 hdadmin supergroup 3510 2013-03-17 18:25 /user/hive/warehouse/test_tbl/test_tbl_data_updated.csv
[hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data_updated.csv
93,Afghanistan
355,Albania
213,Algeria
1684,AmericanSamoa
376,Andorra
244,Angola
1264,Anguilla
672,Antarctica
1268,AntiguaandBarbuda
54,Argentina
374,Armenia
297,Aruba
61,Australia
43,Austria
994,Azerbaijan
1242,Bahamas
973,Bahrain

Hands-On: Install the Amazon EMR Command Line Interface

Installing Amazon EMR CLI
1. Install Ruby
2. Download the Amazon EMR CLI
3. Install the Amazon EMR CLI
4. Create your credentials file (credentials.json)
5. Create an Amazon EC2 key pair
6. Configure your SSH credentials
7. Verify installation of the Amazon EMR CLI
Instructions: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-install.html

Example: Credentials file
{
"access_id": "AKI..........................A",
"private_key": "SaJHI4wjyK.............UWDaYOw2el",
"keypair": "imckey",
"key-pair-file": "~/elastic-mapreduce-cli/imckey.pem",
"log_uri": "s3n://imcbucket/",
"region": "us-west-2"
}

Running Amazon EMR CLI
THANACHARTs-MacBook-Air:~ THANACHART$ cd elastic-mapreduce-cli/
THANACHARTs-MacBook-Air:elastic-mapreduce-cli THANACHART$
THANACHARTs-MacBook-Air:elastic-mapreduce-ruby THANACHART$ ./elastic-mapreduce --list
j-2JW8QBWXIYNV8 TERMINATED ec2-54-213-112-102.us-west-2.compute.amazonaws.com HBase CLI
   COMPLETED Start HBase
j-1JNA9G1O7ET2G TERMINATED ec2-54-213-112-74.us-west-2.compute.amazonaws.com Hive Interactive2
   COMPLETED Setup Hive
j-1H7NX8OGFNFRW TERMINATED ec2-54-213-10-135.us-west-2.compute.amazonaws.com Hive Interactive

Hands-On: Running Hive Interactive on Amazon EMR

Running Hive on Amazon EMR
Amazon EMR enables you to run Hive scripts in two modes:
● Interactive
● Batch

Upload an input file to Amazon S3
2. Create a folder: data
3. Upload hdi-data.csv to the data folder

Running Hive Interactive

Select EC2 Key Pair

Find Job Flow ID

Running CLI to check the Job Flow
$ ./elastic-mapreduce --list -j j-37WK3Z1T2FZ7D
j-37WK3Z1T2FZ7D STARTING ec2-54-213-119-89.us-west-2.compute.amazonaws.com Hive Interactive Demo
   PENDING Setup Hive
$ ./elastic-mapreduce --list -j j-37WK3Z1T2FZ7D
j-37WK3Z1T2FZ7D RUNNING ec2-54-213-119-89.us-west-2.compute.amazonaws.com Hive Interactive Demo
   RUNNING Setup Hive
$ ./elastic-mapreduce --ssh j-37WK3Z1T2FZ7D
hadoop@ip-172-31-24-126:~$ hive
Logging initialized using configuration in file:/home/hadoop/.versions/hive-0.8.1/conf/hive-log4j.properties
Hive history file=/mnt/var/lib/hive_081/tmp/history/hive_job_log_hadoop_201308011448_800175951.txt
hive>

Create a table using HiveQL
hive> CREATE TABLE HDI(
> id INT, country STRING, hdi FLOAT, lifeex INT, mysch INT, eysch
> INT, gni INT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ","
> STORED AS TEXTFILE
> LOCATION "s3://imcbucket/data";
OK
hive> SHOW TABLES;
OK
hdi

Running a SELECT statement
hive> SELECT country, gni FROM hdi WHERE gni > 2000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201308011444_0001, Tracking URL = http://ip-172-31-24-126:9100/jobdetails.jsp?jobid=job_201308011444_0001
Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=172.31.24.126:9001 -kill job_201308011444_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-08-01 14:55:53,846 Stage-1 map = 0%, reduce = 0%
2013-08-01 14:58:37,725 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.52 sec
MapReduce Total cumulative CPU time: 15 seconds 520 msec
Ended Job = job_201308011444_0001
Counters:
MapReduce Jobs Launched:
Job 0: Map: 1 Accumulative CPU: 15.52 sec HDFS Read: 372 HDFS Write: 2435 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 520 msec
OK
Norway 47557
Australia 34431
Netherlands 36402
United States 43017
New Zealand 23737
...

Lecture: Understanding Pig

Introduction
A high-level platform for creating MapReduce programs using Hadoop
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig Components
● Two Components
  ● Language (Pig Latin)
  ● Compiler
● Two Execution Environments
  ● Local
    pig -x local
  ● Distributed
    pig -x mapreduce

Running Pig
● Script
  pig myscript
● Command line (Grunt)
  pig
● Embedded
  Writing a Java program

Pig Latin

Pig Execution Stages
Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi

Why Pig?
● Makes writing Hadoop jobs easier
  ● 5% of the code, 5% of the time
  ● You don't need to be a programmer to write Pig scripts
● Provides the major functionality required for Data Warehouse and Analytics
  ● Load, Filter, Join, Group By, Order, Transform
● Users can write custom UDFs (User Defined Functions)
Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi

Pig v.s. Hive

Hands-On: Running a Pig script

Starting Pig Command Line
[hdadmin@localhost ~]$ pig -x local
2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53
2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hdadmin/pig_1375327740024.log
2013-08-01 10:29:00,066 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hdadmin/.pigbootup not found
2013-08-01 10:29:00,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>

Writing a Pig Script
#Preparing Data
[hdadmin@localhost ~]$ cp hadoop_data/hdi-data.csv /usr/local/pig-0.11.1/bin/
#Edit Your Script
[hdadmin@localhost ~]$ cd /usr/local/pig-0.11.1/bin/
[hdadmin@localhost ~]$ vi countryFilter.pig
countryFilter.pig
A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
dump C;

Running a Pig Script
[hdadmin@localhost ~]$ cd /usr/local/pig-0.11.1/bin/
[hdadmin@localhost ~]$ pig -x local
grunt> run countryFilter.pig
....
(150,Cameroon,0.482,51,5,10,2031)
(126,Kyrgyzstan,0.615,67,9,12,2036)
(156,Nigeria,0.459,51,5,8,2069)
(154,Yemen,0.462,65,2,8,2213)
(138,Lao People's Democratic Republic,0.524,67,4,9,2242)
(153,Papua New Guinea,0.466,62,4,5,2271)
(165,Djibouti,0.43,57,3,5,2335)
(129,Nicaragua,0.589,74,5,10,2430)
(145,Pakistan,0.504,65,4,6,2550)
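For comparison, the FILTER and ORDER steps of countryFilter.pig can be written by hand in plain Java. This is a sketch only: the class is hypothetical, rows are reduced to two columns (country, gni), three gni values are taken from the output above, and the row "Lowland" is made up to show a record that the filter drops.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory equivalent of countryFilter.pig, reduced to two columns.
final class CountryFilterSketch {

    static final class Country {
        final String name;
        final int gni;

        Country(String name, int gni) {
            this.name = name;
            this.gni = gni;
        }
    }

    static List<String> filterAndOrder(List<Country> rows, int minGni) {
        List<Country> kept = new ArrayList<>();
        for (Country c : rows) {
            if (c.gni > minGni) { // B = FILTER A BY gni > 2000;
                kept.add(c);
            }
        }
        kept.sort(Comparator.comparingInt((Country c) -> c.gni)); // C = ORDER B BY gni;
        List<String> names = new ArrayList<>();
        for (Country c : kept) {
            names.add(c.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Country> rows = Arrays.asList(
                new Country("Pakistan", 2550),
                new Country("Cameroon", 2031),
                new Country("Lowland", 1500), // made-up row below the threshold
                new Country("Nigeria", 2069));
        System.out.println(filterAndOrder(rows, 2000)); // prints [Cameroon, Nigeria, Pakistan]
    }
}
```

The Pig script expresses the same logic in two statements and, in mapreduce mode, runs it as MapReduce jobs instead of in a single JVM.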

Writing a Join operation script
CountryJoin.pig
A = load 'hdi-data.csv' using PigStorage(',') AS (id:int,
country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int,
gni:int);
B = FILTER A BY gni> 2000;
C = ORDER B BY gni;
D = load 'export-data.csv' using PigStorage(',') AS
(country:chararray, expct:float);
E = JOIN C BY country, D by country;
dump E;
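The JOIN statement keeps only the countries that appear in both inputs. A minimal sketch of that inner-join idea in plain Java follows; the class is hypothetical, each input is simplified to a map with one value per country, and the export figure 12.5 is made up for illustration. Pig's own output tuples carry all fields of both relations, which this sketch does not reproduce.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an inner join on the country key.
final class CountryJoinSketch {

    // Keeps only the keys that are present on both sides.
    static List<String> innerJoin(Map<String, Integer> gniByCountry, Map<String, Double> exportByCountry) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, Integer> left : gniByCountry.entrySet()) {
            Double right = exportByCountry.get(left.getKey());
            if (right != null) {
                joined.add(left.getKey() + "," + left.getValue() + "," + right);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, Integer> gni = new LinkedHashMap<>();
        gni.put("Nigeria", 2069);
        gni.put("Pakistan", 2550);
        Map<String, Double> exports = new LinkedHashMap<>();
        exports.put("Nigeria", 12.5); // made-up value
        System.out.println(innerJoin(gni, exports)); // prints [Nigeria,2069,12.5]
    }
}
```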

Hands-On: Running a Pig script on Amazon EMR

Upload .pig file to Amazon S3
2. Upload countryFilter-EMR.pig to the data folder

Creating a Pig program

Viewing a result

Lecture: Understanding HBase

Introduction
An open source, non-relational, distributed database
HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

HBase Features
● Column-oriented data store, known as the Hadoop Database
● Supports random real-time CRUD operations (unlike HDFS)
● NoSQL database
● Open source, written in Java
● Runs on a cluster of commodity hardware

HBase Architecture

When to use HBase?
● When you need high volume data to be stored
● Un-structured data
● Sparse data
● Column-oriented data
● Versioned data (same data template, captured at various times; time-lapse data)
● When you need high scalability

Hands-On: Running HBase

Starting HBase shell
[hdadmin@localhost ~]$ start-hbase.sh
starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out
[hdadmin@localhost ~]$ jps
3064 TaskTracker
2836 SecondaryNameNode
2588 NameNode
3513 Jps
3327 HMaster
2938 JobTracker
2707 DataNode
[hdadmin@localhost ~]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013
hbase(main):001:0>

Create a table and insert data in HBase
hbase(main):009:0> create 'test', 'cf'
0 row(s) in 1.0830 seconds
hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1'
hbase(main):011:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1375363287644, value=val1
hbase(main):002:0> get 'test', 'row1'
COLUMN CELL
cf:a timestamp=1375363287644, value=val1
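The scan and get output above shows that every cell is addressed by row key, column (family:qualifier) and timestamp. One way to picture this data model is as nested sorted maps. The class below is a hypothetical in-memory sketch of that idea, not HBase client code, and it ignores regions, persistence and deletes.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: row key -> "family:qualifier" -> timestamp (newest first) -> value.
final class HBaseModelSketch {

    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<String, NavigableMap<String, NavigableMap<Long, String>>>();

    // Like the shell's put: writing the same cell again adds a newer version.
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Collections.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Like the shell's get for one column: returns the newest version, or null if the cell is absent.
    String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null || !columns.containsKey(column)) {
            return null;
        }
        return columns.get(column).firstEntry().getValue();
    }

    public static void main(String[] args) {
        HBaseModelSketch table = new HBaseModelSketch();
        table.put("row1", "cf:a", 1375363287644L, "val1");
        System.out.println(table.get("row1", "cf:a")); // prints val1
        table.put("row1", "cf:a", 1375363287645L, "val2");
        System.out.println(table.get("row1", "cf:a")); // prints val2
    }
}
```

Rows that never received a value for a column simply have no entry, which is why sparse data costs no storage in this model.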

Hands-On: Running HBase commands on Amazon EMR

Create a HBase shell

Starting HBase Shell
$ ./elastic-mapreduce --list -j j-3MKWRS0K8IH7K
j-3MKWRS0K8IH7K WAITING ec2-54-213-117-162.us-west-2.compute.amazonaws.com HBase Interactive
   COMPLETED Start HBase
$ ./elastic-mapreduce --ssh j-3MKWRS0K8IH7K
hadoop@ip-172-31-33-161:~$ hbase shell

Recommendations for Further Study
• Hadoop Beginner's Guide
• Hadoop: The Definitive Guide, 3rd Edition
• Hadoop in Practice
• Hadoop MapReduce Cookbook
• Amazon Elastic MapReduce Developer Guide

Thank you
