Hadoop Workshop on EC2 : March 2015

Danairat T., 2013, danairat@gmail.com. Big Data Hadoop – Hands On Workshop
Big Data using Hadoop
Hands On Workshop
March 2015
Dr.Thanachart Numnonda
Certified Java Programmer
thanachart@imcinstitute.com
Danairat T.
Certified Java Programmer, TOGAF – Silver
danairat@gmail.com, +66-81-559-1446

Danairat T., danairat@gmail.com; Thanachart Numnonda, thanachart@imcinstitute.com. Aug 2013. Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Launch a virtual server
on EC2 Amazon Web Services

Hadoop Installation
Hadoop provides three installation choices:
1. Local mode: an unzip-and-run mode that gets you started right away; all parts of Hadoop run within the same JVM
2. Pseudo-distributed mode: the different parts of Hadoop run as separate Java processes, but within a single machine
3. Distributed mode: the real setup, which spans multiple machines

Virtual Server
This lab uses an EC2 virtual server to install a
Hadoop server with the following features:
● Ubuntu Server 14.04 LTS
● m3.medium: 1 vCPU, 3.75 GB memory
● Security group: default
● Keypair: imchadoop

Select the EC2 service and click Launch Instance

Select an Amazon Machine Image (AMI):
Ubuntu Server 14.04 LTS (PV)

Choose the m3.medium instance type

Leave configuration details as default

Add Storage: 20 GB

Name the instance

Select an existing security group > Select Security
Group Name: default

Click Launch and choose imchadoop as a key pair

Review the instance, then click Connect for
instructions on connecting to it

Connect to an instance from Mac/Linux

Connect to an instance from Windows using Putty

Connect to the instance

Hands-On: Installing Hadoop

Installing Hadoop and Ecosystem
1. Update the system
2. Configuring SSH
3. Installing JDK 1.7
4. Download/Extract Hadoop
5. Installing Hadoop
6. Configure xml files
7. Formatting HDFS
8. Start Hadoop
9. Hadoop Web Console
10. Stop Hadoop
Notes:
Hadoop and IPv6: Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4
stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will
encounter problems. Source: http://wiki.apache.org/hadoop/HadoopIPv6

1) Update the system: sudo apt-get update

2. Configuring SSH: ssh-keygen

Enabling SSH access to your local machine
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Testing the SSH setup by connecting to your local machine
$ ssh 54.68.149.232
Type Exit
$ exit

3) Install JDK 1.7: sudo apt-get install openjdk-7-jdk
(Enter Y when prompted)
(Type command > java -version)

4) Download/Extract Hadoop
1) Type command > wget http://mirror.issp.co.th/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
2) Type command > tar -xvzf hadoop-1.2.1.tar.gz
3) Type command > sudo mv hadoop-1.2.1 /usr/local/hadoop

5) Installing Hadoop
1) Type command > sudo vi $HOME/.bashrc
2) Add the configuration as shown in the figure below
3) Type command > exec bash
4) Type command > sudo vi /usr/local/hadoop/conf/hadoop-env.sh
5) Edit the file as shown in the figure below

6) Configuring Hadoop conf/*-site.xml
1. core-site.xml (hadoop.tmp.dir, fs.default.name)
2. hdfs-site.xml (dfs.replication)
3. mapred-site.xml (mapred.job.tracker)

Configuring core-site.xml
1) Type command > sudo vi /usr/local/hadoop/conf/core-site.xml
2) Add the private IP of the server as shown in the figure below
(in this case the private IP is 172.31.12.11)

Configuring mapred-site.xml
1) Type command > sudo vi /usr/local/hadoop/conf/mapred-site.xml
2) Add the private IP of the JobTracker server as shown in the figure below
(in this case the private IP is 172.31.12.11)

Configuring hdfs-site.xml
1) Type command > sudo vi /usr/local/hadoop/conf/hdfs-site.xml
2) Add the configuration as shown in the figure below

7) Formatting HDFS
1) Type command > sudo mkdir /usr/local/hadoop/tmp
2) Type command > sudo chown ubuntu /usr/local/hadoop
3) Type command > sudo chown ubuntu /usr/local/hadoop/tmp
4) Type command > hadoop namenode -format

Starting Hadoop
ubuntu@ip-172-31-12-11:~$ start-all.sh
Starting up a Namenode, Datanode, Jobtracker and a Tasktracker on your machine.
ubuntu@ip-172-31-12-11:~$ jps
11567 Jps
10766 NameNode
11099 JobTracker
11221 TaskTracker
10899 DataNode
11018 SecondaryNameNode
ubuntu@ip-172-31-12-11:~$
Checking the Java processes: you are now running Hadoop in pseudo-distributed mode

Hadoop is up!
Viewing the Hadoop HDFS using WebUI
http://54.68.149.232:50070/

Stopping Hadoop
ubuntu@ip-172-31-12-11:~$ /usr/local/hadoop/bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Hands-On: Importing Data to HDFS
using Hadoop Command Line

Importing Data to Hadoop
Download War and Peace Full Text
www.gutenberg.org/ebooks/2600

Importing Data to Hadoop
Download the file pg2600.txt
$ wget https://dl.dropboxusercontent.com/u/12655380/pg2600.txt
Import to Hadoop
$ hadoop fs -mkdir /input
$ hadoop fs -mkdir /output
$ hadoop fs -copyFromLocal pg2600.txt /input

Hands-On: Reviewing, Retrieving,
Deleting Data from HDFS

Review file in Hadoop HDFS
List HDFS File
Read HDFS File
ubuntu@ip-172-31-12-11:~$ hadoop fs -cat /input/pg2600.txt
Retrieve HDFS File to Local File System
ubuntu@ip-172-31-12-11:~$ hadoop fs -copyToLocal /input/pg2600.txt /tmp/file.txt
Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html

Review file in Hadoop HDFS using WebUI

Hadoop Port Numbers
        Daemon              Default Port   Configuration Parameter in conf/*-site.xml
HDFS    Namenode            50070          dfs.http.address
        Datanodes           50075          dfs.datanode.http.address
        Secondarynamenode   50090          dfs.secondary.http.address
MR      JobTracker          50030          mapred.job.tracker.http.address
        Tasktrackers        50060          mapred.task.tracker.http.address

Review Content from System shell

Removing data from HDFS using
Shell Command
[hdadmin@localhost detach]$ hadoop dfs -rm /input/input_test.txt
Deleted hdfs://localhost:54310/input/input_test.txt
[hdadmin@localhost detach]$

Lecture: Understanding Map Reduce Processing
[Figure: Client; Name Node and Job Tracker; three worker nodes, each with a Data Node and a Task Tracker; Map Reduce]

High Level Architecture of MapReduce

Before MapReduce…
● Large scale data processing was difficult!
  – Managing hundreds or thousands of processors
  – Managing parallelization and distribution
  – I/O Scheduling
  – Status and monitoring
  – Fault/crash tolerance
● MapReduce provides all of these, easily!
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html

MapReduce Overview
● What is it?
  – Programming model used by Google
  – A combination of the Map and Reduce models with an associated implementation
  – Used for processing and generating large data sets

MapReduce Overview
● How does it solve our previously mentioned problems?
  – MapReduce is highly scalable and can be used across many computers.
  – Many small machines can be used to process jobs that normally could not be processed by a large machine.

MapReduce Framework
Source: www.bigdatauniversity.com

How Map and Reduce Work Together

How Map and Reduce Work Together
● Map returns information
● Reduce accepts information
● Reduce applies a user-defined function to reduce the amount of data

Map Abstraction
● Inputs a key/value pair
  – Key is a reference to the input value
  – Value is the data set on which to operate
● Evaluation
  – Function defined by the user
  – Applied to every value in the input
● Might need to parse input
● Produces a new list of key/value pairs
  – Can be of a different type from the input pair

Reduce Abstraction
● Starts with intermediate Key / Value pairs
● Ends with finalized Key / Value pairs
● Starting pairs are sorted by key
● Iterator supplies the values for a given key to the Reduce function.

Reduce Abstraction
● Typically a function that:
  – Starts with a large number of key/value pairs
    ● One key/value for each word in all files being grepped (including multiple entries for the same word)
  – Ends with very few key/value pairs
    ● One key/value for each unique word across all the files, with the number of instances summed into this entry
● Broken up so a given worker works with input of the same key.

Other Applications
● Yahoo!
  – Webmap application uses Hadoop to create a database of information on all known webpages
● Facebook
  – Hive data center uses Hadoop to provide business statistics to application developers and advertisers
● Rackspace
  – Analyzes server log files and usage data using Hadoop

Why is this approach better?
● Creates an abstraction for dealing with complex overhead
  – The computations are simple, the overhead is messy
● Removing the overhead makes programs much smaller and thus easier to use
  – Less testing is required as well. The MapReduce libraries can be assumed to work properly, so only user code needs to be tested
● Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation

MapReduce Framework
map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
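
The two signatures above can be tried out without a cluster. What follows is a minimal single-JVM sketch in plain Java, not Hadoop code: the class name MapReduceSketch and its methods map, reduce and run are hypothetical names chosen for illustration, an in-memory TreeMap stands in for Hadoop's shuffle and sort, and the offset passed to map is only an approximation based on character counts.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-JVM imitation of the map -> group -> reduce flow.
class MapReduceSketch {

    // map: (K1, V1) -> list(K2, V2); here (offset, line) -> list of (word, 1)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // reduce: (K2, list(V2)) -> (K3, V3); here (word, [1, 1, ...]) -> (word, sum)
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return new SimpleEntry<>(key, sum);
    }

    // Runs map over every line, groups the intermediate values by key, then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // TreeMap keeps keys sorted
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1; // approximate: assumes one-byte characters and '\n'
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> r = reduce(e.getKey(), e.getValue());
            result.put(r.getKey(), r.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Deer Bear River", "Car Car River", "Deer Car Bear")));
    }
}
```

In this sketch K1 is a long, V1 and K2 are strings, and V2 and V3 are integers; a real job would use Hadoop's Writable types instead.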

How does MapReduce work?
[Figure labels: InputSplit; RecordReader; Sorting; Partitioning; RecordWriter; output in a list of (Key, List of Values) in the intermediate file; output in a list of (Key, Value)]

How does MapReduce work?
[Figure labels: InputSplit; RecordReader; Combining; Sorting; Partitioning; RecordWriter; output in a list of (Key, List of Values); output in a list of (Key, Value)]
[Sample values in the figure: Car, 2; Car, 2; Bear, {1,1}; Car, {2,1}; River, {1,1}; Deer, {1,1}]
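
The combining step on this slide can be pictured as a local pre-aggregation of one mapper's output before it is shuffled, which is why Car arrives at the reducer as {2,1} rather than {1,1,1}. The following plain-Java sketch is an illustration under that reading, not Hadoop's combiner API; CombinerSketch, combine and reduce are hypothetical names, and the word lists assigned to each mapper are assumed for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: local aggregation on the map side, then a final sum on the reduce side.
class CombinerSketch {

    // Combiner: sums the counts per word within the output of a single mapper.
    static Map<String, Integer> combine(List<String> wordsFromOneMapper) {
        Map<String, Integer> local = new LinkedHashMap<>();
        for (String word : wordsFromOneMapper) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }

    // Reducer: sums the (possibly pre-combined) counts received for one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Assumed split: one mapper saw "Car Car River", another saw "Deer Car Bear".
        Map<String, Integer> first = combine(Arrays.asList("Car", "Car", "River"));
        Map<String, Integer> second = combine(Arrays.asList("Deer", "Car", "Bear"));
        // The reducer for "Car" receives one pre-summed value from each mapper.
        int total = reduce(Arrays.asList(first.get("Car"), second.get("Car")));
        System.out.println("Car, " + total);
    }
}
```

Because addition is associative and commutative, summing partial counts gives the same total as summing the individual ones; that property is what makes a combiner safe for word count.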

MapReduce Processing – The Data Flow
1. InputFormat, InputSplits, RecordReader
2. Mapper - your focus is here
3. Partition, Shuffle & Sort
4. Reducer - your focus is here
5. OutputFormat, RecordWriter

InputFormat
InputFormat               Description                                        Key                                        Value
TextInputFormat           Default format; reads lines of text files          The byte offset of the line                The line contents
KeyValueTextInputFormat   Parses lines into key, val pairs                   Everything up to the first tab character   The remainder of the line
SequenceFileInputFormat   A Hadoop-specific high-performance binary format   user-defined                               user-defined
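
To make the Key and Value columns concrete, the sketch below derives (byte offset, line) records from a piece of text, and splits a line at its first tab. It is plain Java written for illustration rather than a use of Hadoop's classes: LineRecordSketch, records and keyValue are hypothetical names, UTF-8 input is assumed, and the choice to return an empty value for a line without a tab is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: how line-oriented input can be turned into key/value records.
class LineRecordSketch {

    // Text-style records: key = byte offset where the line starts, value = line contents.
    static Map<Long, String> records(String text) {
        Map<Long, String> out = new LinkedHashMap<>();
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        int start = 0;
        for (int i = 0; i <= bytes.length; i++) {
            boolean atEnd = (i == bytes.length);
            if (atEnd || bytes[i] == '\n') {
                // Skip the empty remainder that follows a trailing newline.
                if (!atEnd || i > start) {
                    out.put((long) start, new String(bytes, start, i - start, StandardCharsets.UTF_8));
                }
                start = i + 1;
            }
        }
        return out;
    }

    // Key/value-style records: key = text before the first tab, value = the remainder.
    static String[] keyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] {line, ""}; // assumption of this sketch for lines without a tab
        }
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        System.out.println(records("Deer Bear\nCar\n"));
        String[] kv = keyValue("66\tThailand");
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```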

InputSplit
An InputSplit describes a unit of work that comprises a single map
task.
InputSplit presents a byte-oriented view of the input.
You can influence the split size by setting the mapred.min.split.size
parameter in core-site.xml, or by overriding the parameter in the
JobConf object used to submit a particular MapReduce job.
RecordReader
RecordReader reads <key, value> pairs from an InputSplit.
Typically the RecordReader converts the byte-oriented view of
the input, provided by the InputSplit, and presents a record-oriented
view to the Mapper.

Mapper
Mapper: The Mapper applies the user-defined logic to each input (key, value)
pair and emits (key, value) pair(s), which are forwarded to the
Reducers.
Partition, Shuffle & Sort
After the first map tasks have completed, the nodes may still be
performing several more map tasks each. But they also begin
exchanging the intermediate outputs from the map tasks to where they
are required by the reducers.
This process of moving map outputs to the reducers is known as
shuffling.
The Partitioner controls how map outputs are assigned to reduce
tasks. The total number of partitions is the same as the number of reduce
tasks for the job.
The set of intermediate keys on a single node is automatically sorted
by Hadoop before they are presented to the Reducer.
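
One common way to assign a key to a partition is to hash it and take the remainder modulo the number of reduce tasks. The sketch below shows that idea in plain Java; PartitionSketch and partitionFor are hypothetical names, and it hashes a Java String, whereas a real job would hash the Hadoop key type, so the partition numbers printed here are illustrative only.

```java
import java.util.Arrays;

// Illustrative only: hash-based assignment of keys to reduce tasks.
class PartitionSketch {

    // Clears the sign bit so the result is non-negative, then maps the hash onto [0, numReduceTasks).
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2;
        for (String key : Arrays.asList("Bear", "Car", "Deer", "River")) {
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Every occurrence of the same key produces the same partition number, which is what guarantees that all values for one key reach the same reducer.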

Reducer
This is an instance of user-provided code that reads each key and its
iterator of values in the assigned partition. The OutputCollector
object in the Reducer phase has a method named collect(), which
collects a (key, value) output.
OutputFormat, RecordWriter
OutputFormat governs the format in which the OutputCollector output is written, and
RecordWriter writes the output into HDFS.
OutputFormat               Description
TextOutputFormat           Default; writes lines in "key \t value" form
SequenceFileOutputFormat   Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat           Generates no output files

Hands-On: Writing your own Map
Reduce Program

WordCount (Hello World in Hadoop)
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token of each input line
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Hands-On: Packaging Map Reduce
and Deploying to Hadoop Runtime
Environment

Packaging Map Reduce Program
Usage
Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version
installed, compile WordCount.java and create a jar:
$ wget https://dl.dropboxusercontent.com/u/12655380/WordCount.java
$ mkdir hduser
$ javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d hduser WordCount.java
$ jar -cvf ./wordcount.jar -C hduser/ .
$ hadoop jar ./wordcount.jar org.myorg.WordCount /input/* /output/wordcount_output_dir
Output:
…….
$ hadoop fs -cat /output/wordcount_output_dir/part-00000

Reviewing MapReduce Output Result

Hands-On: Writing Map/Reduce
Program on Eclipse

Starting Eclipse

Create a Java Project
Let's name it HadoopWordCount

Add dependencies to the project
● Add the following two JARs to your build path:
  ● hadoop-common.jar and hadoop-mapreduce-client-core.jar. Both can be found at /usr/lib/hadoop/client
● Perform the following steps:
  – Add a folder named lib to the project
  – Copy the mentioned JARs into this folder
  – Right-click on the project name >> select Build Path >> then Configure Build Path
  – Click on Add Jars, select these two JARs from the lib folder

Add dependencies to the project

Writing a source code
● Right click the project, then select New >> Package
● Name the package org.myorg
● Right click on org.myorg, then select New >> Class
● Name the class WordCount
● Write the source code as shown in the previous slides

Building a Jar file
● Right click the project, then select Export
● Select Java and then JAR file
● Provide the JAR name, wordcount.jar
● Leave the JAR package options as default
● In the JAR Manifest Specification section, at the bottom, specify the Main class
● In this case, select WordCount
● Click on Finish
● The JAR file will be built and will be located at cloudera/workspace
Note: you may need to adjust the dialog font size by selecting
Windows >> Preferences >> Appearance >> Colors and Fonts

Lecture
Understanding Hive

Introduction
A Petabyte Scale Data Warehouse Using Hadoop
Hive was developed by Facebook and is designed to enable easy data
summarization, ad-hoc querying and analysis of large
volumes of data. It provides a simple query language called
HiveQL, which is based on SQL.

What Hive is NOT
Hive is not designed for online transaction processing and
does not offer real-time queries and row level updates. It is
best used for batch jobs over large sets of immutable data
(like web logs, etc.).

Hive Metastore
● Stores Hive metadata
● Configurations
  – Embedded: in-process metastore, in-process database
  – Local: in-process metastore, out-of-process database
  – Remote: out-of-process metastore, out-of-process database

Hive Schema-On-Read
● Faster loads into the database (simply copy or move)
● Slower queries
● Flexibility: multiple schemas for the same data

HiveQL
● Hive Query Language
● SQL dialect
● No support for:
  – UPDATE, DELETE
  – Transactions
  – Indexes
  – HAVING clause in SELECT
  – Updateable or materialized views
  – Stored procedures
Hive Tables
● Managed: CREATE TABLE
  – LOAD: File moved into Hive's data warehouse directory
  – DROP: Both data and metadata are deleted
● External: CREATE EXTERNAL TABLE
  – LOAD: No file moved
  – DROP: Only metadata deleted
  – Use when sharing data between Hive and Hadoop applications, or when you want to use multiple schemas on the same data

Running Hive
Hive Shell
● Interactive: hive
● Script: hive -f myscript
● Inline: hive -e 'SELECT * FROM mytable'
Hive.apache.org

System Architecture and Components
• Metastore: To store the metadata.
• Query compiler and execution engine: To convert SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop.
• SerDe and ObjectInspectors: Programmable interfaces and implementations of common data formats and types.
  A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system.
• UDF and UDAF: Programmable interfaces and implementations for user defined functions (scalar and aggregate functions).
• Clients: Command line client similar to the MySQL command line.
hive.apache.org
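
The SerDe description above can be illustrated with a toy round trip between a comma-delimited text record and a Java object. This is a plain-Java sketch and not Hive's SerDe interface: CountryRowSerDeSketch, Row, deserialize and serialize are hypothetical names, and the (id, country) layout mirrors the two-column test_tbl table used in this workshop.

```java
// Illustrative only: text record -> Java object -> text record, in the spirit of a SerDe.
class CountryRowSerDeSketch {

    // The in-memory form that the query engine would work with.
    static final class Row {
        final int id;
        final String country;

        Row(int id, String country) {
            this.id = id;
            this.country = country;
        }
    }

    // Deserializer: comma-delimited text -> Row. Splits only at the first comma.
    static Row deserialize(String line) {
        String[] fields = line.split(",", 2);
        if (fields.length != 2) {
            throw new IllegalArgumentException("expected two fields: " + line);
        }
        return new Row(Integer.parseInt(fields[0].trim()), fields[1]);
    }

    // Serializer: Row -> comma-delimited text, ready to be written back to storage.
    static String serialize(Row row) {
        return row.id + "," + row.country;
    }

    public static void main(String[] args) {
        Row row = deserialize("66,Thailand");
        System.out.println(row.id + " / " + row.country);
        System.out.println(serialize(row));
    }
}
```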

Architecture Overview
[Figure: Hive architecture. Components shown: Hive CLI, Mgmt. Web UI, Thrift API, Hive QL (Parser, Planner, Execution), MetaStore, SerDe (Thrift, Jute, JSON, ..), Map Reduce, HDFS. Client operations shown: DDL, Queries, Browsing]
Hive.apache.org

Sample HiveQL
The query compiler uses the information stored in the metastore to
convert SQL queries into a sequence of map/reduce jobs, e.g. the
following queries:
SELECT * FROM t where t.c = 'xyz'
SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)
SELECT t1.c1, count(1) from t1 group by t1.c1
Hive.apache.org

Hands-On: Creating Table and
Retrieving Data using Hive

Hive Hands-On Labs
1. Installing Hive
2. Configuring / Starting Hive
3. Creating Hive Table
4. Reviewing Hive Table in HDFS
5. Alter and Drop Hive Table
6. Preparing Dataset
7. Loading Data to Hive Table
8. Querying Data from Hive Table
9. Reviewing Hive Table Content from HDFS Command
and WebUI

1. Installing Hive
# wget http://apache.mesi.com.ar/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz
# tar -xvzf apache-hive-1.1.0-bin.tar.gz
# sudo mv apache-hive-1.1.0-bin /usr/local
# rm apache-hive-1.1.0-bin.tar.gz
Install Hive binary file

1. Installing Hive
Edit $HOME/.bashrc
# sudo vi $HOME/.bashrc

2. Configuring Hive
Creating HDFS Directory for Hive
Create hdfs /tmp and /user/hive/warehouse directory
[hdadmin@localhost ~]$ hadoop fs -mkdir /tmp/hive
[hdadmin@localhost ~]$ hadoop fs -mkdir /user/hive/warehouse
[hdadmin@localhost ~]$ hadoop fs -chmod 777 /tmp/hive
[hdadmin@localhost ~]$ hadoop fs -chmod 777 /user/hive/warehouse

2. Start Hive
Starting Hive
hive> quit;
Quit from Hive

3. Creating Hive Table
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 4.069 seconds
hive (default)> show tables;
OK
test_tbl
hive (default)> describe test_tbl;
OK
id int
country string
hive (default)>
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html

4. Reviewing Hive Table in HDFS
[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x - hdadmin supergroup 0 2013-03-17 17:51 /user/hive/warehouse/test_tbl
[hdadmin@localhost hdadmin]$
Review Hive Table from
HDFS WebUI

5. Alter and Drop Hive Table
hive (default)> alter table test_tbl add columns (remarks STRING);
hive (default)> describe test_tbl;
OK
id int
country string
remarks string
hive (default)> drop table test_tbl;
OK
See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html

6. Preparing Large Dataset
http://grouplens.org/datasets/movielens/

MovieLens Dataset
1) Type command > wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
2) Type command > sudo apt-get install unzip
3) Type command > unzip ml-100k.zip
4) Type command > more ml-100k/u.user

6. Loading Data to Hive Table
hive (default)> exit;
ubuntu@ip-172-31-12-11:~/ml-100k$ hadoop fs -put u.user /dataset/movielens/users
Loading data to Hive table
$ hive
hive (default)> CREATE EXTERNAL TABLE users (userid INT, age INT,
gender STRING, occupation STRING, zipcode STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '/dataset/movielens/users';
Creating Hive table

7. Querying Data from Hive Table

8. Loading Data to test_tbl Table
$ hive
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Creating Hive table
hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data.csv' INTO TABLE
test_tbl;
Copying data from file:/tmp/test_tbl_data.csv
Copying file: file:/tmp/test_tbl_data.csv
Loading data to table default.test_tbl
OK
hive (default)>
Loading data to Hive table

9. Reviewing Hive Table Content from HDFS Command
and WebUI
[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl
Found 1 items
-rw-r--r-- 1 hdadmin supergroup 59 2013-03-17 18:08
/user/hive/warehouse/test_tbl/test_tbl_data.csv
[hdadmin@localhost hdadmin]$ hadoop fs -cat
/user/hive/warehouse/test_tbl/test_tbl_data.csv
1,USA
62,Indonesia
63,Philippines
65,Singapore
66,Thailand

Loading Data to Hive Table
$ hive
hive (default)> CREATE TABLE products
(
prod_name STRING,
description STRING,
category STRING,
qty_on_hand INT,
prod_num STRING,
packaged_with ARRAY<STRING>
)
row format delimited
fields terminated by ','
collection items terminated by ':'
stored as textfile;
Creating Hive table

Lecture
Understanding Pig

Introduction
A high-level platform for creating MapReduce programs using Hadoop
Pig is a platform for analyzing large data sets that consists of
a high-level language for expressing data analysis programs,
coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is
amenable to substantial parallelization, which in turn enables
them to handle very large data sets.

Pig Components
● Two Components
  ● Language (Pig Latin)
  ● Compiler
● Two Execution Environments
  ● Local: pig -x local
  ● Distributed: pig -x mapreduce

Running Pig
● Script: pig myscript
● Command line (Grunt): pig
● Embedded: writing a Java program

Pig Latin

Pig Execution Stages
Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi

Why Pig?
● Makes writing Hadoop jobs easier
  ● 5% of the code, 5% of the time
  ● You don't need to be a programmer to write Pig scripts
● Provides the major functionality required for data warehousing and analytics
  ● Load, Filter, Join, Group By, Order, Transform
  ● Users can write custom UDFs (User Defined Functions)
Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi

Pig vs. Hive

Hands-On: Running a Pig script

Installing Pig
# wget http://archive.apache.org/dist/hadoop/pig/stable/pig-0.7.0.tar.gz
# tar -xvzf pig-0.7.0.tar.gz
# sudo mv pig-0.7.0 /usr/local/
# rm pig-0.7.0.tar.gz
Install Pig binary file

Installing Pig
Edit $HOME/.bashrc

Starting Pig Command Line

countryFilter.pig
A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float,
lifeex:int, mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
dump C;
#Preparing Data
ubuntu@ip-172-31-12-11:~$ wget https://www.dropbox.com/s/pp168a6oiwqkxyu/hdi-data.csv
#Edit Your Script
ubuntu@ip-172-31-12-11:~$ vi countryFilter.pig
Writing a Pig Script

ubuntu@ip-172-31-12-11:~$ pig -x local
grunt> run countryFilter.pig
Running a Pig Script
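
For readers more at home in Java, the FILTER and ORDER statements of countryFilter.pig correspond roughly to a filter followed by a sort over a collection. The sketch below is a plain-Java analogy, not Pig or Hadoop code: CountryFilterSketch, Country and filterAndOrder are hypothetical names, only the country and gni columns are modelled, and the rows in main are made-up placeholder values, not taken from hdi-data.csv.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: the FILTER and ORDER steps of the Pig script as Java stream operations.
class CountryFilterSketch {

    static final class Country {
        final String name;
        final int gni;

        Country(String name, int gni) {
            this.name = name;
            this.gni = gni;
        }
    }

    static List<Country> filterAndOrder(List<Country> rows) {
        return rows.stream()
                .filter(c -> c.gni > 2000)                        // B = FILTER A BY gni > 2000;
                .sorted((a, b) -> Integer.compare(a.gni, b.gni))  // C = ORDER B BY gni;
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Placeholder rows invented for this example.
        List<Country> rows = Arrays.asList(
                new Country("CountryA", 1500),
                new Country("CountryB", 9000),
                new Country("CountryC", 3000));
        for (Country c : filterAndOrder(rows)) {     // dump C;
            System.out.println(c.name + "," + c.gni);
        }
    }
}
```

The difference in practice is that Pig compiles the same three statements into jobs that can run over data far larger than one JVM's memory.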

Lecture: Understanding Sqoop

Introduction
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line
tool with the following capabilities:
• Imports individual tables or entire databases to files in HDFS
• Generates Java classes to allow you to interact with your imported data
• Provides the ability to import from SQL databases straight into your Hive data warehouse
See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html

Architecture Overview

Hands-On: Loading Data from DBMS
to Hadoop HDFS

Sqoop Hands-On Labs
1. Loading Data into MySQL DB
2. Installing Sqoop
3. Configuring Sqoop
4. Installing DB driver for Sqoop
5. Importing data from MySQL to Hive Table
6. Reviewing data from Hive Table
7. Reviewing HDFS Database Table files

1. MySQL RDS Server on AWS
An RDS server is running on AWS with the following
configuration:
> database: imc_db
> username: admin
> password: imcinstitute
> addr: imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com
[This address may change]

1. country_tbl data
Testing data query from MySQL DB
Table name > country_tbl

2. Installing Sqoop
# wget http://apache.osuosl.org/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
# tar -xvzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
# sudo mv sqoop-1.4.5.bin__hadoop-1.0.0 /usr/local/
# rm sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz

Installing Sqoop
Edit $HOME/.bashrc

3. Configuring Sqoop
ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf/
ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf$ vi sqoop-env.sh

4. Installing DB driver for Sqoop
ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib/
ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ wget https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar
ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ exit

5. Importing data from MySQL to Hive Table
[hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com/imc_db --username admin -P --table country_tbl --hive-import --hive-table country -m 1
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
Enter password: <enter here>

6. Reviewing data from Hive Table

Open a web browser at http://54.68.149.232:50070, then navigate to /user/hive/warehouse

Lecture
Understanding HBase

Introduction
An open source, non-relational, distributed database
HBase is an open source, non-relational, distributed database
modeled after Google's BigTable and is written in Java. It is
developed as part of Apache Software Foundation's Apache
Hadoop project and runs on top of HDFS, providing
BigTable-like capabilities for Hadoop. That is, it provides a
fault-tolerant way of storing large quantities of sparse data.

HBase Features
● Hadoop database modelled after Google's Bigtable
● Column-oriented data store, known as the Hadoop Database
● Supports random real-time CRUD operations (unlike HDFS)
● NoSQL database
● Open source, written in Java
● Runs on a cluster of commodity hardware

When to use HBase?
● When you need high volume data to be stored
● Unstructured data
● Sparse data
● Column-oriented data
● Versioned data (same data template, captured at various times; time-elapse data)
● When you need high scalability

Which one to use?
● HDFS
  ● Append-only dataset (no random write)
  ● Read the whole dataset (no random read)
● HBase
  ● Need random write and/or read
  ● Need thousands of operations per second on TB+ of data
● RDBMS
  ● Data fits on one big node
  ● Need full transaction support
  ● Need real-time query capabilities

HBase Components
● Region
  ● Where the rows of a table are stored
● Region Server
  ● Hosts the tables
● Master
  ● Coordinates the Region Servers
● ZooKeeper
● HDFS
● API
  ● The Java Client API

HBase Architecture

HBase Shell Commands

Hands-On: Running HBase

Installing HBase
# wget http://apache.cs.utah.edu/hbase/hbase-1.0.0/hbase-1.0.0-bin.tar.gz
# tar -xvzf hbase-1.0.0-bin.tar.gz
# sudo mv hbase-1.0.0 /usr/local/
# rm hbase-1.0.0-bin.tar.gz

Installing HBase
Edit $HOME/.bashrc

Starting HBase shell
ubuntu@ip-172-31-12-11:~$ start-hbase.sh
starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-
master-localhost.localdomain.out
ubuntu@ip-172-31-12-11:~$ jps
3064 TaskTracker
2836 SecondaryNameNode
2588 NameNode
3513 Jps
3327 HMaster
2938 JobTracker
2707 DataNode
ubuntu@ip-172-31-12-11:~$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013
hbase(main):001:0>

Create a table and insert data in HBase
hbase(main):009:0> create 'test', 'cf'
0 row(s) in 1.0830 seconds
hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1'
hbase(main):011:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1375363287644,
value=val1
hbase(main):002:0> get 'test', 'row1'
COLUMN CELL
cf:a timestamp=1375363287644, value=val1
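
The put, get and scan output above hints at how HBase addresses data: a value is located by row key, column (family:qualifier) and timestamp. One way to picture this is as nested sorted maps. The sketch below is a plain-Java model for illustration only and does not use the HBase client API; HBaseModelSketch and its methods are hypothetical names, and the timestamps in main are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: row key -> column ("family:qualifier") -> timestamp -> value.
class HBaseModelSketch {

    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Stores one version of a cell; earlier versions are kept under their own timestamps.
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Returns the newest version of a cell, or null if the row or column is absent.
    String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null || !columns.containsKey(column)) {
            return null;
        }
        return columns.get(column).lastEntry().getValue();
    }

    // Row keys in sorted order, which is the order a full scan would visit them.
    List<String> scanRowKeys() {
        return new ArrayList<>(rows.keySet());
    }

    public static void main(String[] args) {
        HBaseModelSketch test = new HBaseModelSketch();
        test.put("row1", "cf:a", 1L, "val1");
        System.out.println(test.scanRowKeys());
        System.out.println(test.get("row1", "cf:a"));
    }
}
```

Rows that never receive a given column simply have no entry for it, which is how the model accommodates sparse data.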

Recommendations for Further Study

Thank you
www.imcinstitute.com
www.facebook.com/imcinstitute
