This document outlines an agenda for a hands-on workshop on running Hadoop on public clouds. It describes how to use Cloudera on GoGrid and Amazon EMR to run Hadoop and import/export data. It also covers writing MapReduce programs, packaging and deploying them on Hadoop, and running a sample WordCount job on Amazon EMR.
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
1. Big Data on Public Cloud
Using Cloudera on GoGrid
& Amazon EMR
Hands On Workshop
October 2014
Dr. Thanachart Numnonda
IMC Institute
thanachart@imcinstitute.com
Modified from the original version by Danairat T.
Certified Java Programmer, TOGAF – Silver
danairat@gmail.com
2. Hands-On: Running Hadoop
3. Running this lab using Cloudera Live
4. Using Cloudera Live on GoGrid
5. Alternatively: Using Cloudera VM
6. Starting Hue on Cloudera
7. Viewing HDFS
9. Hands-On: Importing/Exporting Data to HDFS
10. Importing Data to Hadoop
Download War and Peace Full Text
www.gutenberg.org/ebooks/2600
$ hadoop fs -mkdir input
$ hadoop fs -mkdir output
$ hadoop fs -copyFromLocal Downloads/pg2600.txt input
11. Review file in Hadoop HDFS
List HDFS File
[hdadmin@localhost bin]$ hadoop fs -ls input
Read HDFS File
[hdadmin@localhost bin]$ hadoop fs -cat input/pg2600.txt
Retrieve HDFS File to Local File System
[hdadmin@localhost bin]$ hadoop fs -copyToLocal input/pg2600.txt tmp/file.txt
Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
12. Review file in Hadoop HDFS using File Browser
13. Review file in Hadoop HDFS using Hue
14. Hadoop Port Numbers
Daemon               Default Port   Configuration Parameter (conf/*-site.xml)
HDFS NameNode        50070          dfs.http.address
DataNodes            50075          dfs.datanode.http.address
SecondaryNameNode    50090          dfs.secondary.http.address
MR JobTracker        50030          mapred.job.tracker.http.address
TaskTrackers         50060          mapred.task.tracker.http.address
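For example, with the default settings above, the NameNode web UI is served at http://localhost:50070/ and the JobTracker UI at http://localhost:50030/.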
15. Removing data from HDFS using Shell Command
[hdadmin@localhost detach]$ hadoop fs -rm input/pg2600.txt
Deleted hdfs://localhost:54310/input/pg2600.txt
[hdadmin@localhost detach]$
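A directory can be removed recursively with the Hadoop 1.x shell, which is useful for clearing a job's output directory before re-running it:
[hdadmin@localhost detach]$ hadoop fs -rmr output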
16. Lecture: Understanding MapReduce Processing
(Diagram: a client submits Map and Reduce work to the Name Node / Job Tracker, which distributes tasks to Task Trackers running alongside the Data Nodes.)
17. High Level Architecture of MapReduce
18. Before MapReduce…
● Large scale data processing was difficult!
– Managing hundreds or thousands of processors
– Managing parallelization and distribution
– I/O Scheduling
– Status and monitoring
– Fault/crash tolerance
● MapReduce provides all of these, easily!
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
19. MapReduce Overview
● What is it?
– Programming model used by Google
– A combination of the Map and Reduce models with an
associated implementation
– Used for processing and generating large data sets
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
20. MapReduce Overview
● How does it solve our previously mentioned problems?
– MapReduce is highly scalable and can be used across many
computers.
– Many small machines can be used to process jobs that
normally could not be processed by a large machine.
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
21. MapReduce Framework
Source: www.bigdatauniversity.com
22. How does MapReduce work?
Map side: each InputSplit is read by a RecordReader; the map output, a list of (Key, Value) pairs, is written to an intermediate file.
Reduce side: after Partitioning and Sorting, the reduce input is a list of (Key, List of Values) pairs, and a RecordWriter writes the final output.
24. Map Abstraction
● Inputs a key/value pair
– Key is a reference to the input value
– Value is the data set on which to operate
● Evaluation
– Function defined by user
– Applies to every value in the input
● Might need to parse input
● Produces a new list of key/value pairs
– Can be different type from input pair
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
25. Reduce Abstraction
● Starts with intermediate Key / Value pairs
● Ends with finalized Key / Value pairs
● Starting pairs are sorted by key
● Iterator supplies the values for a given key to the
Reduce function.
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
26. Reduce Abstraction
● Typically a function that:
– Starts with a large number of key/value pairs
● One key/value pair for each word in all files being grepped
(including multiple entries for the same word)
– Ends with very few key/value pairs
● One key/value for each unique word across all the files with
the number of instances summed into this entry
● Broken up so a given worker works with input of the
same key.
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
27. How Map and Reduce Work Together
● Map returns information
● Reduce accepts information
● Reduce applies a user defined function to reduce the
amount of data
28. Other Applications
● Yahoo!
– Webmap application uses Hadoop to create a database of
information on all known webpages
● Facebook
– Hive data center uses Hadoop to provide business statistics to
application developers and advertisers
● Rackspace
– Analyzes server log files and usage data using Hadoop
29. MapReduce Framework
map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
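For the WordCount program on the following slides these types instantiate as: map: (LongWritable offset, Text line) -> list(Text word, IntWritable 1), and reduce: (Text word, list(IntWritable counts)) -> (Text word, IntWritable sum).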
30. Hands-On: Writing your own MapReduce Program
31. WordCount (HelloWorld in Hadoop)
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }
32. WordCount (HelloWorld in Hadoop)
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
33. WordCount (HelloWorld in Hadoop)
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
34. Hands-On: Packaging MapReduce and Deploying to the Hadoop Runtime Environment
35. Packaging MapReduce Program
Usage
Assuming we have already downloaded wordcount.jar into the folder /Downloads:
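The packaging and run commands appeared only as a screenshot in the original deck; the following is a minimal sketch of compiling WordCount.java, building the jar, and submitting the job (directory names and the use of $(hadoop classpath) are assumptions, not the slide's exact commands):
$ mkdir wordcount_classes
$ javac -classpath $(hadoop classpath) -d wordcount_classes WordCount.java
$ jar -cvf wordcount.jar -C wordcount_classes/ .
$ hadoop fs -rmr output      # old-API jobs fail if the output directory already exists
$ hadoop jar Downloads/wordcount.jar org.myorg.WordCount input output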
36. Reviewing MapReduce Job in Hue
37. Reviewing MapReduce Job in Hue
38. Reviewing MapReduce Output Result
39. Reviewing MapReduce Output Result
40. Reviewing MapReduce Output Result
41. Running a MapReduce Job Using Oozie: Select Java
42. Hands-On: Running WordCount.jar on Amazon EMR
43. Architecture Overview of Amazon EMR
44. Creating an AWS account
45. Signing up for the necessary services
● Simple Storage Service (S3)
● Elastic Compute Cloud (EC2)
● Elastic MapReduce (EMR)
Caution! This costs real money!
46. Creating Amazon S3 bucket
47. Upload .jar file and input file to Amazon S3
1. Select <yourbucket> in Amazon S3 service
2. Create folder: applications
3. Upload wordcount.jar to the applications folder
4. Create another folder: input
5. Upload input_test.txt to the input folder
48. Create a new cluster in EMR
49. Assign a cluster name and log location
50. Assign hardware configuration
51. Select Custom JAR
52. Input JAR Location and Arguments
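The exact values were shown in a screenshot; as a sketch consistent with the earlier S3 steps (the bucket placeholder and main class are assumptions):
JAR location: s3n://<yourbucket>/applications/wordcount.jar
Arguments: org.myorg.WordCount s3n://<yourbucket>/input/input_test.txt s3n://<yourbucket>/output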
53. Select Create Cluster
54. Monitoring the cluster
55. View Result from the S3 bucket
56. Lecture: Understanding Hive
57. Introduction
A Petabyte-Scale Data Warehouse Using Hadoop
Hive was developed by Facebook and is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL.
58. What Hive is NOT
Hive is not designed for online transaction processing and
does not offer real-time queries and row level updates. It is
best used for batch jobs over large sets of immutable data
(like web logs, etc.).
59. Hive Metastore
● Stores Hive metadata
● Configurations
– Embedded: in-process metastore, in-process database
– Local: in-process metastore, out-of-process database
– Remote: out-of-process metastore, out-of-process database
60. Hive Schema-On-Read
● Faster loads into the database (simply copy or move)
● Slower queries
● Flexibility – multiple schemas for the same data
61. HiveQL
● Hive Query Language
● SQL dialect
● No support for:
– UPDATE, DELETE
– Transactions
– Indexes
– HAVING clause in SELECT
– Updateable or materialized views
– Stored procedures
62. Hive Tables
● Managed: CREATE TABLE
– LOAD: file moved into Hive's data warehouse directory
– DROP: both data and metadata are deleted
● External: CREATE EXTERNAL TABLE
– LOAD: no file moved
– DROP: only metadata deleted
– Use when sharing data between Hive and other Hadoop applications, or when you want multiple schemas on the same data (see the sketch below)
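A minimal sketch of the external-table behavior (the table name, schema, and HDFS path here are hypothetical, not from the slides):
hive (default)> CREATE EXTERNAL TABLE ext_country(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/hdadmin/shared/country';
Dropping ext_country removes only its metastore entry; the files under /user/hdadmin/shared/country remain in place.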
63. Running Hive
Hive Shell
● Interactive
hive
● Script
hive -f myscript
● Inline
hive -e 'SELECT * FROM mytable'
Hive.apache.org
64. System Architecture and Components
• Metastore: To store the meta data.
• Query compiler and execution engine: To convert SQL queries to a
sequence of map/reduce jobs that are then executed on Hadoop.
• SerDe and ObjectInspectors: Programmable interfaces and
implementations of common data formats and types.
A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary
representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java
object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system.
• UDF and UDAF: Programmable interfaces and implementations for
user defined functions (scalar and aggregate functions).
• Clients: Command line client similar to the MySQL command line.
hive.apache.org
65. Architecture Overview
(Diagram: clients such as the Hive CLI, a management Web UI, and the Thrift API issue browsing, DDL, and HiveQL queries; a Parser and Planner compile HiveQL, the Execution engine runs the resulting Map Reduce jobs, SerDe handles formats such as Thrift, Jute, and JSON, the MetaStore holds table metadata, and the data resides in HDFS.)
hive.apache.org
66. Sample HiveQL
The query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following queries:
SELECT * FROM t where t.c = 'xyz'
SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)
SELECT t1.c1, count(1) from t1 group by t1.c1
Hive.apache.org
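To inspect the map/reduce plan Hive generates for such a statement, the standard EXPLAIN command can be used:
hive> EXPLAIN SELECT t1.c1, count(1) FROM t1 GROUP BY t1.c1;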
67. Hands-On: Creating Table and Retrieving Data using Hive
68. Running Hive from terminal
Starting Hive
$ hive
Quit from Hive
hive> quit;
69. Starting Hive Editor from Hue (scroll down the web page)
70. Creating Hive Table
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 4.069 seconds
hive (default)> show tables;
OK
test_tbl
Time taken: 0.138 seconds
hive (default)> describe test_tbl;
OK
id int
country string
Time taken: 0.147 seconds
hive (default)>
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
71. Using Hue Query Editor
72. Using Hue Query Editor
73. Using Hue Query Editor
74. Reviewing Hive Table in HDFS
75. Alter and Drop Hive Table
hive (default)> alter table test_tbl add columns (remarks STRING);
hive (default)> describe test_tbl;
OK
id int
country string
remarks string
Time taken: 0.077 seconds
hive (default)> drop table test_tbl;
OK
Time taken: 0.9 seconds
See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
76. Loading Data to Hive Table
Creating Hive table
$ hive
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Loading data to Hive table
hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data.csv' INTO TABLE test_tbl;
Copying data from file:/tmp/test_tbl_data.csv
Copying file: file:/tmp/test_tbl_data.csv
Loading data to table default.test_tbl
OK
Time taken: 0.241 seconds
hive (default)>
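For reference, a /tmp/test_tbl_data.csv consistent with the query results on the next slide would contain:
1,USA
62,Indonesia
63,Philippines
65,Singapore
66,Thailand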
77. Querying Data from Hive Table
hive (default)> select * from test_tbl;
OK
1 USA
62 Indonesia
63 Philippines
65 Singapore
66 Thailand
Time taken: 0.287 seconds
hive (default)>
78. Insert Overwriting the Hive Table
hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data_updated.csv' overwrite INTO TABLE test_tbl;
Copying data from file:/tmp/test_tbl_data_updated.csv
Copying file: file:/tmp/test_tbl_data_updated.csv
Loading data to table default.test_tbl
Deleted hdfs://localhost:54310/user/hive/warehouse/test_tbl
OK
Time taken: 0.204 seconds
hive (default)>
79. Lecture: Understanding Pig
80. Introduction
A high-level platform for creating MapReduce programs using Hadoop
Pig is a platform for analyzing large data sets that consists of
a high-level language for expressing data analysis programs,
coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is
amenable to substantial parallelization, which in turn enables
them to handle very large data sets.
81. Pig Components
● Two Components
● Language (Pig Latin)
● Compiler
● Two Execution Environments
● Local
pig -x local
● Distributed
pig -x mapreduce
82. Running Pig
● Script
pig myscript
● Command line (Grunt)
pig
● Embedded
Writing a java program
83. Pig Latin
84. Pig Execution Stages
Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi
85. Why Pig?
● Makes writing Hadoop jobs easier
● 5% of the code, 5% of the time
● You don't need to be a programmer to write Pig scripts
● Provides the major functionality required for data warehousing and analytics
● Load, Filter, Join, Group By, Order, Transform
● User can write custom UDFs (User Defined Function)
Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi
87. Hands-On: Running a Pig script
88. Starting Pig Command Line
[hdadmin@localhost ~]$ pig -x local
2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53
2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hdadmin/pig_1375327740024.log
2013-08-01 10:29:00,066 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hdadmin/.pigbootup not found
2013-08-01 10:29:00,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>
89. Starting Pig from Hue
90. Writing a Pig Script
#Preparing Data
Download hdi-data.csv
#Edit Your Script
[hdadmin@localhost ~]$ cd Downloads/
[hdadmin@localhost ~]$ vi countryFilter.pig
countryFilter.pig
A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
dump C;
91. Running a Pig Script
[hdadmin@localhost ~]$ cd Downloads
[hdadmin@localhost ~]$ pig -x local
grunt > run countryFilter.pig
....
(150,Cameroon,0.482,51,5,10,2031)
(126,Kyrgyzstan,0.615,67,9,12,2036)
(156,Nigeria,0.459,51,5,8,2069)
(154,Yemen,0.462,65,2,8,2213)
(138,Lao People's Democratic Republic,0.524,67,4,9,2242)
(153,Papua New Guinea,0.466,62,4,5,2271)
(165,Djibouti,0.43,57,3,5,2335)
(129,Nicaragua,0.589,74,5,10,2430)
(145,Pakistan,0.504,65,4,6,2550)
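Alternatively, the script can be run non-interactively straight from the shell (same directory assumptions as above):
[hdadmin@localhost Downloads]$ pig -x local countryFilter.pig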
92. Lecture: Understanding Sqoop
93. Introduction
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line
tool with the following capabilities:
• Imports individual tables or entire databases to files in
HDFS
• Generates Java classes to allow you to interact with your
imported data
• Provides the ability to import from SQL databases straight
into your Hive data warehouse
See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
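As a minimal sketch of the first capability, an import into plain HDFS files rather than Hive (the connection string and table mirror the hands-on later in this deck; --target-dir and -m are standard Sqoop options):
$ sqoop import --connect jdbc:mysql://localhost/countrydb --username root -P --table country_tbl --target-dir country_tbl_hdfs -m 1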
94. Architecture Overview
95. Hands-On: Loading Data from DBMS to Hadoop HDFS
96. Loading Data into MySQL DB
[root@localhost ~]# service mysqld start
Starting mysqld: [ OK ]
[root@localhost ~]# mysql
mysql> create database countrydb;
Query OK, 1 row affected (0.00 sec)
mysql> use countrydb;
Database changed
mysql> create table country_tbl(id INT, country VARCHAR(100));
Query OK, 0 rows affected (0.02 sec)
mysql> LOAD DATA LOCAL INFILE '/tmp/country_data.csv' INTO TABLE country_tbl FIELDS terminated by ',' LINES TERMINATED BY '\r\n';
Query OK, 237 rows affected, 4 warnings (0.00 sec)
Records: 237 Deleted: 0 Skipped: 0 Warnings: 0
97. Loading Data into MySQL DB
Testing data query from MySQL DB
mysql> select * from country_tbl;
+------+------------------------------+
| id | country |
+------+------------------------------+
| 93 | Afghanistan |
| 355 | Albania |
| 213 | Algeria |
| 1684 | AmericanSamoa |
| 376 | Andorra |
...
+------+------------------------------+
237 rows in set (0.00 sec)
mysql>
98. Installing DB driver for Sqoop
[hdadmin@localhost conf]$ su -
Password:
[root@localhost ~]# cp /01_hadoop_workshop/03_sqoop1.4.2/mysql-connector-java-5.0.8-bin.jar /usr/local/sqoop-1.4.2.bin__hadoop-0.20/lib/
[root@localhost ~]# chown hdadmin:hadoop /usr/local/sqoop-1.4.2.bin__hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar
[root@localhost ~]# exit
99. Importing data from MySQL to Hive Table
[hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://localhost/countrydb --username root -P --table country_tbl --hive-import --hive-table country_tbl -m 1
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
Enter password: <enter here>
13/03/21 18:07:43 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
13/03/21 18:07:43 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
13/03/21 18:07:43 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
13/03/21 18:07:43 INFO tool.CodeGenTool: Beginning code generation
13/03/21 18:07:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1
13/03/21 18:07:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1
13/03/21 18:07:44 INFO orm.CompilationManager: HADOOP_HOME is /usr/local/hadoop/libexec/..
Note: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/03/21 18:07:44 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.jar
13/03/21 18:07:44 WARN manager.MySQLManager: It looks like you are importing from mysql.
13/03/21 18:07:44 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
13/03/21 18:07:44 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
13/03/21 18:07:44 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
13/03/21 18:07:44 INFO mapreduce.ImportJobBase: Beginning import of country_tbl
13/03/21 18:07:45 INFO mapred.JobClient: Running job: job_201303211744_0001
13/03/21 18:07:46 INFO mapred.JobClient: map 0% reduce 0%
13/03/21 18:08:02 INFO mapred.JobClient: map 100% reduce 0%
13/03/21 18:08:07 INFO mapred.JobClient: Job complete: job_201303211744_0001
13/03/21 18:08:07 INFO mapred.JobClient: Counters: 18
13/03/21 18:08:07 INFO mapred.JobClient: Job Counters
13/03/21 18:08:07 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12154
13/03/21 18:08:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/21 18:08:07 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/03/21 18:08:07 INFO mapred.JobClient: Launched map tasks=1
100. Reviewing data from Hive Table
[hdadmin@localhost ~]$ hive
Logging initialized using configuration in file:/usr/local/hive-0.9.0-bin/conf/hive-log4j.properties
Hive history file=/tmp/hdadmin/hive_job_log_hdadmin_201303211810_964909984.txt
hive (default)> show tables;
OK
country_tbl
test_tbl
Time taken: 2.566 seconds
hive (default)> select * from country_tbl;
OK
93 Afghanistan
355 Albania
.....
Time taken: 0.587 seconds
hive (default)> quit;
[hdadmin@localhost ~]$
101. Reviewing HDFS Database Table files
Open a web browser at http://localhost:50070/ and then navigate to /user/hive/warehouse
102. Lecture: Understanding HBase
103. Introduction
An open source, non-relational, distributed database
HBase is an open source, non-relational, distributed database
modeled after Google's BigTable and is written in Java. It is
developed as part of Apache Software Foundation's Apache
Hadoop project and runs on top of HDFS, providing
BigTable-like capabilities for Hadoop. That is, it provides a
fault-tolerant way of storing large quantities of sparse data.
104. HBase Features
● Hadoop database modelled after Google's Bigtable
● Column-oriented data store, known as the Hadoop Database
● Supports random, realtime CRUD operations (unlike HDFS)
● NoSQL database
● Open source, written in Java
● Runs on a cluster of commodity hardware
105. When to use HBase?
● When you need high volume data to be stored
● Un-structured data
● Sparse data
● Column-oriented data
● Versioned data (same data template, captured at various times; time-lapse data)
● When you need high scalability
106. Which one to use?
● HDFS
● Append-only dataset (no random write)
● Read the whole dataset (no random read)
● HBase
● Need random write and/or read
● Has thousands of operations per second on TB+ of data
● RDBMS
● Data fits on one big node
● Need full transaction support
● Need real-time query capabilities
109. HBase Components
● Region
● Rows of a table are stored
● Region Server
● Hosts the tables
● Master
● Coordinates the Region Servers
● ZooKeeper
● HDFS
● API
● The Java Client API
110. HBase Shell Commands
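The command list itself was an image in the original deck; a representative subset of standard HBase shell commands (not necessarily the slide's exact list):
list                                  # list tables
create 'tbl', 'cf'                    # create a table with one column family
put 'tbl', 'row1', 'cf:q', 'value'    # write a cell
get 'tbl', 'row1'                     # read one row
scan 'tbl'                            # read all rows
describe 'tbl'                        # show table schema
disable 'tbl'                         # required before drop
drop 'tbl'                            # delete the table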
111. Hands-On: Running HBase
112. Starting HBase shell
[hdadmin@localhost ~]$ start-hbase.sh
starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out
[hdadmin@localhost ~]$ jps
3064 TaskTracker
2836 SecondaryNameNode
2588 NameNode
3513 Jps
3327 HMaster
2938 JobTracker
2707 DataNode
[hdadmin@localhost ~]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013
hbase(main):001:0>
113. Create a table and insert data in HBase
hbase(main):009:0> create 'test', 'cf'
0 row(s) in 1.0830 seconds
hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1'
0 row(s) in 0.0750 seconds
hbase(main):011:0> scan 'test'
ROW                 COLUMN+CELL
 row1               column=cf:a, timestamp=1375363287644, value=val1
1 row(s) in 0.0640 seconds
hbase(main):002:0> get 'test', 'row1'
COLUMN CELL
cf:a timestamp=1375363287644, value=val1
1 row(s) in 0.0370 seconds
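To clean up afterwards (standard HBase shell; a table must be disabled before it can be dropped):
hbase(main):003:0> disable 'test'
hbase(main):004:0> drop 'test'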
114. Using Data Browsers in Hue for HBase
115. Using Data Browsers in Hue for HBase
116. Recommendation to Further Study
117. Thank you
www.imcinstitute.com
www.facebook.com/imcinstitute