Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 1
Big Data using Hadoop
Hands On Workshop
March 2015
Dr.Thanachart Numnonda
Certified Java Programmer
thanachart@imcinstitute.com
Danairat T.
Certified Java Programmer, TOGAF – Silver
danairat@gmail.com, +66-81-559-1446
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Launch a virtual server
on EC2 Amazon Web Services
Hadoop Installation
Hadoop provides three installation choices:
1. Local mode: An unzip-and-run mode that gets you started right away; all parts of Hadoop run within the same JVM
2. Pseudo-distributed mode: The different parts of Hadoop run as separate Java processes, but on a single machine
3. Distributed mode: The real setup, which spans multiple machines
Virtual Server
This lab uses an EC2 virtual server to install a Hadoop server with the following features:
● Ubuntu Server 14.04 LTS
● m3.medium: 1 vCPU, 3.75 GB memory
● Security group: default
● Keypair: imchadoop
Select the EC2 service and click Launch Instance
Select an Amazon Machine Image (AMI): Ubuntu Server 14.04 LTS (PV)
Choose the m3.medium instance type
Leave configuration details as default
Add Storage: 20 GB
Name the instance
Select an existing security group > Select Security
Group Name: default
Click Launch and choose imchadoop as a key pair
Review the instance, then click Connect for instructions on connecting to it
Connect to an instance from Mac/Linux
Connect to an instance from Windows using Putty
Connect to the instance
Hands-On: Installing Hadoop
Installing Hadoop and Ecosystem
1. Update the system
2. Configuring SSH
3. Installing JDK 1.7
4. Download/Extract Hadoop
5. Installing Hadoop
6. Configure xml files
7. Formatting HDFS
8. Start Hadoop
9. Hadoop Web Console
10. Stop Hadoop
Note: Hadoop and IPv6. Apache Hadoop is not currently supported on IPv6 networks. It has only been developed and tested on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Source: http://wiki.apache.org/hadoop/HadoopIPv6
1) Update the system: sudo apt-get update
2) Configure SSH: ssh-keygen
Enabling SSH access to your local machine
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Testing the SSH setup by connecting to your local machine
$ ssh 54.68.149.232
Type Exit
$ exit
3) Install JDK 1.7: sudo apt-get install openjdk-7-jdk
(Enter Y when prompted)
Type command > java -version
4) Download/Extract Hadoop
1) Type command > wget http://mirror.issp.co.th/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
2) Type command > tar -xvzf hadoop-1.2.1.tar.gz
3) Type command > sudo mv hadoop-1.2.1 /usr/local/hadoop
5) Installing Hadoop
1) Type command > sudo vi $HOME/.bashrc
2) Add the configuration as shown in the figure
3) Type command > exec bash
4) Type command > sudo vi /usr/local/hadoop/conf/hadoop-env.sh
5) Edit the file as shown in the figure
6) Configuring Hadoop conf/*-site.xml
1. core-site.xml (hadoop.tmp.dir, fs.default.name)
2. hdfs-site.xml (dfs.replication)
3. mapred-site.xml (mapred.job.tracker)
Configuring core-site.xml
1) Type command > sudo vi /usr/local/hadoop/conf/core-site.xml
2) Add the private IP of the server as shown in the figure
(in this case the private IP is 172.31.12.11)
Configuring mapred-site.xml
1) Type command > sudo vi /usr/local/hadoop/conf/mapred-site.xml
2) Add the private IP of the JobTracker server as shown in the figure
(in this case the private IP is 172.31.12.11)
Configuring hdfs-site.xml
1) Type command > sudo vi /usr/local/hadoop/conf/hdfs-site.xml
2) Add the configuration as shown in the figure
7) Formatting HDFS
1) Type command > sudo mkdir /usr/local/hadoop/tmp
2) Type command > sudo chown ubuntu /usr/local/hadoop
3) Type command > sudo chown ubuntu /usr/local/hadoop/tmp
4) Type command > hadoop namenode -format
Starting Hadoop
ubuntu@ip-172-31-12-11:~$ start-all.sh
This starts a NameNode, DataNode, JobTracker and TaskTracker on your machine.
Check the Java processes with jps:
ubuntu@ip-172-31-12-11:~$ jps
11567 Jps
10766 NameNode
11099 JobTracker
11221 TaskTracker
10899 DataNode
11018 SecondaryNameNode
ubuntu@ip-172-31-12-11:~$
You are now running Hadoop in pseudo-distributed mode.
Hadoop is up!
Viewing the Hadoop HDFS using WebUI
http://54.68.149.232:50070/
Stopping Hadoop
ubuntu@ip-172-31-12-11:~$ /usr/local/hadoop/bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
Hands-On: Importing Data to HDFS
using Hadoop Command Line
Importing Data to Hadoop
Download War and Peace Full Text
www.gutenberg.org/ebooks/2600
Download the file pg2600.txt
$ wget https://dl.dropboxusercontent.com/u/12655380/pg2600.txt
Import to Hadoop
$ hadoop fs -mkdir /input
$ hadoop fs -mkdir /output
$ hadoop fs -copyFromLocal pg2600.txt /input
Hands-On: Reviewing, Retrieving,
Deleting Data from HDFS
Review file in Hadoop HDFS
List HDFS File
Read HDFS File
ubuntu@ip-172-31-12-11:~$ hadoop fs -cat /input/pg2600.txt
Retrieve HDFS File to Local File System
ubuntu@ip-172-31-12-11:~$ hadoop fs -copyToLocal /input/pg2600.txt /tmp/file.txt
Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
Review file in Hadoop HDFS using WebUI
Hadoop Port Numbers
      Daemon              Default Port   Configuration Parameter in conf/*-site.xml
HDFS  Namenode            50070          dfs.http.address
HDFS  Datanodes           50075          dfs.datanode.http.address
HDFS  Secondarynamenode   50090          dfs.secondary.http.address
MR    JobTracker          50030          mapred.job.tracker.http.address
MR    Tasktrackers        50060          mapred.task.tracker.http.address
Review Content from System shell
Removing data from HDFS using Shell Command
[hdadmin@localhost detach]$ hadoop dfs -rm /input/input_test.txt
Deleted hdfs://localhost:54310/input/input_test.txt
[hdadmin@localhost detach]$
Lecture: Understanding Map Reduce Processing
[Figure: a Client, the Name Node and Job Tracker, and three Data Node / Task Tracker pairs running Map Reduce]
High Level Architecture of MapReduce
Before MapReduce…
● Large scale data processing was difficult!
  – Managing hundreds or thousands of processors
  – Managing parallelization and distribution
  – I/O Scheduling
  – Status and monitoring
  – Fault/crash tolerance
● MapReduce provides all of these, easily!
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
MapReduce Overview
● What is it?
  – Programming model used by Google
  – A combination of the Map and Reduce models with an associated implementation
  – Used for processing and generating large data sets
● How does it solve our previously mentioned problems?
  – MapReduce is highly scalable and can be used across many computers.
  – Many small machines can be used to process jobs that normally could not be processed by a large machine.
MapReduce Framework
Source: www.bigdatauniversity.com
How Map and Reduce Work Together
● Map returns information
● Reduce accepts information
● Reduce applies a user-defined function to reduce the amount of data
Map Abstraction
● Inputs a key/value pair
  – Key is a reference to the input value
  – Value is the data set on which to operate
● Evaluation
  – Function defined by user
  – Applies to every value in value input
● Might need to parse input
● Produces a new list of key/value pairs
  – Can be different type from input pair
Reduce Abstraction
● Starts with intermediate Key / Value pairs
● Ends with finalized Key / Value pairs
● Starting pairs are sorted by key
● Iterator supplies the values for a given key to the Reduce function.
● Typically a function that:
  – Starts with a large number of key/value pairs
    ● One key/value for each word in all files being grepped (including multiple entries for the same word)
  – Ends with very few key/value pairs
    ● One key/value for each unique word across all the files with the number of instances summed into this entry
● Broken up so a given worker works with input of the same key.
Other Applications
● Yahoo!
  – Webmap application uses Hadoop to create a database of information on all known webpages
● Facebook
  – Hive data center uses Hadoop to provide business statistics to application developers and advertisers
● Rackspace
  – Analyzes server log files and usage data using Hadoop
Why is this approach better?
● Creates an abstraction for dealing with complex overhead
  – The computations are simple, the overhead is messy
● Removing the overhead makes programs much smaller and thus easier to use
  – Less testing is required as well. The MapReduce libraries can be assumed to work properly, so only user code needs to be tested
● Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation
MapReduce Framework
map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
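The two signatures can be sketched in plain Java with no Hadoop dependency. This is only an illustration: the names SignatureSketch, MapFn, ReduceFn, TOKENIZE and SUM are hypothetical, and the word-count functions stand in for user code.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map.Entry;

public class SignatureSketch {
    // map: (K1, V1) -> list(K2, V2)
    interface MapFn<K1, V1, K2, V2> {
        List<Entry<K2, V2>> map(K1 key, V1 value);
    }

    // reduce: (K2, list(V2)) -> list(K3, V3)
    interface ReduceFn<K2, V2, K3, V3> {
        List<Entry<K3, V3>> reduce(K2 key, List<V2> values);
    }

    // Word count map: K1 = line offset, V1 = line text, K2 = word, V2 = 1.
    static final MapFn<Long, String, String, Integer> TOKENIZE = (offset, line) -> {
        List<Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    };

    // Word count reduce: K3 = word, V3 = total number of occurrences.
    static final ReduceFn<String, Integer, String, Integer> SUM = (word, counts) -> {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        List<Entry<String, Integer>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(word, sum));
        return out;
    };

    public static void main(String[] args) {
        System.out.println(TOKENIZE.map(0L, "Deer Bear River"));
        System.out.println(SUM.reduce("Car", Arrays.asList(1, 1, 1)));
    }
}
```

Note that the key and value types may change at each stage, which is why the signatures use three separate pairs of type parameters.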
How does the MapReduce work?
[Figure: data flow from InputSplit and RecordReader through the map phase (output in a list of (Key, Value) in the intermediate file), then Partitioning and Sorting (output in a list of (Key, List of Values) in the intermediate file), and through the reduce phase to RecordWriter]
How does the MapReduce work?
[Figure: the same data flow with a Combining step added alongside Partitioning and Sorting. Example values: the combiner emits (Car, 2); the reducers receive Bear, {1,1}; Car, {2,1}; River, {1,1}; Deer, {1,1}]
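The word-count flow on this slide (combining, then partitioning and sorting, then reducing) can be imitated in plain Java. This is a single-process sketch, not Hadoop code: the class ShuffleSketch and its methods are hypothetical, and the three input lines are an assumption chosen to reproduce the slide's values Bear {1,1}, Car {2,1}, River {1,1}, Deer {1,1}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
    // Map + combine for one input split: count each word locally,
    // as a combiner would do on the map side.
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                combined.merge(word, 1, Integer::sum);
            }
        }
        return combined;
    }

    // Shuffle and sort: group the combined values by key, keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : mapOutputs) {
            for (Map.Entry<String, Integer> e : output.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum the list of values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Assumed input: one line per split.
        String[] splits = {"Deer Bear River", "Car Car River", "Deer Car Bear"};
        List<Map<String, Integer>> mapOutputs = new ArrayList<>();
        for (String s : splits) {
            mapOutputs.add(mapAndCombine(s));
        }
        Map<String, List<Integer>> grouped = shuffle(mapOutputs);
        System.out.println(grouped); // {Bear=[1, 1], Car=[2, 1], Deer=[1, 1], River=[1, 1]}
        System.out.println(reduce(grouped)); // {Bear=2, Car=3, Deer=2, River=2}
    }
}
```

Because the second split contains Car twice, the combiner collapses it to (Car, 2) before the shuffle, so the reducer for Car sees {2,1} instead of {1,1,1}.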
MapReduce Processing – The Data Flow
1. InputFormat, InputSplits, RecordReader
2. Mapper - your focus is here
3. Partition, Shuffle & Sort
4. Reducer - your focus is here
5. OutputFormat, RecordWriter
InputFormat
InputFormat               Description                                        Key                                        Value
TextInputFormat           Default format; reads lines of text files          The byte offset of the line                The line contents
KeyValueTextInputFormat   Parses lines into key, val pairs                   Everything up to the first tab character   The remainder of the line
SequenceFileInputFormat   A Hadoop-specific high-performance binary format   user-defined                               user-defined
InputSplit
An InputSplit describes a unit of work that comprises a single map task.
InputSplit presents a byte-oriented view of the input.
You can control the split size by setting the mapred.min.split.size parameter in core-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job.
RecordReader
RecordReader reads <key, value> pairs from an InputSplit.
Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented view to the Mapper.
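The step from a byte-oriented view to a record-oriented view can be sketched in plain Java. This is not Hadoop's RecordReader: the class LineRecordSketch and its Record type are hypothetical, and the sketch ignores lines that straddle a split boundary. It only shows how raw bytes become (byte offset, line) pairs of the kind TextInputFormat hands to the Mapper.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineRecordSketch {
    // One record: the byte offset of the line (key) and the line contents (value).
    static final class Record {
        final long offset;
        final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    // Walk the raw bytes and cut a record at every '\n'.
    // A final line without a trailing newline is still emitted.
    static List<Record> readRecords(byte[] data) {
        List<Record> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = (i == data.length);
            if (atEnd ? start < i : data[i] == '\n') {
                String line = new String(data, start, i - start, StandardCharsets.UTF_8);
                records.add(new Record(start, line));
                start = i + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "Deer Bear River\nCar Car River\n".getBytes(StandardCharsets.UTF_8);
        for (Record r : readRecords(data)) {
            System.out.println(r.offset + "\t" + r.line);
        }
    }
}
```

For this sample input the second record starts at byte offset 16, because the first line occupies 15 bytes plus the newline.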
Mapper
Mapper: The Mapper applies the user-defined logic to each input (key, value) pair and emits (key, value) pair(s), which are forwarded to the Reducers.
Partition, Shuffle & Sort
After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers.
The Partitioner controls how map outputs are assigned to reduce tasks. The total number of partitions is the same as the number of reduce tasks for the job.
The set of intermediate keys on a single node is automatically sorted by Hadoop before it is presented to the Reducer.
This process of moving map outputs to the reducers is known as shuffling.
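The partitioning step can be sketched as a hash of the key taken modulo the number of reduce tasks, which is the arithmetic Hadoop's default HashPartitioner applies. The class PartitionSketch is hypothetical, and it hashes plain Java strings; Hadoop hashes the Writable key (for example Text), so the actual partition numbers on a cluster would differ.

```java
public class PartitionSketch {
    // Clear the sign bit so the result is never negative,
    // then take the remainder by the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3; // assumed number of reduce tasks
        for (String key : new String[] {"Bear", "Car", "Deer", "River", "Car"}) {
            System.out.println(key + " -> reducer " + getPartition(key, reducers));
        }
    }
}
```

Whatever the hash values are, equal keys always land in the same partition, so all values for a given key reach the same reducer.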
Reducer
The Reducer is an instance of user-provided code that reads each key and the iterator over its values in the assigned partition. The OutputCollector object in the Reducer phase has a method named collect(), which collects a (key, value) output.
OutputFormat, RecordWriter
OutputFormat governs the format in which the OutputCollector output is written, and the RecordWriter writes the output into HDFS.
OutputFormat               Description
TextOutputFormat           Default; writes lines in "key \t value" form
SequenceFileOutputFormat   Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat           Generates no output files
Hands-On: Writing your own MapReduce Program
WordCount (Hello World in Hadoop)
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token in a line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the HDFS input path, args[1] the HDFS output path.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
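One consequence of the mapper's tokenization is worth seeing in isolation. The sketch below uses plain Java with no Hadoop classes; the class TokenizeSketch and the sample sentence are made up for illustration. It counts tokens the same way the map method splits a line, with StringTokenizer and its default whitespace delimiters.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class TokenizeSketch {
    // Split on whitespace only (space, tab, newline, carriage return, form feed)
    // and count each token, mirroring the WordCount mapper's loop.
    static Map<String, Integer> countTokens(String line) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            counts.merge(tokenizer.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample line; punctuation stays attached to the tokens.
        System.out.println(countTokens("War and Peace, and peace."));
    }
}
```

Because punctuation and case are not normalized, "Peace," and "peace." are counted as two different words. The same effect should be expected in the WordCount output for pg2600.txt.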
Hands-On: Packaging Map Reduce
and Deploying to Hadoop Runtime
Environment
Packaging Map Reduce Program
Usage
With Hadoop 1.2.1 installed in /usr/local/hadoop, compile WordCount.java and create a jar:
$ wget https://dl.dropboxusercontent.com/u/12655380/WordCount.java
$ mkdir hduser
$ javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d hduser WordCount.java
$ jar -cvf ./wordcount.jar -C hduser/ .
$ hadoop jar ./wordcount.jar org.myorg.WordCount /input/* /output/wordcount_output_dir
Output:
…….
$ hadoop fs -cat /output/wordcount_output_dir/part-00000
Reviewing MapReduce Output Result
Hands-On: Writing Map/Reduce
Program on Eclipse
Starting Eclipse
Create a Java Project
Let's name it HadoopWordCount
Add dependencies to the project
● Add the following two JARs to your build path:
  hadoop-common.jar and hadoop-mapreduce-client-core.jar. Both can be found in /usr/lib/hadoop/client
● Perform the following steps:
  – Add a folder named lib to the project
  – Copy the two JARs into this folder
  – Right-click the project name >> select Build Path >> then Configure Build Path
  – Click Add JARs and select the two JARs from the lib folder
Writing a source code
● Right-click the project, then select New >> Package
● Name the package org.myorg
● Right-click org.myorg, then select New >> Class
● Name the class WordCount
● Write the source code as shown in the previous slides
Building a Jar file
● Right-click the project, then select Export
● Select Java and then JAR file
● Provide the JAR name, wordcount.jar
● Leave the JAR package options as default
● In the JAR Manifest Specification section, at the bottom, specify the Main class
● In this case, select WordCount
● Click Finish
● The JAR file will be built and placed in cloudera/workspace
Note: you may need to resize the dialog font by selecting Window >> Preferences >> Appearance >> Colors and Fonts
Lecture: Understanding Hive
Introduction
A Petabyte Scale Data Warehouse Using Hadoop
Hive was developed at Facebook to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL.
What Hive is NOT
Hive is not designed for online transaction processing and
does not offer real-time queries and row level updates. It is
best used for batch jobs over large sets of immutable data
(like web logs, etc.).
Hive Metastore
● Stores Hive metadata
● Configurations
  – Embedded: in-process metastore, in-process database
  – Local: in-process metastore, out-of-process database
  – Remote: out-of-process metastore, out-of-process database
Hive Schema-On-Read
● Faster loads into the database (simply copy or move)
● Slower queries
● Flexibility – multiple schemas for the same data
HiveQL
● Hive Query Language
● SQL dialect
● No support for:
  – UPDATE, DELETE
  – Transactions
  – Indexes
  – HAVING clause in SELECT
  – Updateable or materialized views
  – Stored procedures
Hive Tables
● Managed: CREATE TABLE
  – LOAD: File moved into Hive's data warehouse directory
  – DROP: Both data and metadata are deleted
● External: CREATE EXTERNAL TABLE
  – LOAD: No file moved
  – DROP: Only metadata deleted
  – Use when sharing data between Hive and Hadoop applications, or when you want to use multiple schemas on the same data
Running Hive
Hive Shell
● Interactive
  hive
● Script
  hive -f myscript
● Inline
  hive -e 'SELECT * FROM mytable'
Source: hive.apache.org
System Architecture and Components
• Metastore: To store the metadata.
• Query compiler and execution engine: To convert SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop.
• SerDe and ObjectInspectors: Programmable interfaces and implementations of common data formats and types.
  A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system.
• UDF and UDAF: Programmable interfaces and implementations for user defined functions (scalar and aggregate functions).
• Clients: Command line client similar to the MySQL command line.
Source: hive.apache.org
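The serialize/deserialize round trip described for a SerDe can be sketched in plain Java. This is not Hive's SerDe API: the interface RowSerDe and the class CsvRowSerDe are hypothetical stand-ins, and the sample row is made up. The sketch assumes two-column rows (an integer id and a string) stored as comma-delimited text.

```java
import java.util.Arrays;
import java.util.List;

public class SerDeSketch {
    // Hypothetical stand-in for a SerDe; Hive's real interfaces differ.
    interface RowSerDe {
        List<Object> deserialize(String record); // stored text -> Java objects
        String serialize(List<Object> row);      // Java objects -> stored text
    }

    // Handles rows shaped like (id INT, country STRING), fields separated by ','.
    static final class CsvRowSerDe implements RowSerDe {
        @Override
        public List<Object> deserialize(String record) {
            String[] fields = record.split(",", -1);
            return Arrays.<Object>asList(Integer.valueOf(fields[0].trim()), fields[1]);
        }

        @Override
        public String serialize(List<Object> row) {
            return row.get(0) + "," + row.get(1);
        }
    }

    public static void main(String[] args) {
        RowSerDe serde = new CsvRowSerDe();
        List<Object> row = serde.deserialize("1,Thailand"); // made-up sample record
        System.out.println(row);
        System.out.println(serde.serialize(row));
    }
}
```

The deserializer gives the query engine typed values to work with (the id becomes an Integer), while the serializer turns a row back into the stored text form.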
Architecture Overview
[Figure: Hive architecture. Labels: Hive CLI, Mgmt. WebUI, Thrift API; DDL, Queries, Browsing; Hive QL (Parser, Planner, Execution); MetaStore; SerDe (Thrift, Jute, JSON..); Map Reduce; HDFS]
Source: hive.apache.org
Sample HiveQL
The query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following queries:
SELECT * FROM t WHERE t.c = 'xyz'
SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)
SELECT t1.c1, count(1) FROM t1 GROUP BY t1.c1
Source: hive.apache.org
Hands-On: Creating Table and
Retrieving Data using Hive
Hive Hands-On Labs
1. Installing Hive
2. Configuring / Starting Hive
3. Creating Hive Table
4. Reviewing Hive Table in HDFS
5. Alter and Drop Hive Table
6. Preparing Dataset
7. Loading Data to Hive Table
8. Querying Data from Hive Table
9. Reviewing Hive Table Content from HDFS Command and WebUI
1. Installing Hive
Install Hive binary file
# wget http://apache.mesi.com.ar/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz
# tar -xvzf apache-hive-1.1.0-bin.tar.gz
# sudo mv apache-hive-1.1.0-bin /usr/local
# rm apache-hive-1.1.0-bin.tar.gz
1. Installing Hive
Edit $HOME/.bashrc
# sudo vi $HOME/.bashrc
2. Configuring Hive
Creating HDFS Directory for Hive
Create the HDFS /tmp/hive and /user/hive/warehouse directories
[hdadmin@localhost ~]$ hadoop fs -mkdir /tmp/hive
[hdadmin@localhost ~]$ hadoop fs -mkdir /user/hive/warehouse
[hdadmin@localhost ~]$ hadoop fs -chmod 777 /tmp/hive
[hdadmin@localhost ~]$ hadoop fs -chmod 777 /user/hive/warehouse
2. Start Hive
Starting Hive
hive> quit;
Quit from Hive
3. Creating Hive Table
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 4.069 seconds
hive (default)> show tables;
OK
test_tbl
Time taken: 0.138 seconds
hive (default)> describe test_tbl;
OK
id int
country string
Time taken: 0.147 seconds
hive (default)>
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
4. Reviewing Hive Table in HDFS
[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x - hdadmin supergroup 0 2013-03-17 17:51 /user/hive/warehouse/test_tbl
[hdadmin@localhost hdadmin]$
Review Hive Table from
HDFS WebUI
5. Alter and Drop Hive Table
hive (default)> alter table test_tbl add columns (remarks STRING);
hive (default)> describe test_tbl;
OK
id int
country string
remarks string
Time taken: 0.077 seconds
hive (default)> drop table test_tbl;
OK
Time taken: 0.9 seconds
See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
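A hedged sketch of what ADD COLUMNS implies for rows already stored as delimited text: the data files are not rewritten, so an existing row simply has no value for the new column and a reader reports it as NULL. The helper read_row below is hypothetical and written in plain Python; it only imitates that behaviour and leaves every value as a string.

```python
def read_row(line, columns, delimiter=","):
    """Split a delimited line and pad missing trailing columns with None (NULL)."""
    values = line.rstrip("\n").split(delimiter)
    # Rows written before the column was added have fewer fields than the schema.
    values += [None] * (len(columns) - len(values))
    return dict(zip(columns, values))

# Schema after: alter table test_tbl add columns (remarks STRING)
columns = ["id", "country", "remarks"]
old_row = read_row("66,Thailand", columns)
print(old_row)
```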
6. Preparing Large Dataset
http://grouplens.org/datasets/movielens/
MovieLens Dataset
1) Type command > wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
2) Type command > sudo apt-get install unzip
3) Type command > unzip ml-100k.zip
4) Type command > more ml-100k/u.user
6. Loading Data to Hive Table
hive (default)> exit;
ubuntu@ip-172-31-12-11:~/ml-100k$ hadoop fs -put u.user /dataset/movielens/users
Loading data to Hive table
$ hive
hive (default)> CREATE EXTERNAL TABLE users (userid INT, age INT,
gender STRING, occupation STRING, zipcode STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '/dataset/movielens/users';
Creating Hive table
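To make the column mapping of the users table concrete, here is a small sketch that splits one pipe-delimited line into the five declared columns. It is plain Python rather than Hive, the User type and parse_user function are made up for this example, and the sample line is only assumed to follow the u.user layout.

```python
from typing import NamedTuple

class User(NamedTuple):
    """Mirrors the columns declared in CREATE EXTERNAL TABLE users."""
    userid: int
    age: int
    gender: str
    occupation: str
    zipcode: str

def parse_user(line: str) -> User:
    """Split on '|' (FIELDS TERMINATED BY '|') and cast the INT columns."""
    userid, age, gender, occupation, zipcode = line.rstrip("\n").split("|")
    return User(int(userid), int(age), gender, occupation, zipcode)

# Sample line in the assumed u.user format: userid|age|gender|occupation|zipcode
user = parse_user("1|24|M|technician|85711")
print(user)
```

The zipcode stays a string, matching the STRING type in the table definition, so leading zeros and non-numeric codes are preserved.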
7. Querying Data from Hive Table
8. Loading Data to test_tbl Table
$ hive
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Creating Hive table
hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data.csv' INTO TABLE
test_tbl;
Copying data from file:/tmp/test_tbl_data.csv
Copying file: file:/tmp/test_tbl_data.csv
Loading data to table default.test_tbl
OK
Time taken: 0.241 seconds
hive (default)>
Loading data to Hive table
9. Reviewing Hive Table Content from HDFS Command
and WebUI
[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl
Found 1 items
-rw-r--r-- 1 hdadmin supergroup 59 2013-03-17 18:08 /user/hive/warehouse/test_tbl/test_tbl_data.csv
[hdadmin@localhost hdadmin]$
[hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data.csv
1,USA
62,Indonesia
63,Philippines
65,Singapore
66,Thailand
[hdadmin@localhost hdadmin]$
Loading Data to Hive Table
$ hive
hive (default)> CREATE TABLE products
(
prod_name STRING,
description STRING,
category STRING,
qty_on_hand INT,
prod_num STRING,
packaged_with ARRAY<STRING>
)
row format delimited
fields terminated by ','
collection items terminated by ':'
stored as textfile;
Creating Hive table
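The products table combines two delimiters: ',' between fields and ':' between the items of the packaged_with array. The sketch below shows how one text row would be split under that layout. It is plain Python, the function name and the sample row are invented for illustration, and it assumes no field contains a comma itself.

```python
def parse_product(line):
    """Split a row on ',' and the last field on ':' (collection items)."""
    prod_name, description, category, qty, prod_num, packaged = (
        line.rstrip("\n").split(",")
    )
    return {
        "prod_name": prod_name,
        "description": description,
        "category": category,
        "qty_on_hand": int(qty),  # declared as INT
        "prod_num": prod_num,
        # ARRAY<STRING>: items separated by ':'; an empty field gives an empty list
        "packaged_with": packaged.split(":") if packaged else [],
    }

# Hypothetical row, not taken from the workshop data.
product = parse_product("Widget,Small widget,tools,12,P-001,bolt:nut:washer")
print(product["packaged_with"])
```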
Lecture
Understanding Pig
Introduction
A high-level platform for creating MapReduce programs using Hadoop
Pig is a platform for analyzing large data sets that consists of
a high-level language for expressing data analysis programs,
coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is
amenable to substantial parallelization, which in turn enables
them to handle very large data sets.
Pig Components
● Two Components
  ● Language (Pig Latin)
  ● Compiler
● Two Execution Environments
  ● Local: pig -x local
  ● Distributed: pig -x mapreduce
Running Pig
● Script: pig myscript
● Command line (Grunt): pig
● Embedded: writing a Java program
Pig Latin
Pig Execution Stages
Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi
Why Pig?
● Makes writing Hadoop jobs easier
  ● 5% of the code, 5% of the time
  ● You don't need to be a programmer to write Pig scripts
● Provides the major functionality required for data warehousing and analytics
  ● Load, Filter, Join, Group By, Order, Transform
● Users can write custom UDFs (User Defined Functions)
Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi
Pig vs. Hive
Hands-On: Running a Pig script
Installing Pig
# wget http://archive.apache.org/dist/hadoop/pig/stable/pig-0.7.0.tar.gz
# tar -xvzf pig-0.7.0.tar.gz
# sudo mv pig-0.7.0 /usr/local/
# rm pig-0.7.0.tar.gz
Install Pig binary file
Installing Pig
Edit $HOME/.bashrc
# sudo vi $HOME/.bashrc
Starting Pig Command Line
countryFilter.pig
A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float,
lifeex:int, mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
dump C;
#Preparing Data
ubuntu@ip-172-31-12-11:~$ wget https://www.dropbox.com/s/pp168a6oiwqkxyu/hdi-data.csv
#Edit Your Script
ubuntu@ip-172-31-12-11:~$ vi countryFilter.pig
Writing a Pig Script
ubuntu@ip-172-31-12-11:~$ pig -x local
grunt> run countryFilter.pig
Running a Pig Script
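For readers new to Pig Latin, the following sketch walks through the same three steps as countryFilter.pig (load with a schema, filter on gni, order by gni) in plain Python. It is an analogy only, not how Pig executes the script; the three sample rows and country names are invented and do not come from hdi-data.csv.

```python
import csv
import io

# Invented rows in the column order of the script's AS (...) clause.
SAMPLE = """1,CountryA,0.943,81,12,17,47557
2,CountryB,0.682,74,6,12,7694
3,CountryC,0.295,54,1,4,641
"""

# Column names and types taken from the load statement in countryFilter.pig.
FIELDS = [("id", int), ("country", str), ("hdi", float), ("lifeex", int),
          ("mysch", int), ("eysch", int), ("gni", int)]

def load(text):
    """A = load ... using PigStorage(',') AS (...): parse and cast each field."""
    for record in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(FIELDS, record)}

a = list(load(SAMPLE))
b = [row for row in a if row["gni"] > 2000]   # B = FILTER A BY gni > 2000;
c = sorted(b, key=lambda row: row["gni"])     # C = ORDER B BY gni;
for row in c:                                 # dump C;
    print(tuple(row.values()))
```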
Lecture: Understanding Sqoop
Introduction
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line
tool with the following capabilities:
• Imports individual tables or entire databases to files in HDFS
• Generates Java classes to allow you to interact with your imported data
• Provides the ability to import from SQL databases straight into your Hive data warehouse
See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
Architecture Overview
Hands-On: Loading Data from DBMS
to Hadoop HDFS
Sqoop Hands-On Labs
1. Loading Data into MySQL DB
2. Installing Sqoop
3. Configuring Sqoop
4. Installing DB driver for Sqoop
5. Importing data from MySQL to Hive Table
6. Reviewing data from Hive Table
7. Reviewing HDFS Database Table files
1. MySQL RDS Server on AWS
An RDS server is running on AWS with the following
configuration:
> database: imc_db
> username: admin
> password: imcinstitute
> addr: imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com
[This address may change]
1. country_tbl data
Testing data query from MySQL DB
Table name > country_tbl
2. Installing Sqoop
# wget http://apache.osuosl.org/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
# tar -xvzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
# sudo mv sqoop-1.4.5.bin__hadoop-1.0.0 /usr/local/
# rm sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
Installing Sqoop
Edit $HOME/.bashrc
# sudo vi $HOME/.bashrc
3. Configuring Sqoop
ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf/
ubuntu@ip-172-31-12-11:~$ vi sqoop-env.sh
4. Installing DB driver for Sqoop
ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib/
ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ wget https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar
ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ exit
5. Importing data from MySQL to Hive Table
[hdadmin@localhost ~]$ sqoop import \
  --connect jdbc:mysql://imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com/imc_db \
  --username admin -P --table country_tbl \
  --hive-import --hive-table country -m 1
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
Enter password: <enter here>
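Because the RDS address may change, it can help to assemble the import command from its parts instead of retyping it. The sketch below does this in Python; the function sqoop_import_args is a hypothetical helper, and it only uses the options that appear in the command above. It builds and prints the command without running it, and shlex.join assumes Python 3.8 or later.

```python
import shlex

def sqoop_import_args(host, database, username, table, hive_table, mappers=1):
    """Return the argument list for a MySQL-to-Hive import with Sqoop."""
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{host}/{database}",
        "--username", username,
        "-P",                      # prompt for the password interactively
        "--table", table,
        "--hive-import",
        "--hive-table", hive_table,
        "-m", str(mappers),        # number of parallel map tasks
    ]

args = sqoop_import_args(
    host="imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com",
    database="imc_db",
    username="admin",
    table="country_tbl",
    hive_table="country",
)
print(shlex.join(args))
```

Keeping -P rather than putting the password in the argument list means the password is not stored in shell history or in the script.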
6. Reviewing data from Hive Table
7. Reviewing HDFS Database Table files
Open a web browser at http://54.68.149.232:50070, then navigate to /user/hive/warehouse
Lecture
Understanding HBase
Introduction
An open source, non-relational, distributed database
HBase is an open source, non-relational, distributed database
modeled after Google's BigTable and written in Java. It is
developed as part of the Apache Software Foundation's Apache
Hadoop project and runs on top of HDFS (Hadoop Distributed File System),
providing BigTable-like capabilities for Hadoop. That is, it provides a
fault-tolerant way of storing large quantities of sparse data.
HBase Features
● Hadoop database modelled after Google's Bigtable
● Column-oriented data store, known as the Hadoop Database
● Supports random real-time CRUD operations (unlike HDFS)
● NoSQL database
● Open source, written in Java
● Runs on a cluster of commodity hardware
When to use HBase?
● When you need high-volume data to be stored
● Unstructured data
● Sparse data
● Column-oriented data
● Versioned data (same data template, captured at various times; time-lapse data)
● When you need high scalability
Which one to use?
● HDFS
  ● Append-only dataset (no random write)
  ● Read the whole dataset (no random read)
● HBase
  ● Need random write and/or read
  ● Thousands of operations per second on TB+ of data
● RDBMS
  ● Data fits on one big node
  ● Need full transaction support
  ● Need real-time query capabilities
HBase Components
● Region
  ● Where the rows of a table are stored
● Region Server
  ● Hosts the tables
● Master
  ● Coordinates the Region Servers
● ZooKeeper
● HDFS
● API
  ● The Java Client API
HBase Architecture
HBase Shell Commands
Hands-On: Running HBase
Installing HBase
# wget http://apache.cs.utah.edu/hbase/hbase-1.0.0/hbase-1.0.0-bin.tar.gz
# tar -xvzf hbase-1.0.0-bin.tar.gz
# sudo mv hbase-1.0.0 /usr/local/
# rm hbase-1.0.0-bin.tar.gz
Installing HBase
Edit $HOME/.bashrc
# sudo vi $HOME/.bashrc
Starting HBase shell
ubuntu@ip-172-31-12-11:~$ start-hbase.sh
starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out
ubuntu@ip-172-31-12-11:~$ jps
3064 TaskTracker
2836 SecondaryNameNode
2588 NameNode
3513 Jps
3327 HMaster
2938 JobTracker
2707 DataNode
ubuntu@ip-172-31-12-11:~$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013
hbase(main):001:0>
Create a table and insert data in HBase
hbase(main):009:0> create 'test', 'cf'
0 row(s) in 1.0830 seconds
hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1'
0 row(s) in 0.0750 seconds
hbase(main):011:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1375363287644, value=val1
1 row(s) in 0.0640 seconds
hbase(main):002:0> get 'test', 'row1'
COLUMN CELL
cf:a timestamp=1375363287644, value=val1
1 row(s) in 0.0370 seconds
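The shell session above can be read as operations on a nested map: table, then row key, then column (family:qualifier), then a timestamped value. The toy class below imitates create, put, get and scan under that reading. It is a simplified in-memory model written for this workshop text, not the HBase client API, and unlike HBase it keeps only the latest version of each cell.

```python
import time

class TinyTable:
    """Toy model of an HBase table: row key -> column -> (timestamp, value)."""

    def __init__(self, *families):
        self.families = set(families)   # column families fixed at creation
        self.rows = {}

    def put(self, row, column, value, timestamp=None):
        """Store a cell; the column must belong to a declared family."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = timestamp if timestamp is not None else int(time.time() * 1000)
        self.rows.setdefault(row, {})[column] = (ts, value)

    def get(self, row):
        """Return all cells of one row (random read by row key)."""
        return dict(self.rows.get(row, {}))

    def scan(self):
        """Yield rows in sorted row-key order, as a scan does."""
        for row in sorted(self.rows):
            yield row, self.rows[row]

test = TinyTable("cf")                                       # create 'test', 'cf'
test.put("row1", "cf:a", "val1", timestamp=1375363287644)    # put 'test', 'row1', 'cf:a', 'val1'
print(list(test.scan()))                                     # scan 'test'
print(test.get("row1"))                                      # get 'test', 'row1'
```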
Recommendations for Further Study
Thank you
www.imcinstitute.com
www.facebook.com/imcinstitute

 
Big Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop WorkshopBig Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop WorkshopIMC Institute
 
Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)IMC Institute
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programPraveen Kumar Donta
 
Deploying Foreman in Enterprise Environments
Deploying Foreman in Enterprise EnvironmentsDeploying Foreman in Enterprise Environments
Deploying Foreman in Enterprise Environmentsinovex GmbH
 
Ubuntu And Parental Controls
Ubuntu And Parental ControlsUbuntu And Parental Controls
Ubuntu And Parental Controlsjasonholtzapple
 
Hadoop 101 handson Lab
Hadoop 101 handson LabHadoop 101 handson Lab
Hadoop 101 handson LabSunil Ranka
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)DataWorks Summit
 
Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...
Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...
Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...Amazon Web Services
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High AvailabilityCloudera, Inc.
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingImpetus Technologies
 
Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)Anand Sampat
 
HDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFSHDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFSDataWorks Summit
 
Hadoop Online training by Keylabs
Hadoop Online training by KeylabsHadoop Online training by Keylabs
Hadoop Online training by KeylabsSiva Sankar
 
Hadoop tools with Examples
Hadoop tools with ExamplesHadoop tools with Examples
Hadoop tools with ExamplesJoe McTee
 
Apache bigtopwg7142013
Apache bigtopwg7142013Apache bigtopwg7142013
Apache bigtopwg7142013Doug Chang
 

Ähnlich wie Hadoop Workshop on EC2 : March 2015 (20)

Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsBig Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
 
Big Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop WorkshopBig Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop Workshop
 
Setting up Hadoop YARN Clustering
Setting up Hadoop YARN ClusteringSetting up Hadoop YARN Clustering
Setting up Hadoop YARN Clustering
 
Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce program
 
Deploying Foreman in Enterprise Environments
Deploying Foreman in Enterprise EnvironmentsDeploying Foreman in Enterprise Environments
Deploying Foreman in Enterprise Environments
 
Ubuntu And Parental Controls
Ubuntu And Parental ControlsUbuntu And Parental Controls
Ubuntu And Parental Controls
 
Hadoop 101 handson Lab
Hadoop 101 handson LabHadoop 101 handson Lab
Hadoop 101 handson Lab
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
 
Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...
Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...
Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High Availability
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)
 
HDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFSHDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFS
 
Upgrading hadoop
Upgrading hadoopUpgrading hadoop
Upgrading hadoop
 
Django Deployment-in-AWS
Django Deployment-in-AWSDjango Deployment-in-AWS
Django Deployment-in-AWS
 
Hadoop Online training by Keylabs
Hadoop Online training by KeylabsHadoop Online training by Keylabs
Hadoop Online training by Keylabs
 
Introduction to HCFS
Introduction to HCFSIntroduction to HCFS
Introduction to HCFS
 
Hadoop tools with Examples
Hadoop tools with ExamplesHadoop tools with Examples
Hadoop tools with Examples
 
Apache bigtopwg7142013
Apache bigtopwg7142013Apache bigtopwg7142013
Apache bigtopwg7142013
 

Mehr von IMC Institute

นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14IMC Institute
 
Digital trends Vol 4 No. 13 Sep-Dec 2019
Digital trends Vol 4 No. 13  Sep-Dec 2019Digital trends Vol 4 No. 13  Sep-Dec 2019
Digital trends Vol 4 No. 13 Sep-Dec 2019IMC Institute
 
บทความ The evolution of AI
บทความ The evolution of AIบทความ The evolution of AI
บทความ The evolution of AIIMC Institute
 
IT Trends eMagazine Vol 4. No.12
IT Trends eMagazine  Vol 4. No.12IT Trends eMagazine  Vol 4. No.12
IT Trends eMagazine Vol 4. No.12IMC Institute
 
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformationเพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital TransformationIMC Institute
 
IT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to WorkIT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to WorkIMC Institute
 
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรมมูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรมIMC Institute
 
IT Trends eMagazine Vol 4. No.11
IT Trends eMagazine  Vol 4. No.11IT Trends eMagazine  Vol 4. No.11
IT Trends eMagazine Vol 4. No.11IMC Institute
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationIMC Institute
 
บทความ The New Silicon Valley
บทความ The New Silicon Valleyบทความ The New Silicon Valley
บทความ The New Silicon ValleyIMC Institute
 
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10
นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10IMC Institute
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationIMC Institute
 
The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)IMC Institute
 
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง IMC Institute
 
IT Trends eMagazine Vol 3. No.9
IT Trends eMagazine  Vol 3. No.9 IT Trends eMagazine  Vol 3. No.9
IT Trends eMagazine Vol 3. No.9 IMC Institute
 
Thailand software & software market survey 2016
Thailand software & software market survey 2016Thailand software & software market survey 2016
Thailand software & software market survey 2016IMC Institute
 
Developing Business Blockchain Applications on Hyperledger
Developing Business  Blockchain Applications on Hyperledger Developing Business  Blockchain Applications on Hyperledger
Developing Business Blockchain Applications on Hyperledger IMC Institute
 
Digital transformation @thanachart.org
Digital transformation @thanachart.orgDigital transformation @thanachart.org
Digital transformation @thanachart.orgIMC Institute
 
บทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.orgบทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.orgIMC Institute
 
กลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital Transformationกลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital TransformationIMC Institute
 

Mehr von IMC Institute (20)

นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14
 
Digital trends Vol 4 No. 13 Sep-Dec 2019
Digital trends Vol 4 No. 13  Sep-Dec 2019Digital trends Vol 4 No. 13  Sep-Dec 2019
Digital trends Vol 4 No. 13 Sep-Dec 2019
 
บทความ The evolution of AI
บทความ The evolution of AIบทความ The evolution of AI
บทความ The evolution of AI
 
IT Trends eMagazine Vol 4. No.12
IT Trends eMagazine  Vol 4. No.12IT Trends eMagazine  Vol 4. No.12
IT Trends eMagazine Vol 4. No.12
 
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformationเพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
 
IT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to WorkIT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to Work
 
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรมมูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
 
IT Trends eMagazine Vol 4. No.11
IT Trends eMagazine  Vol 4. No.11IT Trends eMagazine  Vol 4. No.11
IT Trends eMagazine Vol 4. No.11
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformation
 
บทความ The New Silicon Valley
บทความ The New Silicon Valleyบทความ The New Silicon Valley
บทความ The New Silicon Valley
 
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10
นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformation
 
The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)
 
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
 
IT Trends eMagazine Vol 3. No.9
IT Trends eMagazine  Vol 3. No.9 IT Trends eMagazine  Vol 3. No.9
IT Trends eMagazine Vol 3. No.9
 
Thailand software & software market survey 2016
Thailand software & software market survey 2016Thailand software & software market survey 2016
Thailand software & software market survey 2016
 
Developing Business Blockchain Applications on Hyperledger
Developing Business  Blockchain Applications on Hyperledger Developing Business  Blockchain Applications on Hyperledger
Developing Business Blockchain Applications on Hyperledger
 
Digital transformation @thanachart.org
Digital transformation @thanachart.orgDigital transformation @thanachart.org
Digital transformation @thanachart.org
 
บทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.orgบทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.org
 
กลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital Transformationกลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital Transformation
 

Kürzlich hochgeladen

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 

Kürzlich hochgeladen (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 

Hadoop Workshop on EC2 : March 2015

  • 1. Big Data using Hadoop: Hands-On Workshop, March 2015. Dr. Thanachart Numnonda, Certified Java Programmer, thanachart@imcinstitute.com. Danairat T., Certified Java Programmer, TOGAF – Silver, danairat@gmail.com, +66-81-559-1446. (Slide footer: Danairat T., 2013, danairat@gmail.com, Big Data Hadoop – Hands On Workshop.)
  • 2. Hands-On: Launch a virtual server on EC2, Amazon Web Services. (Slide footer: Danairat T., danairat@gmail.com; Thanachart Numnonda, thanachart@imcinstitute.com; Aug 2013; Big Data Hadoop on Amazon EMR – Hands On Workshop.)
  • 4. Hadoop Installation. Hadoop provides three installation choices:
      1. Local mode: an unzip-and-run mode that gets you started right away; all parts of Hadoop run within the same JVM.
      2. Pseudo-distributed mode: the different parts of Hadoop run as separate Java processes, but on a single machine.
      3. Distributed mode: the real setup, which spans multiple machines.
  • 5. Virtual Server. This lab installs Hadoop on an EC2 virtual server with the following configuration:
      ● Ubuntu Server 14.04 LTS
      ● m3.medium: 1 vCPU, 3.75 GB memory
      ● Security group: default
      ● Key pair: imchadoop
  • 6. Select the EC2 service and click Launch Instance.
  • 7. Select an Amazon Machine Image (AMI): Ubuntu Server 14.04 LTS (PV).
  • 8. Choose the m3.medium instance type.
  • 9. Leave the configuration details at their defaults.
  • 10. Add storage: 20 GB.
  • 11. Name the instance.
  • 12. Choose "Select an existing security group", then select the security group named default.
  • 13. Click Launch and choose imchadoop as the key pair.
  • 14. Review the instance, then click Connect for instructions on connecting to it.
  • 15. Connect to the instance from Mac/Linux.
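The slide shows the connection command only as a screenshot. A minimal sketch of the usual OpenSSH invocation follows, assuming the key pair was saved as `imchadoop.pem` and that the instance's login user is `ubuntu` (the user name seen in the later shell prompts). `connect_ec2` and `DRY_RUN` are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: connect to the EC2 instance over SSH.
# $1 = path to the private key (.pem) downloaded when the key pair was created
# $2 = public DNS name or public IP of the instance (shown in the EC2 console)
connect_ec2() {
    key_file=$1
    host=$2
    # ssh refuses private keys that other users can read.
    chmod 400 "$key_file" || return 1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        echo "ssh -i $key_file ubuntu@$host"
    else
        ssh -i "$key_file" "ubuntu@$host"
    fi
}
```

For example, `connect_ec2 imchadoop.pem 54.68.149.232` would open a session on the address used later in the deck; setting `DRY_RUN=1` only prints the command.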
  • 16. Connect to the instance from Windows using PuTTY.
  • 17. Connect to the instance.
  • 18. Hands-On: Installing Hadoop
  • 19. Installing Hadoop and Ecosystem
      1. Update the system
      2. Configure SSH
      3. Install JDK 1.7
      4. Download/extract Hadoop
      5. Install Hadoop
      6. Configure the XML files
      7. Format HDFS
      8. Start Hadoop
      9. Hadoop web console
      10. Stop Hadoop
      Note on Hadoop and IPv6: Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Source: http://wiki.apache.org/hadoop/HadoopIPv6
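Given the IPv4 requirement in the note above, a common workaround on dual-stack hosts is to make the Hadoop JVMs prefer the IPv4 stack. The sketch below assumes the line is added to `/usr/local/hadoop/conf/hadoop-env.sh`. `java.net.preferIPv4Stack` is a standard JVM networking property, but the slides do not say whether the workshop used it.

```shell
# Assumed addition to conf/hadoop-env.sh (not shown on the slides):
# ask every JVM started by the Hadoop scripts to use IPv4 sockets only.
export HADOOP_OPTS="${HADOOP_OPTS:-} -Djava.net.preferIPv4Stack=true"
```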
  • 20. 1) Update the system: sudo apt-get update
  • 21. 2) Configure SSH: ssh-keygen
  • 22. Enable SSH access to your local machine:
      $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      Test the SSH setup by connecting to your local machine:
      $ ssh 54.68.149.232
      Then leave the session:
      $ exit
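The Hadoop start scripts reach each daemon host over SSH, so the login has to work without a passphrase prompt. The steps above can be sketched as one repeatable function, assuming an RSA key with an empty passphrase. `enable_local_ssh` and its directory argument are hypothetical names used for illustration.

```shell
# Hypothetical helper: make key-based SSH logins to this machine possible.
# $1 = SSH directory (defaults to ~/.ssh)
enable_local_ssh() {
    ssh_dir=${1:-"$HOME/.ssh"}
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    # Create a key pair only if none exists; -N "" sets an empty passphrase.
    if [ ! -f "$ssh_dir/id_rsa" ]; then
        ssh-keygen -q -t rsa -N "" -f "$ssh_dir/id_rsa" || return 1
    fi
    touch "$ssh_dir/authorized_keys"
    # Append the public key unless an identical line is already present.
    if ! grep -qxF -f "$ssh_dir/id_rsa.pub" "$ssh_dir/authorized_keys"; then
        cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys"
    fi
    chmod 600 "$ssh_dir/authorized_keys"
}
```

Calling `enable_local_ssh` twice leaves a single copy of the key in `authorized_keys`, so the step is safe to repeat; existing entries such as the EC2 key pair are left untouched.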
  • 23. 3) Install JDK 1.7: sudo apt-get install openjdk-7-jdk (enter Y when prompted). Then verify the installation: java -version
  • 24. 4) Download/Extract Hadoop
      1) Type command > wget http://mirror.issp.co.th/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
      2) Type command > tar -xvzf hadoop-1.2.1.tar.gz
      3) Type command > sudo mv hadoop-1.2.1 /usr/local/hadoop
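Mirrors eventually drop old releases, so the mirror URL above may no longer resolve. As a fallback sketch, the Apache archive keeps past releases under a predictable path. `hadoop_archive_url` and `fetch_hadoop` are hypothetical helper names, and the install path matches the slide.

```shell
# Hypothetical helper: print the Apache archive URL for a Hadoop release.
hadoop_archive_url() {
    version=$1
    echo "https://archive.apache.org/dist/hadoop/common/hadoop-$version/hadoop-$version.tar.gz"
}

# Hypothetical helper: download, unpack and install a release.
# Defined but not invoked here, because it needs network access and sudo.
fetch_hadoop() {
    version=$1
    wget "$(hadoop_archive_url "$version")" || return 1
    tar -xzf "hadoop-$version.tar.gz" || return 1
    sudo mv "hadoop-$version" /usr/local/hadoop
}
```

Running `fetch_hadoop 1.2.1` would then reproduce the three commands of this step against the archive instead of the mirror.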
  • 25. 5) Installing Hadoop
      1) Type command > sudo vi $HOME/.bashrc
      2) Add the configuration shown in the slide's figure
      3) Type command > exec bash
      4) Type command > sudo vi /usr/local/hadoop/conf/hadoop-env.sh
      5) Edit the file as shown in the slide's figure
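The figures for this step did not survive extraction. A minimal sketch of what such a setup typically contains follows, assuming the OpenJDK 7 package on 64-bit Ubuntu installs under `/usr/lib/jvm/java-7-openjdk-amd64` (check with `ls /usr/lib/jvm`). The exact lines on the original slide may differ, and `HADOOP_PREFIX` is used here only to build `PATH`.

```shell
# Assumed lines for $HOME/.bashrc (the slide's figure is not available).
# JDK location: an assumption for openjdk-7-jdk on 64-bit Ubuntu.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
# Where the Hadoop 1.2.1 tree was moved in step 4.
export HADOOP_PREFIX=/usr/local/hadoop
# Put hadoop, start-all.sh and stop-all.sh on the command search path.
export PATH="$PATH:$HADOOP_PREFIX/bin"

# Assumed line for /usr/local/hadoop/conf/hadoop-env.sh: the Hadoop scripts
# source that file, so JAVA_HOME is set there explicitly as well.
# export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
```

After `exec bash`, `which hadoop` should report `/usr/local/hadoop/bin/hadoop` if the path is set as sketched.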
  • 26. 6) Configuring Hadoop conf/*-site.xml
      1. core-site.xml (hadoop.tmp.dir, fs.default.name)
      2. hdfs-site.xml (dfs.replication)
      3. mapred-site.xml (mapred.job.tracker)
  • 27. Configuring core-site.xml
      1) $ sudo vi /usr/local/hadoop/conf/core-site.xml
      2) Add the private IP of the server as shown in the figure (in this case the private IP is 172.31.12.11).
  • 28. Configuring mapred-site.xml
      1) $ sudo vi /usr/local/hadoop/conf/mapred-site.xml
      2) Add the private IP of the JobTracker server as shown in the figure (in this case the private IP is 172.31.12.11).
  • 29. Configuring hdfs-site.xml
      1) $ sudo vi /usr/local/hadoop/conf/hdfs-site.xml
      2) Add the configuration as shown in the figure.
  • 30. 7) Formatting HDFS
      1) $ sudo mkdir /usr/local/hadoop/tmp
      2) $ sudo chown ubuntu /usr/local/hadoop
      3) $ sudo chown ubuntu /usr/local/hadoop/tmp
      4) $ hadoop namenode -format
  • 31. Starting Hadoop
      Start a NameNode, DataNode, JobTracker and TaskTracker on your machine:
      ubuntu@ip-172-31-12-11:~$ start-all.sh
      Check the Java processes:
      ubuntu@ip-172-31-12-11:~$ jps
      11567 Jps
      10766 NameNode
      11099 JobTracker
      11221 TaskTracker
      10899 DataNode
      11018 SecondaryNameNode
      You are now running Hadoop in pseudo-distributed mode.
  • 32. Hadoop is up! Viewing the Hadoop HDFS using WebUI: http://54.68.149.232:50070/
  • 33. Stopping Hadoop
      ubuntu@ip-172-31-12-11:~$ /usr/local/hadoop/bin/stop-all.sh
      stopping jobtracker
      localhost: stopping tasktracker
      stopping namenode
      localhost: stopping datanode
      localhost: stopping secondarynamenode
  • 34. Hands-On: Importing Data to HDFS using Hadoop Command Line
  • 35. Importing Data to Hadoop: download the War and Peace full text from www.gutenberg.org/ebooks/2600
  • 36. Importing Data to Hadoop
      Download the file pg2600.txt:
      $ wget https://dl.dropboxusercontent.com/u/12655380/pg2600.txt
      Import to Hadoop:
      $ hadoop fs -mkdir /input
      $ hadoop fs -mkdir /output
      $ hadoop fs -copyFromLocal pg2600.txt /input
  • 37. Hands-On: Reviewing, Retrieving, Deleting Data from HDFS
  • 38. Review file in Hadoop HDFS (list HDFS file, read HDFS file, retrieve HDFS file to local file system)
      Read an HDFS file:
      ubuntu@ip-172-31-12-11:~$ hadoop fs -cat /input/pg2600.txt
      Retrieve an HDFS file to the local file system:
      ubuntu@ip-172-31-12-11:~$ hadoop fs -copyToLocal /input/pg2600.txt /tmp/file.txt
      Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
  • 39. Review file in Hadoop HDFS using WebUI
  • 40. Hadoop Port Numbers
      Daemon                    Default Port   Configuration Parameter in conf/*-site.xml
      HDFS  Namenode            50070          dfs.http.address
      HDFS  Datanodes           50075          dfs.datanode.http.address
      HDFS  Secondarynamenode   50090          dfs.secondary.http.address
      MR    JobTracker          50030          mapred.job.tracker.http.address
      MR    Tasktrackers        50060          mapred.task.tracker.http.address
  • 41. Review Content from System shell
  • 42. Removing data from HDFS using Shell Command
      [hdadmin@localhost detach]$ hadoop dfs -rm /input/input_test.txt
      Deleted hdfs://localhost:54310/input/input_test.txt
      [hdadmin@localhost detach]$
  • 43. Lecture: Understanding MapReduce Processing (diagram labels: Client; Name Node; Job Tracker; three Data Node / Task Tracker pairs; Map; Reduce)
  • 44. High Level Architecture of MapReduce
  • 45. Before MapReduce…
      ● Large scale data processing was difficult!
        – Managing hundreds or thousands of processors
        – Managing parallelization and distribution
        – I/O Scheduling
        – Status and monitoring
        – Fault/crash tolerance
      ● MapReduce provides all of these, easily!
      Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
  • 46. MapReduce Overview
      ● What is it?
        – Programming model used by Google
        – A combination of the Map and Reduce models with an associated implementation
        – Used for processing and generating large data sets
  • 47. MapReduce Overview
      ● How does it solve our previously mentioned problems?
        – MapReduce is highly scalable and can be used across many computers.
        – Many small machines can be used to process jobs that normally could not be processed by a large machine.
  • 48. MapReduce Framework. Source: www.bigdatauniversity.com
  • 49. How Map and Reduce Work Together
  • 50. How Map and Reduce Work Together
      ● Map returns information
      ● Reduce accepts information
      ● Reduce applies a user-defined function to reduce the amount of data
  • 51. Map Abstraction
      ● Inputs a key/value pair
        – Key is a reference to the input value
        – Value is the data set on which to operate
      ● Evaluation
        – Function defined by user
        – Applies to every value in value input
          ● Might need to parse input
      ● Produces a new list of key/value pairs
        – Can be different type from input pair
  • 52. Reduce Abstraction
      ● Starts with intermediate key/value pairs
      ● Ends with finalized key/value pairs
      ● Starting pairs are sorted by key
      ● Iterator supplies the values for a given key to the Reduce function.
  • 53. Reduce Abstraction
      ● Typically a function that:
        – Starts with a large number of key/value pairs
          ● One key/value for each word in all files being grepped (including multiple entries for the same word)
        – Ends with very few key/value pairs
          ● One key/value for each unique word across all the files with the number of instances summed into this entry
      ● Broken up so a given worker works with input of the same key.
  • 54. Other Applications
      ● Yahoo!
        – Webmap application uses Hadoop to create a database of information on all known webpages
      ● Facebook
        – Hive data center uses Hadoop to provide business statistics to application developers and advertisers
      ● Rackspace
        – Analyzes server log files and usage data using Hadoop
  • 55. Why is this approach better?
      ● Creates an abstraction for dealing with complex overhead
        – The computations are simple, the overhead is messy
      ● Removing the overhead makes programs much smaller and thus easier to use
        – Less testing is required as well. The MapReduce libraries can be assumed to work properly, so only user code needs to be tested
      ● Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation
  • 56. MapReduce Framework
      map: (K1, V1) -> list(K2, V2)
      reduce: (K2, list(V2)) -> list(K3, V3)
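The two signatures can be made concrete with a small, Hadoop-free sketch in plain Java. The class name `MapReduceTypes`, the interfaces `MapFn` and `ReduceFn`, and the `run` driver are hypothetical names chosen for illustration and are not part of the Hadoop API. The driver only imitates the grouping by key that the framework performs between the two phases, and the byte offsets assume single-byte characters.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the two signatures; not the Hadoop API.
class MapReduceTypes {

    // map: (K1, V1) -> list(K2, V2)
    interface MapFn<K1, V1, K2, V2> {
        List<Entry<K2, V2>> map(K1 key, V1 value);
    }

    // reduce: (K2, list(V2)) -> list(K3, V3)
    interface ReduceFn<K2, V2, K3, V3> {
        List<Entry<K3, V3>> reduce(K2 key, List<V2> values);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Entry<K3, V3>> run(
            List<Entry<K1, V1>> input,
            MapFn<K1, V1, K2, V2> mapFn,
            ReduceFn<K2, V2, K3, V3> reduceFn) {
        // Group the intermediate values by key; TreeMap keeps the keys sorted.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Entry<K1, V1> record : input) {
            for (Entry<K2, V2> kv : mapFn.map(record.getKey(), record.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        List<Entry<K3, V3>> output = new ArrayList<>();
        for (Entry<K2, List<V2>> group : grouped.entrySet()) {
            output.addAll(reduceFn.reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Word count: K1 = Long (offset), V1 = String (line), K2 = K3 = String, V2 = V3 = Integer.
    static List<Entry<String, Integer>> wordCount(List<String> lines) {
        List<Entry<Long, String>> input = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            input.add(new SimpleEntry<>(offset, line));
            offset += line.length() + 1; // assumes single-byte characters plus '\n'
        }
        return MapReduceTypes.<Long, String, String, Integer, String, Integer>run(
                input,
                (key, line) -> {
                    List<Entry<String, Integer>> out = new ArrayList<>();
                    for (String w : line.trim().split("\\s+")) {
                        if (!w.isEmpty()) {
                            out.add(new SimpleEntry<>(w, 1));
                        }
                    }
                    return out;
                },
                (word, counts) -> {
                    int sum = 0;
                    for (int c : counts) {
                        sum += c;
                    }
                    List<Entry<String, Integer>> out = new ArrayList<>();
                    out.add(new SimpleEntry<>(word, sum));
                    return out;
                });
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("Deer Bear River", "Car Car River", "Deer Car Bear")));
    }
}
```

Note that the output type (K3, V3) is allowed to differ from the intermediate type (K2, V2); word count simply happens to use the same types for both.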
  • 57. How does the MapReduce work? (diagram labels: InputSplit; RecordReader; output in a list of (Key, Value) in the intermediate file; Sorting; Partitioning; output in a list of (Key, List of Values) in the intermediate file; RecordWriter)
  • 58. How does the MapReduce work? (same diagram with a Combining step added to Sorting and Partitioning; example values: Car, 2 after combining; then Bear, {1,1}; Car, {2,1}; River, {1,1}; Deer, {1,1})
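The example values on this slide can be reproduced with a short in-memory sketch. It assumes three input splits containing the text "Deer Bear River", "Car Car River" and "Deer Car Bear"; this input is an assumption chosen because it is consistent with the values on the slide. The class and method names are hypothetical, and no Hadoop classes are used. The combiner sums counts inside each split before the shuffle, which is why Car arrives at the reducer as {2,1} rather than {1,1,1}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory walk-through of map -> combine -> shuffle -> reduce.
class CombinerSketch {

    // Map + combine for one input split: emit (word, 1) and sum locally.
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                local.merge(word, 1, Integer::sum);
            }
        }
        return local;
    }

    // Shuffle: gather the partial sums for each key across all splits.
    static Map<String, List<Integer>> shuffle(List<String> splits) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String split : splits) {
            for (Map.Entry<String, Integer> e : mapAndCombine(split).entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum the partial counts for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int partial : e.getValue()) {
                sum += partial;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("Deer Bear River", "Car Car River", "Deer Car Bear");
        Map<String, List<Integer>> grouped = shuffle(splits);
        System.out.println(grouped);         // reducer input per key
        System.out.println(reduce(grouped)); // final counts per key
    }
}
```

Summation is safe to use as a combiner because it is associative and commutative; a reduce function without those properties could not be reused this way.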
  • 59. MapReduce Processing – The Data flow
      1. InputFormat, InputSplits, RecordReader
      2. Mapper - your focus is here
      3. Partition, Shuffle & Sort
      4. Reducer - your focus is here
      5. OutputFormat, RecordWriter
  • 60. InputFormat
      InputFormat               Description                                        Key                                        Value
      TextInputFormat           Default format; reads lines of text files          The byte offset of the line                The line contents
      KeyValueTextInputFormat   Parses lines into key, val pairs                   Everything up to the first tab character   The remainder of the line
      SequenceFileInputFormat   A Hadoop-specific high-performance binary format   user-defined                               user-defined
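To make the Key and Value columns concrete, here is a minimal plain-Java sketch of the record shapes for the first two formats. It is not Hadoop code: `InputRecordSketch` and its methods are hypothetical names, Unix line endings are assumed, and treating a line without a tab as "the whole line is the key, the value is empty" is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

// Hypothetical illustration of the record shapes in the table; not Hadoop code.
class InputRecordSketch {

    // Text-style records: key = byte offset where the line starts, value = line text.
    static List<Entry<Long, String>> textRecords(String fileContents) {
        List<Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            records.add(new SimpleEntry<>(offset, line));
            // +1 accounts for the '\n' terminator (assumes Unix line endings).
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // Key/value-style records: key = text before the first tab, value = the remainder.
    static Entry<String, String> keyValueRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleEntry<>(line, ""); // assumption: no tab means an empty value
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        System.out.println(textRecords("Deer Bear River\nCar Car River\n"));
        System.out.println(keyValueRecord("66\tThailand"));
    }
}
```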
  • 61. InputSplit
      An InputSplit describes a unit of work that comprises a single map task, and presents a byte-oriented view of the input. You can influence the split size by setting the mapred.min.split.size parameter in core-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job.
      RecordReader
      RecordReader reads <key, value> pairs from an InputSplit. Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, into a record-oriented view for the Mapper.
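The effect of mapred.min.split.size can be sketched with a little arithmetic. The sketch below assumes the rule commonly described for the old `mapred` file input formats, splitSize = max(minSize, min(goalSize, blockSize)), where goalSize is the total input size divided by the requested number of map tasks. The class and method names are hypothetical, the 64 MB block size and 200 MB file are example figures, and the split count ignores any slack the framework may allow on the last split.

```java
// Hypothetical arithmetic sketch; the rule itself is stated as an assumption above.
class SplitSizeSketch {

    static final long MB = 1024L * 1024L;

    // splitSize = max(minSize, min(goalSize, blockSize))
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Number of splits for one file, rounding up (a simplification).
    static long splitCount(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long fileLength = 200 * MB;
        long blockSize = 64 * MB;
        long goalSize = fileLength / 2; // two map tasks requested

        long defaultSplit = computeSplitSize(goalSize, 1, blockSize);
        long raisedSplit = computeSplitSize(goalSize, 128 * MB, blockSize);

        System.out.println("min=1 byte: " + splitCount(fileLength, defaultSplit) + " splits");
        System.out.println("min=128 MB: " + splitCount(fileLength, raisedSplit) + " splits");
    }
}
```

Under these assumptions, raising the minimum above the block size yields fewer, larger splits, at the cost of some map tasks reading blocks that are not local to them.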
  • 62. Mapper
      The Mapper applies the user-defined logic to each input (key, value) pair and emits (key, value) pair(s), which are forwarded to the Reducers.
      Partition, Shuffle & Sort
      After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
      The Partitioner controls how map outputs are assigned to reduce tasks. The total number of partitions is the same as the number of reduce tasks for the job.
      The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
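A minimal sketch of the partitioning step, assuming hash partitioning (the usual default): the key's hash code is made non-negative and taken modulo the number of reduce tasks, so every occurrence of a key lands on the same reducer. The class and method names are hypothetical and the code uses plain Java strings rather than Hadoop's Writable types.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of hash partitioning; uses String keys instead of Writables.
class HashPartitionSketch {

    // Returns a partition index in the range [0, numReduceTasks).
    static int partitionFor(String key, int numReduceTasks) {
        if (numReduceTasks <= 0) {
            throw new IllegalArgumentException("numReduceTasks must be positive");
        }
        // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Bear", "Car", "Deer", "River");
        for (String key : keys) {
            System.out.println(key + " -> reducer " + partitionFor(key, 2));
        }
    }
}
```

Because the partition depends only on the key and the number of reducers, the distribution of work is only as even as the distribution of keys; a few very frequent keys can still overload one reducer.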
  • 63. Reducer
      This is an instance of user-provided code that reads each key and the iterator of its values in the assigned partition. The OutputCollector object in the Reducer phase has a method named collect() which collects a (key, value) output.
      OutputFormat, RecordWriter
      OutputFormat governs the format in which the OutputCollector output is written, and RecordWriter writes the output into HDFS.
      OutputFormat               Description
      TextOutputFormat           Default; writes lines in "key \t value" form
      SequenceFileOutputFormat   Writes binary files suitable for reading into subsequent MapReduce jobs
      NullOutputFormat           Generates no output files
  • 64. Hands-On: Writing your own MapReduce Program
  • 65. Wordcount (Hello World in Hadoop)
      package org.myorg;
      import java.io.IOException;
      import java.util.*;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapred.*;
      import org.apache.hadoop.util.*;
      public class WordCount {
        // Mapper: emits (word, 1) for every whitespace-separated token in a line.
        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();
          public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              output.collect(word, one);
            }
          }
        }
  • 66. Wordcount (Hello World in Hadoop)
        // Reducer: sums the counts received for each word.
        public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }
  • 67. Wordcount (Hello World in Hadoop)
        // Driver: configures the job and submits it; args[0] is the input path, args[1] the output path.
        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          conf.setMapperClass(Map.class);
          conf.setReducerClass(Reduce.class);
          conf.setInputFormat(TextInputFormat.class);
          conf.setOutputFormat(TextOutputFormat.class);
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          JobClient.runJob(conf);
        }
      }
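One consequence of the mapper's use of `StringTokenizer` with its default delimiters is that tokens are split on whitespace only, so case and punctuation are preserved and "war," and "War" are counted as different words. The sketch below reproduces just the tokenize-and-sum logic in plain Java, without Hadoop. `TokenizerSketch` and `countNormalized` are hypothetical additions for illustration, and the normalization shown (lower-casing and dropping everything except a-z) is only one simple choice that discards non-ASCII letters.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical plain-Java sketch of the WordCount tokenizing logic; no Hadoop classes.
class TokenizerSketch {

    // Same tokenizing rule as the mapper: split on whitespace, keep tokens as they are.
    static Map<String, Integer> countTokens(String line) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            counts.merge(tokenizer.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    // Illustrative variant: lower-case the line and replace non a-z characters with spaces first.
    static Map<String, Integer> countNormalized(String line) {
        return countTokens(line.toLowerCase().replaceAll("[^a-z\\s]", " "));
    }

    public static void main(String[] args) {
        String line = "War and war, and Peace.";
        System.out.println(countTokens(line));     // punctuation and case kept
        System.out.println(countNormalized(line)); // punctuation and case removed
    }
}
```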
  • 68. Hands-On: Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 69. Packaging Map Reduce Program
      Usage: assuming HADOOP_HOME is the root of the installation (here /usr/local/hadoop) and HADOOP_VERSION is the Hadoop version installed (here 1.2.1), compile WordCount.java and create a jar:
      $ wget https://dl.dropboxusercontent.com/u/12655380/WordCount.java
      $ mkdir hduser
      $ javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d hduser WordCount.java
      $ jar -cvf ./wordcount.jar -C hduser/ .
      $ hadoop jar ./wordcount.jar org.myorg.WordCount /input/* /output/wordcount_output_dir
      Output: …….
      $ hadoop fs -cat /output/wordcount_output_dir/part-00000
  • 70-75. Reviewing MapReduce Output Result (six screenshot slides)
  • 76. Hands-On: Writing Map/Reduce Program on Eclipse
  • 77. Starting Eclipse
  • 78. Create a Java Project. Let's name it HadoopWordCount
  • 79. Add dependencies to the project
      ● Add the following two JARs to your build path: hadoop-common.jar and hadoop-mapreduce-client-core.jar. Both can be found at /usr/lib/hadoop/client
      ● Perform the following steps:
        – Add a folder named lib to the project
        – Copy the mentioned JARs into this folder
        – Right-click on the project name >> select Build Path >> then Configure Build Path
        – Click on Add Jars, select these two JARs from the lib folder
  • 80. Add dependencies to the project
  • 81. Writing the source code
      ● Right-click the project, then select New >> Package
      ● Name the package org.myorg
      ● Right-click org.myorg, then select New >> Class
      ● Name the class WordCount
      ● Write the source code as shown in the previous slides
  • 83. Building a Jar file
      ● Right-click the project, then select Export
      ● Select Java and then JAR file
      ● Provide the JAR name, wordcount.jar
      ● Leave the JAR package options as default
      ● In the JAR Manifest Specification section, at the bottom, specify the Main class
      ● In this case, select WordCount
      ● Click on Finish
      ● The JAR file will be built and will be located at cloudera/workspace
      Note: you may need to adjust the dialog font size by selecting Windows >> Preferences >> Appearance >> Colors and Fonts
  • 84. Lecture: Understanding Hive
  • 85. Introduction: A Petabyte Scale Data Warehouse Using Hadoop
      Hive was developed by Facebook and is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL.
  • 86. What Hive is NOT
      Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.).
  • 87. Hive Metastore
      ● Stores Hive metadata
      ● Configurations
        – Embedded: in-process metastore, in-process database
        – Local: in-process metastore, out-of-process database
        – Remote: out-of-process metastore, out-of-process database
  • 88. Hive Schema-On-Read
      ● Faster loads into the database (simply copy or move)
      ● Slower queries
      ● Flexibility – multiple schemas for the same data
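The idea behind schema-on-read can be sketched in a few lines of plain Java: the raw line is stored unchanged, and a schema (here just a delimiter and a list of column names) is applied only when the line is read. `SchemaOnReadSketch` and `read` are hypothetical names, the sample line is illustrative data in the pipe-delimited style of the MovieLens user file, and the choice to return null for missing columns and ignore extra fields is an assumption of this sketch rather than a statement about Hive internals.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: the same raw line read under different schemas.
class SchemaOnReadSketch {

    // Applies a schema at read time; the raw line itself is never rewritten.
    static Map<String, String> read(String rawLine, String delimiter, String... columns) {
        String[] fields = rawLine.split(Pattern.quote(delimiter), -1);
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            // Columns beyond the available fields read as null; extra fields are ignored.
            row.put(columns[i], i < fields.length ? fields[i] : null);
        }
        return row;
    }

    public static void main(String[] args) {
        String raw = "1|24|M|technician|85711"; // illustrative sample line

        // Schema A: five columns.
        System.out.println(read(raw, "|", "userid", "age", "gender", "occupation", "zipcode"));
        // Schema B: only the first two columns of the same data.
        System.out.println(read(raw, "|", "userid", "age"));
    }
}
```

This is also why loads are fast and queries slower: nothing is parsed or validated when the file is copied in, so the parsing cost is paid on every read.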
  • 89. HiveQL
      ● Hive Query Language
      ● SQL dialect
      ● No support for:
        – UPDATE, DELETE
        – Transactions
        – Indexes
        – HAVING clause in SELECT
        – Updateable or materialized views
        – Stored procedures
  • 90. Hive Tables
      ● Managed: CREATE TABLE
        – LOAD: file moved into Hive's data warehouse directory
        – DROP: both data and metadata are deleted
      ● External: CREATE EXTERNAL TABLE
        – LOAD: no file moved
        – DROP: only metadata deleted
        – Use when sharing data between Hive and Hadoop applications, or when you want to use multiple schemas on the same data
  • 91. Running Hive
      Hive Shell
      ● Interactive: hive
      ● Script: hive -f myscript
      ● Inline: hive -e 'SELECT * FROM mytable'
      Source: hive.apache.org
  • 92. System Architecture and Components
      ● Metastore: To store the metadata.
      ● Query compiler and execution engine: To convert SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop.
      ● SerDe and ObjectInspectors: Programmable interfaces and implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system.
      ● UDF and UDAF: Programmable interfaces and implementations for user defined functions (scalar and aggregate functions).
      ● Clients: Command line client similar to the MySQL command line.
      Source: hive.apache.org
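The Serializer/Deserializer pairing can be illustrated with a tiny round-trip in plain Java. The interfaces below are hypothetical stand-ins written for this sketch and are much simpler than Hive's real SerDe interfaces; the `Row` type mirrors the (id INT, country STRING) comma-delimited table used in the hands-on labs.

```java
// Hypothetical sketch of the SerDe idea; these are not Hive's actual interfaces.
class SerDeSketch {

    // In-memory form of one record: (id INT, country STRING).
    static final class Row {
        final int id;
        final String country;

        Row(int id, String country) {
            this.id = id;
            this.country = country;
        }
    }

    interface Deserializer<T> {
        T deserialize(String record); // stored text -> Java object
    }

    interface Serializer<T> {
        String serialize(T value); // Java object -> text to store
    }

    // Comma-delimited SerDe for Row.
    static final class CountrySerDe implements Serializer<Row>, Deserializer<Row> {
        @Override
        public Row deserialize(String record) {
            String[] parts = record.split(",", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected id,country but got: " + record);
            }
            return new Row(Integer.parseInt(parts[0].trim()), parts[1]);
        }

        @Override
        public String serialize(Row value) {
            return value.id + "," + value.country;
        }
    }

    public static void main(String[] args) {
        CountrySerDe serDe = new CountrySerDe();
        Row row = serDe.deserialize("66,Thailand");
        System.out.println(row.id + " / " + row.country);
        System.out.println(serDe.serialize(row));
    }
}
```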
  • 93. Architecture Overview (diagram labels: Hive CLI with DDL, Queries and Browsing; Mgmt. Web UI; Thrift API; MetaStore; Hive QL with Parser, Planner and Execution; SerDe with Thrift, Jute, JSON; Map Reduce; HDFS). Source: hive.apache.org
  • 94. Sample HiveQL
      The query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following queries:
      SELECT * FROM t where t.c = 'xyz'
      SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)
      SELECT t1.c1, count(1) from t1 group by t1.c1
      Source: hive.apache.org
  • 95. Hands-On: Creating Table and Retrieving Data using Hive
  • 96. Hive Hands-On Labs
      1. Installing Hive
      2. Configuring / Starting Hive
      3. Creating Hive Table
      4. Reviewing Hive Table in HDFS
      5. Alter and Drop Hive Table
      6. Preparing Dataset
      7. Loading Data to Hive Table
      8. Querying Data from Hive Table
      9. Reviewing Hive Table Content from HDFS Command and WebUI
  • 97. 1. Installing Hive
      Install the Hive binary file:
      $ wget http://apache.mesi.com.ar/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz
      $ tar -xvzf apache-hive-1.1.0-bin.tar.gz
      $ sudo mv apache-hive-1.1.0-bin /usr/local
      $ rm apache-hive-1.1.0-bin.tar.gz
  • 98. 1. Installing Hive
      Edit $HOME/.bashrc:
      $ sudo vi $HOME/.bashrc
  • 99. 2. Configuring Hive
      Create the HDFS directories /tmp/hive and /user/hive/warehouse for Hive:
      [hdadmin@localhost ~]$ hadoop fs -mkdir /tmp/hive
      [hdadmin@localhost ~]$ hadoop fs -mkdir /user/hive/warehouse
      [hdadmin@localhost ~]$ hadoop fs -chmod 777 /tmp/hive
      [hdadmin@localhost ~]$ hadoop fs -chmod 777 /user/hive/warehouse
  • 100. 2. Start Hive
      Starting Hive
      Quit from Hive:
      hive> quit;
  • 101. 3. Creating Hive Table
      hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
      OK
      Time taken: 4.069 seconds
      hive (default)> show tables;
      OK
      test_tbl
      Time taken: 0.138 seconds
      hive (default)> describe test_tbl;
      OK
      id int
      country string
      Time taken: 0.147 seconds
      See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
  • 102. 4. Reviewing Hive Table in HDFS
      [hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse
      Found 1 items
      drwxr-xr-x - hdadmin supergroup 0 2013-03-17 17:51 /user/hive/warehouse/test_tbl
      Review Hive Table from HDFS WebUI
  • 103. 5. Alter and Drop Hive Table
      hive (default)> alter table test_tbl add columns (remarks STRING);
      hive (default)> describe test_tbl;
      OK
      id int
      country string
      remarks string
      Time taken: 0.077 seconds
      hive (default)> drop table test_tbl;
      OK
      Time taken: 0.9 seconds
      See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
  • 104. 6. Preparing Large Dataset: http://grouplens.org/datasets/movielens/
  • 105. MovieLens Dataset
      1) $ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
      2) $ sudo apt-get install unzip
      3) $ unzip ml-100k.zip
      4) $ more ml-100k/u.user
  • 106. 6. Loading Data to Hive Table
      Loading data to Hive table:
      hive (default)> exit;
      ubuntu@ip-172-31-12-11:~/ml-100k$ hadoop fs -put u.user /dataset/movielens/users
      Creating Hive table:
      $ hive
      hive (default)> CREATE EXTERNAL TABLE users (userid INT, age INT, gender STRING, occupation STRING, zipcode STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/dataset/movielens/users';
  • 107. 7. Querying Data from Hive Table
  • 108. 8. Loading Data to test_tbl Table
      Creating Hive table:
      $ hive
      hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
      Loading data to Hive table:
      hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data.csv' INTO TABLE test_tbl;
      Copying data from file:/tmp/test_tbl_data.csv
      Copying file: file:/tmp/test_tbl_data.csv
      Loading data to table default.test_tbl
      OK
      Time taken: 0.241 seconds
  • 109. 9. Reviewing Hive Table Content from HDFS Command and WebUI
      [hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl
      Found 1 items
      -rw-r--r-- 1 hdadmin supergroup 59 2013-03-17 18:08 /user/hive/warehouse/test_tbl/test_tbl_data.csv
      [hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data.csv
      1,USA
      62,Indonesia
      63,Philippines
      65,Singapore
      66,Thailand
  • 110. Loading Data to Hive Table
      Creating Hive table:
      $ hive
      hive (default)> CREATE TABLE products (prod_name STRING, description STRING, category STRING, qty_on_hand INT, prod_num STRING, packaged_with ARRAY<STRING>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' STORED AS TEXTFILE;
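To show what the two delimiters in this table definition mean for a line of data, the sketch below splits a row into fields on ',' and then splits the last field into array items on ':'. It is plain Java written for illustration, not Hive code; `ProductRowSketch`, `Product` and the sample row are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of how a products row breaks down under the two delimiters.
class ProductRowSketch {

    static final class Product {
        final String prodName;
        final String description;
        final String category;
        final int qtyOnHand;
        final String prodNum;
        final List<String> packagedWith;

        Product(String prodName, String description, String category,
                int qtyOnHand, String prodNum, List<String> packagedWith) {
            this.prodName = prodName;
            this.description = description;
            this.category = category;
            this.qtyOnHand = qtyOnHand;
            this.prodNum = prodNum;
            this.packagedWith = packagedWith;
        }
    }

    static Product parse(String line) {
        // Fields are separated by ','; keep trailing empty fields.
        String[] f = line.split(",", -1);
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 fields but got " + f.length);
        }
        // Items of the ARRAY<STRING> column are separated by ':'.
        List<String> items = f[5].isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(f[5].split(":"));
        return new Product(f[0], f[1], f[2], Integer.parseInt(f[3].trim()), f[4], items);
    }

    public static void main(String[] args) {
        // Made-up sample row for illustration only.
        Product p = parse("widget,Small widget,tools,12,P-100,box:manual");
        System.out.println(p.prodName + " qty=" + p.qtyOnHand + " packaged_with=" + p.packagedWith);
    }
}
```

A practical consequence of plain delimited text is that a description containing a comma would shift every later field, so the delimiters should be characters that cannot occur in the data.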
  • 111. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Lecture Understanding Pig
  • 112. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Introduction A high-level platform for creating MapReduce programs Using Hadoop Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
  • 113. Pig Components
      ● Two components
        ● Language (Pig Latin)
        ● Compiler
      ● Two execution environments
        ● Local: pig -x local
        ● Distributed: pig -x mapreduce
  • 114. Running Pig
      ● Script: pig myscript
      ● Command line (Grunt): pig
      ● Embedded: writing a Java program
  • 115. Pig Latin
  • 116. Pig Execution Stages
      Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi
  • 117. Why Pig?
      ● Makes writing Hadoop jobs easier
        ● 5% of the code, 5% of the time
        ● You don't need to be a programmer to write Pig scripts
      ● Provides the major functionality required for data warehousing and analytics
        ● Load, Filter, Join, Group By, Order, Transform
      ● Users can write custom UDFs (user-defined functions)
      Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi
  • 118. Pig vs. Hive
  • 119. Hands-On: Running a Pig script
  • 120. Installing Pig
      Install the Pig binary file:
      # wget http://archive.apache.org/dist/hadoop/pig/stable/pig-0.7.0.tar.gz
      # tar -xvzf pig-0.7.0.tar.gz
      # sudo mv pig-0.7.0 /usr/local/
      # rm pig-0.7.0.tar.gz
  • 121. Installing Pig
      Edit $HOME/.bashrc:
      # sudo vi $HOME/.bashrc
  • 122. Starting Pig Command Line
  • 123. Writing a Pig Script
      # Preparing data
      ubuntu@ip-172-31-12-11:~$ wget https://www.dropbox.com/s/pp168a6oiwqkxyu/hdi-data.csv
      # Edit your script
      ubuntu@ip-172-31-12-11:~$ vi countryFilter.pig
      countryFilter.pig:
      A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
      B = FILTER A BY gni > 2000;
      C = ORDER B BY gni;
      dump C;
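As a rough illustration of what countryFilter.pig computes, the sketch below reproduces the three relations (A: load, B: filter on gni > 2000, C: order by gni) in plain Python. It is not how Pig executes the script: Pig compiles it into MapReduce stages, and Pig turns unparseable fields into nulls, whereas this sketch simply skips such rows. Only the id, country, and gni columns are converted; the sample rows and the function names are invented for illustration.

```python
import csv
import io
from typing import Dict, List

COLUMNS = ["id", "country", "hdi", "lifeex", "mysch", "eysch", "gni"]


def load(text: str) -> List[Dict[str, object]]:
    """A = load ... using PigStorage(',') AS (...), simplified."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        if len(rec) != len(COLUMNS):
            continue  # simplification: skip malformed rows
        row: Dict[str, object] = dict(zip(COLUMNS, rec))
        try:
            row["id"] = int(rec[0])
            row["gni"] = int(rec[6])
        except ValueError:
            continue  # simplification: Pig would keep the row with nulls
        rows.append(row)
    return rows


def country_filter(text: str, min_gni: int = 2000) -> List[Dict[str, object]]:
    a = load(text)
    b = [r for r in a if r["gni"] > min_gni]   # B = FILTER A BY gni > 2000;
    c = sorted(b, key=lambda r: r["gni"])      # C = ORDER B BY gni;
    return c


if __name__ == "__main__":
    # Hypothetical rows in the same column order as the script's schema.
    sample = (
        "1,CountryA,0.9,80,12,17,47000\n"
        "2,CountryB,0.4,60,5,9,1500\n"
        "3,CountryC,0.7,74,6,12,7600\n"
    )
    for r in country_filter(sample):           # dump C;
        print(r["id"], r["country"], r["gni"])
```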
  • 124. Running a Pig Script
      ubuntu@ip-172-31-12-11:~$ pig -x local
      grunt> run countryFilter.pig
  • 125. Lecture: Understanding Sqoop
  • 126. Introduction
      Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
      • Imports individual tables or entire databases to files in HDFS
      • Generates Java classes to allow you to interact with your imported data
      • Provides the ability to import from SQL databases straight into your Hive data warehouse
      See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
  • 127. Architecture Overview
  • 128. Hands-On: Loading Data from DBMS to Hadoop HDFS
  • 129. Sqoop Hands-On Labs
      1. Loading Data into MySQL DB
      2. Installing Sqoop
      3. Configuring Sqoop
      4. Installing DB driver for Sqoop
      5. Importing data from MySQL to Hive Table
      6. Reviewing data from Hive Table
      7. Reviewing HDFS Database Table files
  • 130. 1. MySQL RDS Server on AWS
      An RDS server is running on AWS with the following configuration:
      > database: imc_db
      > username: admin
      > password: imcinstitute
      > addr: imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com [This address may change]
  • 131. 1. country_tbl data
      Testing a data query from the MySQL DB
      Table name: country_tbl
  • 132. 2. Installing Sqoop
      # wget http://apache.osuosl.org/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
      # tar -xvzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
      # sudo mv sqoop-1.4.5.bin__hadoop-1.0.0 /usr/local/
      # rm sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
  • 133. Installing Sqoop
      Edit $HOME/.bashrc:
      # sudo vi $HOME/.bashrc
  • 134. 3. Configuring Sqoop
      ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf/
      ubuntu@ip-172-31-12-11:~$ vi sqoop-env.sh
  • 135. 4. Installing DB driver for Sqoop
      ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib/
      ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ wget https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar
      ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ exit
  • 136. 5. Importing data from MySQL to Hive Table
      [hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com/imc_db --username admin -P --table country_tbl --hive-import --hive-table country -m 1
      Warning: /usr/lib/hbase does not exist! HBase imports will fail.
      Please set $HBASE_HOME to the root of your HBase installation.
      Warning: $HADOOP_HOME is deprecated.
      Enter password: <enter here>
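The import command is long enough that it helps to see its parts separately. The sketch below assembles the same argument list in Python without running it; it uses only the options that appear on the slide (--connect, --username, -P to prompt for the password, --table, --hive-import, --hive-table, and -m for the number of map tasks). The function name build_sqoop_import and the host db.example.com are placeholders invented for this example.

```python
import shlex
from typing import List


def build_sqoop_import(host: str, database: str, username: str,
                       table: str, hive_table: str,
                       mappers: int = 1) -> List[str]:
    """Return the argv for a MySQL-to-Hive Sqoop import (not executed here)."""
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{host}/{database}",  # JDBC URL of the source DB
        "--username", username,
        "-P",                                             # prompt for the password
        "--table", table,                                 # source table in MySQL
        "--hive-import",                                  # load the result into Hive
        "--hive-table", hive_table,                       # target Hive table name
        "-m", str(mappers),                               # number of parallel map tasks
    ]


if __name__ == "__main__":
    # db.example.com is a placeholder host, not the workshop's RDS address.
    argv = build_sqoop_import("db.example.com", "imc_db", "admin",
                              "country_tbl", "country")
    print(" ".join(shlex.quote(a) for a in argv))
```

Keeping the command as a list of arguments rather than one string avoids quoting mistakes if it is later passed to subprocess.run.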
  • 137. 6. Reviewing data from Hive Table
  • 138. 7. Reviewing HDFS Database Table files
      Open a web browser at http://54.68.149.232:50070, then navigate to /user/hive/warehouse
  • 139. 7. Reviewing HDFS Database Table files
  • 140. Lecture: Understanding HBase
  • 141. Introduction: An open source, non-relational, distributed database
      HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.
  • 142. HBase Features
      ● Hadoop database modelled after Google's Bigtable
      ● Column-oriented data store, known as the Hadoop Database
      ● Supports random real-time CRUD operations (unlike HDFS)
      ● NoSQL database
      ● Open source, written in Java
      ● Runs on a cluster of commodity hardware
  • 143. When to use HBase?
      ● When you need to store a high volume of data
      ● Unstructured data
      ● Sparse data
      ● Column-oriented data
      ● Versioned data (same data template, captured at various times; time-lapse data)
      ● When you need high scalability
  • 144. Which one to use?
      ● HDFS
        ● Append-only dataset (no random write)
        ● Read the whole dataset (no random read)
      ● HBase
        ● Need random write and/or read
        ● Has thousands of operations per second on TB+ of data
      ● RDBMS
        ● Data fits on one big node
        ● Need full transaction support
        ● Need real-time query capabilities
  • 147. HBase Components
      ● Region
        ● Where the rows of a table are stored
      ● Region Server
        ● Hosts the tables
      ● Master
        ● Coordinates the Region Servers
      ● ZooKeeper
      ● HDFS
      ● API
        ● The Java Client API
  • 148. HBase Architecture
  • 149. HBase Shell Commands
  • 150. Hands-On: Running HBase
  • 151. Installing HBase
      # wget http://apache.cs.utah.edu/hbase/hbase-1.0.0/hbase-1.0.0-bin.tar.gz
      # tar -xvzf hbase-1.0.0-bin.tar.gz
      # sudo mv hbase-1.0.0 /usr/local/
      # rm hbase-1.0.0-bin.tar.gz
  • 152. Installing HBase
      Edit $HOME/.bashrc:
      # sudo vi $HOME/.bashrc
  • 153. Starting HBase shell
      ubuntu@ip-172-31-12-11:~$ start-hbase.sh
      starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out
      ubuntu@ip-172-31-12-11:~$ jps
      3064 TaskTracker
      2836 SecondaryNameNode
      2588 NameNode
      3513 Jps
      3327 HMaster
      2938 JobTracker
      2707 DataNode
      ubuntu@ip-172-31-12-11:~$ hbase shell
      HBase Shell; enter 'help<RETURN>' for list of supported commands.
      Type "exit<RETURN>" to leave the HBase Shell
      Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013
      hbase(main):001:0>
  • 154. Create a table and insert data in HBase
      hbase(main):009:0> create 'test', 'cf'
      0 row(s) in 1.0830 seconds
      hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1'
      0 row(s) in 0.0750 seconds
      hbase(main):011:0> scan 'test'
      ROW      COLUMN+CELL
      row1     column=cf:a, timestamp=1375363287644, value=val1
      1 row(s) in 0.0640 seconds
      hbase(main):002:0> get 'test', 'row1'
      COLUMN   CELL
      cf:a     timestamp=1375363287644, value=val1
      1 row(s) in 0.0370 seconds
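The shell session above touches the core of the HBase data model: a table has column families, each cell is addressed by row key and family:qualifier, and every value carries a timestamp. The toy in-memory class below sketches that model in Python so the create, put, scan, and get steps can be followed without a cluster. ToyTable is a made-up name and this is not the HBase client API; real HBase persists data in regions on HDFS and compares row keys as bytes.

```python
import time
from collections import defaultdict
from typing import Dict, Iterator, Optional, Tuple


class ToyTable:
    """In-memory imitation of one HBase table (illustrative only)."""

    def __init__(self, name: str, *families: str) -> None:
        self.name = name
        self.families = set(families)
        # row key -> "family:qualifier" -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row: str, column: str, value: str,
            ts: Optional[int] = None) -> None:
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if ts is None:
            ts = int(time.time() * 1000)  # milliseconds, as in the shell output
        self.rows[row][column][ts] = value

    def get(self, row: str) -> Dict[str, str]:
        """Return the newest version of every column in the row."""
        if row not in self.rows:
            return {}
        return {col: versions[max(versions)]
                for col, versions in self.rows[row].items()}

    def scan(self) -> Iterator[Tuple[str, Dict[str, str]]]:
        """Yield rows in sorted row-key order."""
        for row in sorted(self.rows):
            yield row, self.get(row)


if __name__ == "__main__":
    test = ToyTable("test", "cf")          # create 'test', 'cf'
    test.put("row1", "cf:a", "val1")       # put 'test', 'row1', 'cf:a', 'val1'
    for row, cells in test.scan():         # scan 'test'
        print(row, cells)
    print(test.get("row1"))                # get 'test', 'row1'
```

Writing a second value to the same cell does not overwrite the first in this model; it adds a newer version, which is the behaviour the "versioned data" use case relies on.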
  • 155. Recommendation to Further Study
  • 156. Thank you
      www.imcinstitute.com
      www.facebook.com/imcinstitute