Big Data on Public Cloud 
Using Cloudera on GoGrid 
& Amazon EMR 
Hands On Workshop 
October 2014 
Dr.Thanachart Numnonda 
IMC Institute 
thanachart@imcinstitute.com 
Modified from the original version by Danairat T. 
Certified Java Programmer, TOGAF – Silver 
danairat@gmail.com 
Hands-On: Running Hadoop 
Running this lab using Cloudera Live 
Using Cloudera Live on GoGrid 
Alternatively Using Cloudera VM 
Starting Hue on Cloudera 
Viewing HDFS 
Hands-On: Importing/Exporting Data to HDFS 
Importing Data to Hadoop 
Download War and Peace Full Text 
www.gutenberg.org/ebooks/2600 
$hadoop fs -mkdir input 
$hadoop fs -mkdir output 
$hadoop fs -copyFromLocal Downloads/pg2600.txt input 
Review file in Hadoop HDFS 
List HDFS File 
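A typical listing command (the directory name assumes the input folder created above): 
[hdadmin@localhost bin]$ hadoop fs -ls input 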
Read HDFS File 
[hdadmin@localhost bin]$ hadoop fs -cat input/pg2600.txt 
Retrieve HDFS File to Local File System 
[hdadmin@localhost bin]$ hadoop fs -copyToLocal input/pg2600.txt tmp/file.txt 
Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html 
Review file in Hadoop HDFS using File Browser 
Review file in Hadoop HDFS using Hue 
Hadoop Port Numbers 
Daemon                   Default Port   Configuration Parameter in conf/*-site.xml 
HDFS NameNode            50070          dfs.http.address 
HDFS DataNodes           50075          dfs.datanode.http.address 
HDFS SecondaryNameNode   50090          dfs.secondary.http.address 
MR JobTracker            50030          mapred.job.tracker.http.address 
MR TaskTrackers          50060          mapred.task.tracker.http.address 
Removing data from HDFS using Shell Command 
[hdadmin@localhost detach]$ hadoop fs -rm input/pg2600.txt 
Deleted hdfs://localhost:54310/input/pg2600.txt 
[hdadmin@localhost detach]$ 
Lecture: Understanding Map Reduce Processing 
[Diagram: a client submits a Map Reduce job; the NameNode and JobTracker coordinate the DataNodes, each of which runs a TaskTracker.] 
High Level Architecture of MapReduce 
Before MapReduce… 
● Large scale data processing was difficult! 
– Managing hundreds or thousands of processors 
– Managing parallelization and distribution 
– I/O Scheduling 
– Status and monitoring 
– Fault/crash tolerance 
● MapReduce provides all of these, easily! 
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html 
MapReduce Overview 
● What is it? 
– Programming model used by Google 
– A combination of the Map and Reduce models with an 
associated implementation 
– Used for processing and generating large data sets 
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html 
MapReduce Overview 
● How does it solve our previously mentioned problems? 
– MapReduce is highly scalable and can be used across many 
computers. 
– Many small machines can be used to process jobs that 
normally could not be processed by a large machine. 
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html 
MapReduce Framework 
Source: www.bigdatauniversity.com 
How does the MapReduce work? 
[Diagram: an InputSplit is read by a RecordReader and the map phase outputs a list of (Key, Value) pairs to an intermediate file; after Partitioning and Sorting, the reduce phase receives (Key, List of Values) pairs and a RecordWriter writes the final output.] 
Map Abstraction 
● Inputs a key/value pair 
– Key is a reference to the input value 
– Value is the data set on which to operate 
● Evaluation 
– Function defined by user 
– Applies to every value in value input 
● Might need to parse input 
● Produces a new list of key/value pairs 
– Can be different type from input pair 
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html 
Reduce Abstraction 
● Starts with intermediate Key / Value pairs 
● Ends with finalized Key / Value pairs 
● Starting pairs are sorted by key 
● Iterator supplies the values for a given key to the 
Reduce function. 
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html 
Reduce Abstraction 
● Typically a function that: 
– Starts with a large number of key/value pairs 
● One key/value for each word in all files being grepped 
(including multiple entries for the same word) 
– Ends with very few key/value pairs 
● One key/value for each unique word across all the files with 
the number of instances summed into this entry 
● Broken up so a given worker works with input of the 
same key. 
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html 
How Map and Reduce Work Together 
● Map returns information 
● Reduce accepts information 
● Reduce applies a user defined function to reduce the 
amount of data 
Other Applications 
● Yahoo! 
– Webmap application uses Hadoop to create a database of 
information on all known webpages 
● Facebook 
– Hive data center uses Hadoop to provide business statistics to 
application developers and advertisers 
● Rackspace 
– Analyzes server log files and usage data using Hadoop 
MapReduce Framework 
map: (K1, V1) -> list(K2, V2) 
reduce: (K2, list(V2)) -> list(K3, V3) 
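For the word count example on the following slides, a concrete reading of these signatures (not additional API) is: 
map: (LongWritable offset, Text line) -> list(Text word, IntWritable 1) 
reduce: (Text word, list(IntWritable counts)) -> list(Text word, IntWritable sum) 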
Hands-On: Writing your own Map Reduce Program 
Wordcount (HelloWorld in Hadoop) 
package org.myorg; 

import java.io.IOException; 
import java.util.*; 

import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.conf.*; 
import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapred.*; 
import org.apache.hadoop.util.*; 

public class WordCount { 

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { 
    private final static IntWritable one = new IntWritable(1); 
    private Text word = new Text(); 

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { 
      String line = value.toString(); 
      StringTokenizer tokenizer = new StringTokenizer(line); 
      while (tokenizer.hasMoreTokens()) { 
        word.set(tokenizer.nextToken()); 
        output.collect(word, one); 
      } 
    } 
  } 
Wordcount (HelloWorld in Hadoop) 
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { 
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { 
      int sum = 0; 
      while (values.hasNext()) { 
        sum += values.next().get(); 
      } 
      output.collect(key, new IntWritable(sum)); 
    } 
  } 
Wordcount (HelloWorld in Hadoop) 
  public static void main(String[] args) throws Exception { 
    JobConf conf = new JobConf(WordCount.class); 
    conf.setJobName("wordcount"); 

    conf.setOutputKeyClass(Text.class); 
    conf.setOutputValueClass(IntWritable.class); 

    conf.setMapperClass(Map.class); 
    conf.setReducerClass(Reduce.class); 

    conf.setInputFormat(TextInputFormat.class); 
    conf.setOutputFormat(TextOutputFormat.class); 

    FileInputFormat.setInputPaths(conf, new Path(args[0])); 
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); 

    JobClient.runJob(conf); 
  } 
} 
Hands-On: Packaging Map Reduce and Deploying to Hadoop Runtime Environment 
Packaging Map Reduce Program 
Usage 
Assuming we have already downloaded wordcount.jar into the folder /Downloads: 
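A minimal sketch of the packaging and submission commands (class-path handling and local paths are assumptions): 
$ mkdir wordcount_classes 
$ javac -classpath $(hadoop classpath) -d wordcount_classes WordCount.java 
$ jar -cvf wordcount.jar -C wordcount_classes/ . 
$ hadoop jar Downloads/wordcount.jar org.myorg.WordCount input output 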
Reviewing MapReduce Job in Hue 
Reviewing MapReduce Output Result 
Running MapReduce Job Using Oozie: Select Java 
Hands-On: Running WordCount.jar on Amazon EMR 
Architecture Overview of Amazon EMR 
Creating an AWS account 
Signing up for the necessary services 
● Simple Storage Service (S3) 
● Elastic Compute Cloud (EC2) 
● Elastic MapReduce (EMR) 
Caution! This costs real money! 
Creating Amazon S3 bucket 
Upload .jar file and input file to Amazon S3 
1. Select <yourbucket> in Amazon S3 service 
2. Create folder : applications 
3. Upload wordcount.jar to the applications folder 
4. Create another folder: input 
5. Upload input_test.txt to the input folder 
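The same upload can also be scripted; a minimal sketch assuming the AWS CLI is installed and configured, with <yourbucket> as a placeholder: 
$ aws s3 cp wordcount.jar s3://<yourbucket>/applications/wordcount.jar 
$ aws s3 cp input_test.txt s3://<yourbucket>/input/input_test.txt 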
Create a new cluster in EMR 
Assign a cluster name and log location 
Assign hardware configuration 
Select Custom JAR 
Input JAR Location and Arguments 
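A typical configuration for this step (the bucket name is a placeholder; the main class is only needed as the first argument if it is not declared in the jar's manifest): 
JAR location: s3://<yourbucket>/applications/wordcount.jar 
Arguments:    org.myorg.WordCount s3://<yourbucket>/input s3://<yourbucket>/output 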
Select Create Cluster 
Monitoring the cluster 
View Result from the S3 bucket 
Lecture: Understanding Hive 
Introduction 
A Petabyte Scale Data Warehouse Using Hadoop 
Hive was developed by Facebook and is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL. 
What Hive is NOT 
Hive is not designed for online transaction processing and 
does not offer real-time queries and row level updates. It is 
best used for batch jobs over large sets of immutable data 
(like web logs, etc.). 
Hive Metastore 
● Store Hive metadata 
● Configurations 
– Embedded: in-process metastore, in-process database 
– Local: in-process metastore, out-of-process database 
– Remote: out-of-process metastore, out-of-process database 
Hive Schema-On-Read 
● Faster loads into the database (simply copy or move) 
● Slower queries 
● Flexibility – multiple schemas for the same data 
HiveQL 
● Hive Query Language 
● SQL dialect 
● No support for: 
– UPDATE, DELETE 
– Transactions 
– Indexes 
– HAVING clause in SELECT 
– Updateable or materialized views 
– Stored procedures 
Hive Tables 
● Managed- CREATE TABLE 
– LOAD- File moved into Hive's data warehouse directory 
– DROP- Both data and metadata are deleted. 
● External- CREATE EXTERNAL TABLE 
– LOAD- No file moved 
– DROP- Only metadata deleted 
– Use when sharing data between Hive and Hadoop applications 
or you want to use multiple schema on the same data 
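A minimal HiveQL sketch of the two table types (column names and the HDFS location are assumptions): 
CREATE TABLE managed_tbl (id INT, country STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; 

CREATE EXTERNAL TABLE external_tbl (id INT, country STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE 
LOCATION '/user/hdadmin/external/country'; 

-- DROP TABLE managed_tbl removes both data and metadata; 
-- DROP TABLE external_tbl removes only the metadata. 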
Running Hive 
Hive Shell 
● Interactive 
hive 
● Script 
hive -f myscript 
● Inline 
hive -e 'SELECT * FROM mytable' 
Hive.apache.org 
System Architecture and Components 
• Metastore: To store the meta data. 
• Query compiler and execution engine: To convert SQL queries to a 
sequence of map/reduce jobs that are then executed on Hadoop. 
• SerDe and ObjectInspectors: Programmable interfaces and 
implementations of common data formats and types. 
A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary 
representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java 
object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. 
• UDF and UDAF: Programmable interfaces and implementations for 
user defined functions (scalar and aggregate functions). 
• Clients: Command line client similar to the MySQL command line. 
hive.apache.org 
Architecture Overview 
[Diagram: Hive architecture – Hive CLI, Web UI and Thrift API clients on top; Hive QL parser, planner and execution engine with SerDe support (Thrift, Jute, JSON, ...); DDL and MetaStore; queries are executed as Map Reduce jobs over HDFS.] 
hive.apache.org 
Sample HiveQL 
The query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following queries: 
SELECT * FROM t where t.c = 'xyz' 
SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1) 
SELECT t1.c1, count(1) from t1 group by t1.c1 
Hive.apache.org 
Hands-On: Creating Table and Retrieving Data using Hive 
Running Hive from terminal 
Starting Hive 
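The start command (as also used later in this workshop): 
[hdadmin@localhost ~]$ hive 
hive (default)> 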
Quit from Hive 
hive> quit; 
Starting Hive Editor from Hue 
Scroll down the web page 
Creating Hive Table 
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; 
OK 
Time taken: 4.069 seconds 
hive (default)> show tables; 
OK 
test_tbl 
Time taken: 0.138 seconds 
hive (default)> describe test_tbl; 
OK 
id int 
country string 
Time taken: 0.147 seconds 
hive (default)> 
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html 
Using Hue Query Editor 
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html 
Reviewing Hive Table in HDFS 
Alter and Drop Hive Table 
hive (default)> alter table test_tbl add columns (remarks STRING); 
hive (default)> describe test_tbl; 
OK 
id int 
country string 
remarks string 
Time taken: 0.077 seconds 
hive (default)> drop table test_tbl; 
OK 
Time taken: 0.9 seconds 
See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html 
Loading Data to Hive Table 
Creating Hive table 
$ hive 
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; 
Loading data to Hive table 
hive (default)> LOAD DATA LOCAL INPATH '/tmp/country.csv' INTO TABLE test_tbl; 
Copying data from file:/tmp/test_tbl_data.csv 
Copying file: file:/tmp/test_tbl_data.csv 
Loading data to table default.test_tbl 
OK 
Time taken: 0.241 seconds 
hive (default)> 
Querying Data from Hive Table 
hive (default)> select * from test_tbl; 
OK 
1 USA 
62 Indonesia 
63 Philippines 
65 Singapore 
66 Thailand 
Time taken: 0.287 seconds 
hive (default)> 
Insert Overwriting the Hive Table 
hive (default)> LOAD DATA LOCAL INPATH 
'/tmp/test_tbl_data_updated.csv' overwrite INTO TABLE test_tbl; 
Copying data from file:/tmp/test_tbl_data_updated.csv 
Copying file: file:/tmp/test_tbl_data_updated.csv 
Loading data to table default.test_tbl 
Deleted hdfs://localhost:54310/user/hive/warehouse/test_tbl 
OK 
Time taken: 0.204 seconds 
hive (default)> 
Lecture: Understanding Pig 
Introduction 
A high-level platform for creating MapReduce programs using Hadoop 
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. 
Pig Components 
● Two Components 
● Language (Pig Latin) 
● Compiler 
● Two Execution Environments 
● Local 
pig -x local 
● Distributed 
pig -x mapreduce 
pig.apache.org 
Running Pig 
● Script 
pig myscript 
● Command line (Grunt) 
pig 
● Embedded 
Writing a java program 
pig.apache.org 
Pig Latin 
pig.apache.org 
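The original slide illustrates Pig Latin statements; a minimal sketch of the core operations (file name and schema are assumptions): 
records = LOAD 'hdi-data.csv' USING PigStorage(',') AS (id:int, country:chararray, gni:int); 
rich = FILTER records BY gni > 10000; 
grouped = GROUP rich BY country; 
counts = FOREACH grouped GENERATE group, COUNT(rich); 
DUMP counts; 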
Pig Execution Stages 
Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi 
Why Pig? 
● Makes writing Hadoop jobs easier 
● 5% of the code, 5% of the time 
● You don't need to be a programmer to write Pig scripts 
● Provide major functionality required for 
DatawareHouse and Analytics 
● Load, Filter, Join, Group By, Order, Transform 
● User can write custom UDFs (User Defined Function) 
Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi 
Hands-On: Running a Pig script 
Starting Pig Command Line 
[hdadmin@localhost ~]$ pig -x local 
2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Apache Pig 
version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53 
2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Logging error 
messages to: /home/hdadmin/pig_1375327740024.log 
2013-08-01 10:29:00,066 [main] INFO org.apache.pig.impl.util.Utils - 
Default bootup file /home/hdadmin/.pigbootup not found 
2013-08-01 10:29:00,212 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
to hadoop file system at: file:/// 
grunt> 
Starting Pig from Hue 
Writing a Pig Script 
#Preparing Data 
Download hdi-data.csv 
#Edit Your Script 
[hdadmin@localhost ~]$ cd Downloads/ 
[hdadmin@localhost ~]$ vi countryFilter.pig 
countryFilter.pig 
A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int); 
B = FILTER A BY gni > 2000; 
C = ORDER B BY gni; 
dump C; 
Running a Pig Script 
[hdadmin@localhost ~]$ cd Downloads 
[hdadmin@localhost ~]$ pig -x local 
grunt > run countryFilter.pig 
.... 
(150,Cameroon,0.482,51,5,10,2031) 
(126,Kyrgyzstan,0.615,67,9,12,2036) 
(156,Nigeria,0.459,51,5,8,2069) 
(154,Yemen,0.462,65,2,8,2213) 
(138,Lao People's Democratic Republic,0.524,67,4,9,2242) 
(153,Papua New Guinea,0.466,62,4,5,2271) 
(165,Djibouti,0.43,57,3,5,2335) 
(129,Nicaragua,0.589,74,5,10,2430) 
(145,Pakistan,0.504,65,4,6,2550) 
Lecture: Understanding Sqoop 
Introduction 
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line 
tool with the following capabilities: 
• Imports individual tables or entire databases to files in 
HDFS 
• Generates Java classes to allow you to interact with your 
imported data 
• Provides the ability to import from SQL databases straight 
into your Hive data warehouse 
See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html 
Architecture Overview 
sqoop.apache.org 
Hands-On: Loading Data from DBMS to Hadoop HDFS 
Loading Data into MySQL DB 
[root@localhost ~]# service mysqld start 
Starting mysqld: [ OK ] 
[root@localhost ~]# mysql 
mysql> create database countrydb; 
Query OK, 1 row affected (0.00 sec) 
mysql> use countrydb; 
Database changed 
mysql> create table country_tbl(id INT, country VARCHAR(100)); 
Query OK, 0 rows affected (0.02 sec) 
mysql> LOAD DATA LOCAL INFILE '/tmp/country_data.csv' INTO TABLE 
country_tbl FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n'; 
Query OK, 237 rows affected, 4 warnings (0.00 sec) 
Records: 237 Deleted: 0 Skipped: 0 Warnings: 0 
Loading Data into MySQL DB 
Testing data query from MySQL DB 
mysql> select * from country_tbl; 
+------+------------------------------+ 
| id | country | 
+------+------------------------------+ 
| 93 | Afghanistan | 
| 355 | Albania | 
| 213 | Algeria | 
| 1684 | AmericanSamoa | 
| 376 | Andorra | 
... 
+------+------------------------------+ 
237 rows in set (0.00 sec) 
mysql> 
Installing DB driver for Sqoop 
[hdadmin@localhost conf]$ su - 
Password: 
[root@localhost ~]# cp /01_hadoop_workshop/03_sqoop1.4.2/mysql-connector-java-5.0.8-bin.jar /usr/local/sqoop-1.4.2.bin__hadoop-0.20/lib/ 
[root@localhost ~]# chown hdadmin:hadoop /usr/local/sqoop-1.4.2.bin__hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar 
[root@localhost ~]# exit 
Importing data from MySQL to Hive Table 
[hdadmin@localhost ~]$ sqoop import --connect 
jdbc:mysql://localhost/countrydb --username root -P --table 
country_tbl --hive-import --hive-table country_tbl -m 1 
Warning: /usr/lib/hbase does not exist! HBase imports will fail. 
Please set $HBASE_HOME to the root of your HBase installation. 
Warning: $HADOOP_HOME is deprecated. 
Enter password: <enter here> 
13/03/21 18:07:43 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override 
13/03/21 18:07:43 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc. 
13/03/21 18:07:43 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset. 
13/03/21 18:07:43 INFO tool.CodeGenTool: Beginning code generation 
13/03/21 18:07:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1 
13/03/21 18:07:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1 
13/03/21 18:07:44 INFO orm.CompilationManager: HADOOP_HOME is /usr/local/hadoop/libexec/.. 
Note: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.java uses or overrides a deprecated API. 
Note: Recompile with -Xlint:deprecation for details. 
13/03/21 18:07:44 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.jar 
13/03/21 18:07:44 WARN manager.MySQLManager: It looks like you are importing from mysql. 
13/03/21 18:07:44 WARN manager.MySQLManager: This transfer can be faster! Use the --direct 
13/03/21 18:07:44 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path. 
13/03/21 18:07:44 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql) 
13/03/21 18:07:44 INFO mapreduce.ImportJobBase: Beginning import of country_tbl 
13/03/21 18:07:45 INFO mapred.JobClient: Running job: job_201303211744_0001 
13/03/21 18:07:46 INFO mapred.JobClient: map 0% reduce 0% 
13/03/21 18:08:02 INFO mapred.JobClient: map 100% reduce 0% 
13/03/21 18:08:07 INFO mapred.JobClient: Job complete: job_201303211744_0001 
13/03/21 18:08:07 INFO mapred.JobClient: Counters: 18 
13/03/21 18:08:07 INFO mapred.JobClient: Job Counters 
13/03/21 18:08:07 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12154 
13/03/21 18:08:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 
13/03/21 18:08:07 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 
13/03/21 18:08:07 INFO mapred.JobClient: Launched map tasks=1
Reviewing data from Hive Table 
[hdadmin@localhost ~]$ hive 
Logging initialized using configuration in file:/usr/local/hive-0.9.0-bin/conf/hive-log4j.properties 
Hive history file=/tmp/hdadmin/hive_job_log_hdadmin_201303211810_964909984.txt 
hive (default)> show tables; 
OK 
country_tbl 
test_tbl 
Time taken: 2.566 seconds 
hive (default)> select * from country_tbl; 
OK 
93 Afghanistan 
355 Albania 
..... 
Time taken: 0.587 seconds 
hive (default)> quit; 
[hdadmin@localhost ~]$ 
Reviewing HDFS Database Table files 
Start Web Browser to http://localhost:50070/ then navigate to /user/hive/warehouse 
Lecture: Understanding HBase 
Introduction 
An open source, non-relational, distributed database 
HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data. 
HBase Features 
● Hadoop database modelled after Google's Bigtable 
● Column oriented data store, known as Hadoop Database 
● Supports random realtime CRUD operations (unlike HDFS) 
● NoSQL database 
● Open source, written in Java 
● Runs on a cluster of commodity hardware 
hbase.apache.org 
When to use HBase? 
● When you need high volume data to be stored 
● Un-structured data 
● Sparse data 
● Column-oriented data 
● Versioned data (same data template, captured at various 
time, time-elapse data) 
● When you need high scalability 
hbase.apache.org 
Which one to use? 
● HDFS 
● Only append dataset (no random write) 
● Read the whole dataset (no random read) 
● HBase 
● Need random write and/or read 
● Has thousands of operation per second on TB+ of data 
● RDBMS 
● Data fits on one big node 
● Need full transaction support 
● Need real-time query capabilities 
hbase.apache.org 
HBase Components 
hbase.apache.org 
● Region 
● Rows of the table are stored 
● Region Server 
● Hosts the tables 
● Master 
● Coordinating the Region 
Servers 
● ZooKeeper 
● HDFS 
● API 
● The Java Client API 
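A minimal sketch of the Java client API listed above, using the classic HTable interface from the HBase 0.94 era (table and column names match the shell hands-on that follows): 

import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.hbase.HBaseConfiguration; 
import org.apache.hadoop.hbase.client.Get; 
import org.apache.hadoop.hbase.client.HTable; 
import org.apache.hadoop.hbase.client.Put; 
import org.apache.hadoop.hbase.client.Result; 
import org.apache.hadoop.hbase.util.Bytes; 

public class HBaseClientSketch { 
  public static void main(String[] args) throws Exception { 
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath 
    HTable table = new HTable(conf, "test");             // table created in the shell hands-on 
    Put put = new Put(Bytes.toBytes("row1")); 
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes("val1")); 
    table.put(put);                                      // write one cell 
    Result result = table.get(new Get(Bytes.toBytes("row1"))); 
    System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("a")))); 
    table.close(); 
  } 
} 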
HBase Shell Commands 
hbase.apache.org 
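The original slide lists the shell commands; commonly used ones include: status, version, list, create, describe, put, get, scan, delete, disable, enable, drop, exit. 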
Hands-On: Running HBase 
Starting HBase shell 
[hdadmin@localhost ~]$ start-hbase.sh 
starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out 
[hdadmin@localhost ~]$ jps 
3064 TaskTracker 
2836 SecondaryNameNode 
2588 NameNode 
3513 Jps 
3327 HMaster 
2938 JobTracker 
2707 DataNode 
[hdadmin@localhost ~]$ hbase shell 
HBase Shell; enter 'help<RETURN>' for list of supported commands. 
Type "exit<RETURN>" to leave the HBase Shell 
Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013 
hbase(main):001:0> 
Create a table and insert data in HBase 
hbase(main):009:0> create 'test', 'cf' 
0 row(s) in 1.0830 seconds 
hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1' 
0 row(s) in 0.0750 seconds 
hbase(main):011:0> scan 'test' 
ROW COLUMN+CELL 
row1       column=cf:a, timestamp=1375363287644, value=val1 
1 row(s) in 0.0640 seconds 
hbase(main):002:0> get 'test', 'row1' 
COLUMN CELL 
cf:a timestamp=1375363287644, value=val1 
1 row(s) in 0.0370 seconds 
Using Data Browsers in Hue for HBase 
Recommendation to Further Study 
Thank you 
www.imcinstitute.com 
www.facebook.com/imcinstitute 

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Spark in Action
Apache Spark in ActionApache Spark in Action
Apache Spark in ActionIMC Institute
 
Install Apache Hadoop for Development/Production
Install Apache Hadoop for  Development/ProductionInstall Apache Hadoop for  Development/Production
Install Apache Hadoop for Development/ProductionIMC Institute
 
Big data processing using Cloudera Quickstart
Big data processing using Cloudera QuickstartBig data processing using Cloudera Quickstart
Big data processing using Cloudera QuickstartIMC Institute
 
Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)IMC Institute
 
Analyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and HiveAnalyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and HiveIMC Institute
 
Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2IMC Institute
 
Apache Spark & Hadoop : Train-the-trainer
Apache Spark & Hadoop : Train-the-trainerApache Spark & Hadoop : Train-the-trainer
Apache Spark & Hadoop : Train-the-trainerIMC Institute
 
Realtimeanalyticsattwitter strata2011-110204123031-phpapp02
Realtimeanalyticsattwitter strata2011-110204123031-phpapp02Realtimeanalyticsattwitter strata2011-110204123031-phpapp02
Realtimeanalyticsattwitter strata2011-110204123031-phpapp02matrixvn
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Building Data Products at LinkedIn with DataFu
Building Data Products at LinkedIn with DataFuBuilding Data Products at LinkedIn with DataFu
Building Data Products at LinkedIn with DataFuMatthew Hayes
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Nathan Bijnens
 
New developments in open source ecosystem spark3.0 koalas delta lake
New developments in open source ecosystem spark3.0 koalas delta lakeNew developments in open source ecosystem spark3.0 koalas delta lake
New developments in open source ecosystem spark3.0 koalas delta lakeXiao Li
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Simplilearn
 
Data 360 Conference: Introduction to Big Data, Hadoop and Big Data Analytics
Data 360 Conference: Introduction to Big Data, Hadoop and Big Data AnalyticsData 360 Conference: Introduction to Big Data, Hadoop and Big Data Analytics
Data 360 Conference: Introduction to Big Data, Hadoop and Big Data AnalyticsAvkash Chauhan
 
Reproducible datascience [with Terraform]
Reproducible datascience [with Terraform]Reproducible datascience [with Terraform]
Reproducible datascience [with Terraform]David Przybilla
 
Hadoop Career Path and Interview Preparation
Hadoop Career Path and Interview PreparationHadoop Career Path and Interview Preparation
Hadoop Career Path and Interview PreparationEdureka!
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdKevin Weil
 

Was ist angesagt? (20)

Apache Spark in Action
Apache Spark in ActionApache Spark in Action
Apache Spark in Action
 
Install Apache Hadoop for Development/Production
Install Apache Hadoop for  Development/ProductionInstall Apache Hadoop for  Development/Production
Install Apache Hadoop for Development/Production
 
Big data processing using Cloudera Quickstart
Big data processing using Cloudera QuickstartBig data processing using Cloudera Quickstart
Big data processing using Cloudera Quickstart
 
Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)
 
Analyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and HiveAnalyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and Hive
 
Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2
 
Apache Spark & Hadoop : Train-the-trainer
Apache Spark & Hadoop : Train-the-trainerApache Spark & Hadoop : Train-the-trainer
Apache Spark & Hadoop : Train-the-trainer
 
Setting up Hadoop YARN Clustering
Setting up Hadoop YARN ClusteringSetting up Hadoop YARN Clustering
Setting up Hadoop YARN Clustering
 
Realtimeanalyticsattwitter strata2011-110204123031-phpapp02
Realtimeanalyticsattwitter strata2011-110204123031-phpapp02Realtimeanalyticsattwitter strata2011-110204123031-phpapp02
Realtimeanalyticsattwitter strata2011-110204123031-phpapp02
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Building Data Products at LinkedIn with DataFu
Building Data Products at LinkedIn with DataFuBuilding Data Products at LinkedIn with DataFu
Building Data Products at LinkedIn with DataFu
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
 
New developments in open source ecosystem spark3.0 koalas delta lake
New developments in open source ecosystem spark3.0 koalas delta lakeNew developments in open source ecosystem spark3.0 koalas delta lake
New developments in open source ecosystem spark3.0 koalas delta lake
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
 
Data 360 Conference: Introduction to Big Data, Hadoop and Big Data Analytics
Data 360 Conference: Introduction to Big Data, Hadoop and Big Data AnalyticsData 360 Conference: Introduction to Big Data, Hadoop and Big Data Analytics
Data 360 Conference: Introduction to Big Data, Hadoop and Big Data Analytics
 
Reproducible datascience [with Terraform]
Reproducible datascience [with Terraform]Reproducible datascience [with Terraform]
Reproducible datascience [with Terraform]
 
Hadoop Career Path and Interview Preparation
Hadoop Career Path and Interview PreparationHadoop Career Path and Interview Preparation
Hadoop Career Path and Interview Preparation
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant bird
 

Andere mochten auch

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataIMC Institute
 
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัลCloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัลIMC Institute
 
Big Data on Public Cloud
Big Data on Public CloudBig Data on Public Cloud
Big Data on Public CloudIMC Institute
 
การบริหารจัดการระบบ Cloud Computing สำหรับองค์กรธุรกิจ SME
การบริหารจัดการระบบ  Cloud Computing  สำหรับองค์กรธุรกิจ SMEการบริหารจัดการระบบ  Cloud Computing  สำหรับองค์กรธุรกิจ SME
การบริหารจัดการระบบ Cloud Computing สำหรับองค์กรธุรกิจ SMEIMC Institute
 
Mahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud PlatformMahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud PlatformIMC Institute
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using MahoutIMC Institute
 
Introduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data ScienceIntroduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data ScienceIMC Institute
 
Thailand ICT Review 2014
Thailand ICT Review 2014Thailand ICT Review 2014
Thailand ICT Review 2014IMC Institute
 
Big Data as a Service
Big Data as a ServiceBig Data as a Service
Big Data as a ServiceIMC Institute
 
Mobile User and App Analytics in China
Mobile User and App Analytics in ChinaMobile User and App Analytics in China
Mobile User and App Analytics in ChinaIMC Institute
 
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016IMC Institute
 
Big data project management
Big data project managementBig data project management
Big data project managementIMC Institute
 
Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015IMC Institute
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibIMC Institute
 
เทคโนโลยี Cloud Computing สำหรับงานสถาบันการศึกษา
เทคโนโลยี  Cloud Computing  สำหรับงานสถาบันการศึกษาเทคโนโลยี  Cloud Computing  สำหรับงานสถาบันการศึกษา
เทคโนโลยี Cloud Computing สำหรับงานสถาบันการศึกษาIMC Institute
 
บทความ Big Data School ใน IMC e-Magazine
บทความ Big Data School ใน IMC e-Magazineบทความ Big Data School ใน IMC e-Magazine
บทความ Big Data School ใน IMC e-MagazineIMC Institute
 

Andere mochten auch (18)

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัลCloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
 
Big Data on Public Cloud
Big Data on Public CloudBig Data on Public Cloud
Big Data on Public Cloud
 
การบริหารจัดการระบบ Cloud Computing สำหรับองค์กรธุรกิจ SME
การบริหารจัดการระบบ  Cloud Computing  สำหรับองค์กรธุรกิจ SMEการบริหารจัดการระบบ  Cloud Computing  สำหรับองค์กรธุรกิจ SME
การบริหารจัดการระบบ Cloud Computing สำหรับองค์กรธุรกิจ SME
 
Mahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud PlatformMahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud Platform
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
 
Introduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data ScienceIntroduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data Science
 
Thailand ICT Review 2014
Thailand ICT Review 2014Thailand ICT Review 2014
Thailand ICT Review 2014
 
Big Data as a Service
Big Data as a ServiceBig Data as a Service
Big Data as a Service
 
Mobile User and App Analytics in China
Mobile User and App Analytics in ChinaMobile User and App Analytics in China
Mobile User and App Analytics in China
 
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
 
Big data project management
Big data project managementBig data project management
Big data project management
 
Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlib
 
เทคโนโลยี Cloud Computing สำหรับงานสถาบันการศึกษา
เทคโนโลยี  Cloud Computing  สำหรับงานสถาบันการศึกษาเทคโนโลยี  Cloud Computing  สำหรับงานสถาบันการศึกษา
เทคโนโลยี Cloud Computing สำหรับงานสถาบันการศึกษา
 
ITSS Overview
ITSS OverviewITSS Overview
ITSS Overview
 
บทความ Big Data School ใน IMC e-Magazine
บทความ Big Data School ใน IMC e-Magazineบทความ Big Data School ใน IMC e-Magazine
บทความ Big Data School ใน IMC e-Magazine
 

Ähnlich wie Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR

Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics

Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR

  • 1. Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR Hands On Workshop October 2014 Dr. Thanachart Numnonda IMC Institute thanachart@imcinstitute.com Modified from the original version by Danairat T. Certified Java Programmer, TOGAF – Silver danairat@gmail.com Danairat T., 2013, Big Data Hadoop – Hands On Workshop 1 danairat@gmail.com
  • 2. Hands-On: Running Hadoop Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 3. Running this lab using Cloudera Live Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 4. Using Cloudera Live on GoGrid Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 5. Alternatively Using Cloudera VM Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 6. Starting Hue on Cloudera Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 7. Viewing HDFS Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 8. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 9. Hands-On: Importing/Exporting Data to HDFS Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 10. Importing Data to Hadoop Download War and Peace Full Text www.gutenberg.org/ebooks/2600 $hadoop fs -mkdir input $hadoop fs -mkdir output $hadoop fs -copyFromLocal Downloads/pg2600.txt input Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 11. Review file in Hadoop HDFS List HDFS File Read HDFS File [hdadmin@localhost bin]$ hadoop fs -cat input/pg2600.txt Retrieve HDFS File to Local File System [hdadmin@localhost bin]$ hadoop fs -copyToLocal input/pg2600.txt tmp/file.txt Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 12. Review file in Hadoop HDFS using File Browse Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 13. Review file in Hadoop HDFS using Hue Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 14. Hadoop Port Numbers Daemon Default Port Configuration Parameter in conf/*-site.xml HDFS Namenode 50070 dfs.http.address Datanodes 50075 dfs.datanode.http.address Secondarynamenode 50090 dfs.secondary.http.address MR JobTracker 50030 mapred.job.tracker.http.address Tasktrackers 50060 mapred.task.tracker.http.address Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
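As a quick sanity check, these default ports can be probed from inside the VM; a minimal sketch, assuming curl is available on the machine (the workshop itself uses a web browser against the same NameNode URL later on):

$ curl -s http://localhost:50070/ | head    # NameNode web UI
$ curl -s http://localhost:50030/ | head    # JobTracker web UI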
  • 15. Removing data from HDFS using Shell Command hdadmin@localhost detach]$ hadoop fs -rm input/pg2600.txt Deleted hdfs://localhost:54310/input/pg2600.txt hdadmin@localhost detach]$ Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 16. Lecture: Understanding Map Reduce Processing (diagram: a Client submits a Map Reduce job to the Name Node / Job Tracker, which coordinates the Data Node / Task Tracker pairs in the cluster) Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 17. High Level Architecture of MapReduce Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 18. Before MapReduce… ● Large scale data processing was difficult! – Managing hundreds or thousands of processors – Managing parallelization and distribution – I/O Scheduling – Status and monitoring – Fault/crash tolerance ● MapReduce provides all of these, easily! Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html Danairat T., 2013, Big Data Hadoop – Hands On Workshop 18 danairat@gmail.com
  • 19. MapReduce Overview ● What is it? – Programming model used by Google – A combination of the Map and Reduce models with an associated implementation – Used for processing and generating large data sets Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html Danairat T., 2013, Big Data Hadoop – Hands On Workshop 19 danairat@gmail.com
  • 20. MapReduce Overview ● How does it solve our previously mentioned problems? – MapReduce is highly scalable and can be used across many computers. – Many small machines can be used to process jobs that normally could not be processed by a large machine. Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html Danairat T., 2013, Big Data Hadoop – Hands On Workshop 20 danairat@gmail.com
  • 21. MapReduce Framework Source: www.bigdatauniversity.com Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 22. How does the MapReduce work? (diagram: an InputSplit is read by a RecordReader and the map output is written as a list of (Key, Value) pairs to an intermediate file; Partitioning and Sorting then produce a list of (Key, List of Values) in the intermediate file for the reduce phase, whose final output is written by a RecordWriter) Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 23. How does the MapReduce work? (the same diagram repeated: InputSplit and RecordReader feed map, which emits (Key, Value) pairs to an intermediate file; Partitioning and Sorting produce (Key, List of Values) for reduce, and a RecordWriter writes the final output) Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 24. Map Abstraction ● Inputs a key/value pair – Key is a reference to the input value – Value is the data set on which to operate ● Evaluation – Function defined by user – Applies to every value in value input ● Might need to parse input ● Produces a new list of key/value pairs – Can be different type from input pair Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html Danairat T., 2013, Big Data Hadoop – Hands On Workshop 24 danairat@gmail.com
  • 25. Reduce Abstraction ● Starts with intermediate Key / Value pairs ● Ends with finalized Key / Value pairs ● Starting pairs are sorted by key ● Iterator supplies the values for a given key to the Reduce function. Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html Danairat T., 2013, Big Data Hadoop – Hands On Workshop 25 danairat@gmail.com
  • 26. Reduce Abstraction ● Typically a function that: – Starts with a large number of key/value pairs ● One key/value for each word in all files being grepped (including multiple entries for the same word) – Ends with very few key/value pairs ● One key/value for each unique word across all the files with the number of instances summed into this entry ● Broken up so a given worker works with input of the same key. Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html Danairat T., 2013, Big Data Hadoop – Hands On Workshop 26 danairat@gmail.com
  • 27. How Map and Reduce Work Together ● Map returns information ● Reduce accepts information ● Reduce applies a user-defined function to reduce the amount of data Danairat T., 2013, Big Data Hadoop – Hands On Workshop 27 danairat@gmail.com
  • 28. Other Applications ● Yahoo! – Webmap application uses Hadoop to create a database of information on all known webpages ● Facebook – Hive data center uses Hadoop to provide business statistics to application developers and advertisers ● Rackspace – Analyzes server log files and usage data using Hadoop Danairat T., 2013, Big Data Hadoop – Hands On Workshop 28 danairat@gmail.com
  • 29. MapReduce Framework map: (K1, V1) -> list(K2, V2) reduce: (K2, list(V2)) -> list(K3, V3) Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
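To make the signatures above concrete, here is a small worked word-count trace (the input line is an illustrative example, not taken from the workshop data set):

map input (K1, V1):          (0, "to be or not to be")    where K1 is the byte offset and V1 the line of text
map output list(K2, V2):     ("to",1) ("be",1) ("or",1) ("not",1) ("to",1) ("be",1)
shuffle/sort (K2, list(V2)): ("be",[1,1]) ("not",[1]) ("or",[1]) ("to",[1,1])
reduce output list(K3, V3):  ("be",2) ("not",1) ("or",1) ("to",2)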
  • 30. Hands-On: Writing your own Map Reduce Program Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 31. Wordcount (HelloWord in Hadoop) 1. package org.myorg; 2. 3. import java.io.IOException; 4. import java.util.*; 5. 6. import org.apache.hadoop.fs.Path; 7. import org.apache.hadoop.conf.*; 8. import org.apache.hadoop.io.*; 9. import org.apache.hadoop.mapred.*; 10. import org.apache.hadoop.util.*; 11. 12. public class WordCount { 13. 14. public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { 15. private final static IntWritable one = new IntWritable(1); 16. private Text word = new Text(); 17. 18. public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { 19. String line = value.toString(); 20. StringTokenizer tokenizer = new StringTokenizer(line); 21. while (tokenizer.hasMoreTokens()) { 22. word.set(tokenizer.nextToken()); 23. output.collect(word, one); 24. } 25. } 26. } Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 32. Wordcount (HelloWord in Hadoop) 27. 28. public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { 29. public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { 30. int sum = 0; 31. while (values.hasNext()) { 32. sum += values.next().get(); 33. } 34. output.collect(key, new IntWritable(sum)); 35. } 36. } 37. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 33. Wordcount (HelloWord in Hadoop) 38. public static void main(String[] args) throws Exception { 39. JobConf conf = new JobConf(WordCount.class); 40. conf.setJobName("wordcount"); 41. 42. conf.setOutputKeyClass(Text.class); 43. conf.setOutputValueClass(IntWritable.class); 44. 45. conf.setMapperClass(Map.class); 46. 47. conf.setReducerClass(Reduce.class); 48. 49. conf.setInputFormat(TextInputFormat.class); 50. conf.setOutputFormat(TextOutputFormat.class); 51. 52. FileInputFormat.setInputPaths(conf, new Path(args[0])); 53. FileOutputFormat.setOutputPath(conf, new Path(args[1])); 54. 55. JobClient.runJob(conf); 57. } 58. } 59. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 34. Hands-On: Packaging Map Reduce and Deploying to Hadoop Runtime Environment Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 35. Packaging Map Reduce Program Usage Assuming we have already downloaded wordcount.jar into the folder /Downloads: Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
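A minimal command-line sketch of the compile/package/run cycle for the WordCount class shown earlier; the directory names are placeholders, and the sketch assumes the hadoop client script (and therefore "hadoop classpath") is available on the PATH of the VM:

$ mkdir wordcount_classes
$ javac -classpath "$(hadoop classpath)" -d wordcount_classes WordCount.java
$ jar cvf wordcount.jar -C wordcount_classes/ .
$ hadoop jar wordcount.jar org.myorg.WordCount input output
$ hadoop fs -cat output/part-00000 | head

If the job is re-run, delete the output directory first (hadoop fs -rmr output on Hadoop 1.x shells, hadoop fs -rm -r output on newer ones), since MapReduce refuses to overwrite an existing output path.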
  • 36. Reviewing MapReduce Job in Hue Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 37. Reviewing MapReduce Job in Hue Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 38. Reviewing MapReduce Output Result Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 39. Reviewing MapReduce Output Result Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 40. Reviewing MapReduce Output Result Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 41. Running MapReduce Job Using Oozie : Select Java Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 42. Hands-On: Running WordCount.jar on Amazon EMR Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 43. Architecture Overview of Amazon EMR Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 44. Creating an AWS account Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 45. Signing up for the necessary services ● Simple Storage Service (S3) ● Elastic Compute Cloud (EC2) ● Elastic MapReduce (EMR) Caution! This costs real money! Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 46. Creating Amazon S3 bucket Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 47. Upload .jar file and input file to Amazon S3 1. Select <yourbucket> in Amazon S3 service 2. Create folder : applications 3. Upload wordcount.jar to the applications folder 4. Create another folder: input 5. Upload input_test.txt to the input folder Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
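The same uploads can be scripted instead of using the S3 web console; a sketch with the AWS CLI, assuming it is installed and configured with your credentials (<yourbucket> is the placeholder used above):

$ aws s3 cp wordcount.jar s3://<yourbucket>/applications/
$ aws s3 cp input_test.txt s3://<yourbucket>/input/
$ aws s3 ls s3://<yourbucket>/ --recursive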
  • 48. Create a new cluster in EMR Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 49. Assign a cluster name and log location Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 50. Assign hardware configuration Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 51. Select Custom JAR Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 52. Input JAR Location and Arguments Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
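For the wordcount example, the step configuration might look roughly like the sketch below (bucket and folder names follow the placeholders created earlier; passing the class name as the first argument assumes the jar's manifest does not already declare a Main-Class). Note that the output folder must not exist before the step runs.

JAR S3 location:  s3://<yourbucket>/applications/wordcount.jar
Arguments:        org.myorg.WordCount s3://<yourbucket>/input s3://<yourbucket>/output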
  • 53. Select Create Cluster Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 54. Monitoring the cluster Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 55. View Result from the S3 bucket Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 56. Lecture Understanding Hive Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 57. Introduction A Petabyte Scale Data Warehouse Using Hadoop Hive was developed by Facebook and is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 58. What Hive is NOT Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.). Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 59. Hive Metastore ● Stores Hive metadata ● Configurations – Embedded: in-process metastore, in-process database – Local: in-process metastore, out-of-process database – Remote: out-of-process metastore, out-of-process database Danairat T., 2013, Big Data Hadoop – Hands On Workshop 59 danairat@gmail.com
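As an illustration of the "local" layout above, a hive-site.xml fragment might look like this sketch (the MySQL database name and credentials are made-up examples, not the workshop VM's real settings):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore_db?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
<!-- leaving hive.metastore.uris unset keeps the metastore service in-process; setting it to a thrift:// URI switches to the remote configuration -->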
  • 60. Hive Schema-On-Read ● Faster loads into the database (simply copy or move) ● Slower queries ● Flexibility – multiple schemas for the same data Danairat T., 2013, Big Data Hadoop – Hands On Workshop 60 danairat@gmail.com
  • 61. HiveQL ● Hive Query Language ● SQL dialect ● No support for: – UPDATE, DELETE – Transactions – Indexes – HAVING clause in SELECT – Updateable or materialized views – Stored procedures Danairat T., 2013, Big Data Hadoop – Hands On Workshop 61 danairat@gmail.com
  • 62. Hive Tables ● Managed- CREATE TABLE – LOAD- File moved into Hive's data warehouse directory – DROP- Both data and metadata are deleted. ● External- CREATE EXTERNAL TABLE – LOAD- No file moved – DROP- Only metadata deleted – Use when sharing data between Hive and Hadoop applications or you want to use multiple schemas on the same data Danairat T., 2013, Big Data Hadoop – Hands On Workshop 62 danairat@gmail.com
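A short HiveQL sketch of the external variant (the HDFS location /user/hdadmin/country_ext is an assumed example path, not one created elsewhere in this workshop):

hive> CREATE EXTERNAL TABLE country_ext(id INT, country STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/user/hdadmin/country_ext';
hive> DROP TABLE country_ext;    -- removes only the metadata; the files under the LOCATION path are left in place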
  • 63. Running Hive Hive Shell ● Interactive hive ● Script hive -f myscript ● Inline hive -e 'SELECT * FROM mytable' Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 64. System Architecture and Components • Metastore: To store the meta data. • Query compiler and execution engine: To convert SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop. • SerDe and ObjectInspectors: Programmable interfaces and implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. • UDF and UDAF: Programmable interfaces and implementations for user defined functions (scalar and aggregate functions). • Clients: Command line client similar to Mysql command line. hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 65. Architecture Overview (diagram of the Hive stack: the Hive CLI, Browsing, Queries, Mgmt. Web UI and Thrift API interfaces sit on top of Hive itself, whose components include Hive QL, Parser, Planner, Execution, DDL, SerDe (Thrift, Jute, JSON..) and the MetaStore, all running over Map Reduce and HDFS) Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 66. Sample HiveQL The Query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following query SELECT * FROM t where t.c = 'xyz' SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1) SELECT t1.c1, count(1) from t1 group by t1.c1 Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 67. Hands-On: Creating Table and Retrieving Data using Hive Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 68. Running Hive from terminal Starting Hive $ hive Quit from Hive hive> quit; Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 69. Starting Hive Editor from Hue Scroll Down the web page Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 70. Creating Hive Table hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; OK Time taken: 4.069 seconds hive (default)> show tables; OK test_tbl Time taken: 0.138 seconds hive (default)> describe test_tbl; OK id int country string Time taken: 0.147 seconds hive (default)> See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 71. Using Hue Query Editor See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 72. Using Hue Query Editor See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 73. Using Hue Query Editor See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 74. Reviewing Hive Table in HDFS Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 75. Alter and Drop Hive Table hive (default)> alter table test_tbl add columns (remarks STRING); hive (default)> describe test_tbl; OK id int country string remarks string Time taken: 0.077 seconds hive (default)> drop table test_tbl; OK Time taken: 0.9 seconds See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 76. Loading Data to Hive Table Creating Hive table $ hive hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; Loading data to Hive table hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data.csv' INTO TABLE test_tbl; Copying data from file:/tmp/test_tbl_data.csv Copying file: file:/tmp/test_tbl_data.csv Loading data to table default.test_tbl OK Time taken: 0.241 seconds hive (default)> Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 77. Querying Data from Hive Table hive (default)> select * from test_tbl; OK 1 USA 62 Indonesia 63 Philippines 65 Singapore 66 Thailand Time taken: 0.287 seconds hive (default)> Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 78. Insert Overwriting the Hive Table hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data_updated.csv' overwrite INTO TABLE test_tbl; Copying data from file:/tmp/test_tbl_data_updated.csv Copying file: file:/tmp/test_tbl_data_updated.csv Loading data to table default.test_tbl Deleted hdfs://localhost:54310/user/hive/warehouse/test_tbl OK Time taken: 0.204 seconds hive (default)> Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 79. Lecture Understanding Pig Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 80. Introduction A high-level platform for creating MapReduce programs Using Hadoop Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 81. Pig Components ● Two Components ● Language (Pig Latin) ● Compiler ● Two Execution Environments ● Local pig -x local ● Distributed pig -x mapreduce Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 82. Running Pig ● Script pig myscript ● Command line (Grunt) pig ● Embedded Writing a java program Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 83. Pig Latin Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 84. Pig Execution Stages Source Introduction to Apache Hadoop-Pig: PrashantKommireddi Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 85. Why Pig? ● Makes writing Hadoop jobs easier ● 5% of the code, 5% of the time ● You don't need to be a programmer to write Pig scripts ● Provides major functionality required for Data Warehouse and Analytics ● Load, Filter, Join, Group By, Order, Transform ● Users can write custom UDFs (User Defined Functions) Source Introduction to Apache Hadoop-Pig: Prashant Kommireddi Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
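As a sketch of the Load / Filter / Group By style listed above, reusing the hdi-data.csv schema from the countryFilter.pig example later in this section (the choice of grouping column is illustrative):

grunt> A = LOAD 'hdi-data.csv' USING PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
grunt> B = FILTER A BY gni > 2000;
grunt> C = GROUP B BY lifeex;
grunt> D = FOREACH C GENERATE group AS lifeex, COUNT(B) AS num_countries;
grunt> DUMP D;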
  • 86. Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 87. Hands-On: Running a Pig script Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 88. Starting Pig Command Line [hdadmin@localhost ~]$ pig -x local 2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53 2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hdadmin/pig_1375327740024.log 2013-08-01 10:29:00,066 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hdadmin/.pigbootup not found 2013-08-01 10:29:00,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:/// grunt> Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 89. Starting Pig from Hue Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 90. Writing a Pig Script #Preparing Data Download hdi-data.csv #Edit Your Script [hdadmin@localhost ~]$ cd Downloads/ [hdadmin@localhost ~]$ vi countryFilter.pig countryFilter.pig A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int); B = FILTER A BY gni > 2000; C = ORDER B BY gni; dump C; Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 91. Running a Pig Script [hdadmin@localhost ~]$ cd Downloads [hdadmin@localhost ~]$ pig -x local grunt > run countryFilter.pig .... (150,Cameroon,0.482,51,5,10,2031) (126,Kyrgyzstan,0.615,67,9,12,2036) (156,Nigeria,0.459,51,5,8,2069) (154,Yemen,0.462,65,2,8,2213) (138,Lao People's Democratic Republic,0.524,67,4,9,2242) (153,Papua New Guinea,0.466,62,4,5,2271) (165,Djibouti,0.43,57,3,5,2335) (129,Nicaragua,0.589,74,5,10,2430) (145,Pakistan,0.504,65,4,6,2550) Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 92. Lecture: Understanding Sqoop Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 93. Introduction Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities: • Imports individual tables or entire databases to files in HDFS • Generates Java classes to allow you to interact with your imported data • Provides the ability to import from SQL databases straight into your Hive data warehouse See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
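Before running a full import it can be useful to confirm that Sqoop can reach the database at all; a sketch, assuming the JDBC driver has already been placed on Sqoop's classpath as shown later in this section, and using the same countrydb database that follows:

$ sqoop list-databases --connect jdbc:mysql://localhost/ --username root -P
$ sqoop list-tables --connect jdbc:mysql://localhost/countrydb --username root -P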
  • 94. Architecture Overview Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 95. Hands-On: Loading Data from DBMS to Hadoop HDFS Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 96. Loading Data into MySQL DB [root@localhost ~]# service mysqld start Starting mysqld: [ OK ] [root@localhost ~]# mysql mysql> create database countrydb; Query OK, 1 row affected (0.00 sec) mysql> use countrydb; Database changed mysql> create table country_tbl(id INT, country VARCHAR(100)); Query OK, 0 rows affected (0.02 sec) mysql> LOAD DATA LOCAL INFILE '/tmp/country_data.csv' INTO TABLE country_tbl FIELDS terminated by ',' LINES TERMINATED BY '\r\n'; Query OK, 237 rows affected, 4 warnings (0.00 sec) Records: 237 Deleted: 0 Skipped: 0 Warnings: 0 Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 97. Loading Data into MySQL DB Testing data query from MySQL DB mysql> select * from country_tbl; +------+------------------------------+ | id | country | +------+------------------------------+ | 93 | Afghanistan | | 355 | Albania | | 213 | Algeria | | 1684 | AmericanSamoa | | 376 | Andorra | ... +------+------------------------------+ 237 rows in set (0.00 sec) mysql> Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 98. Installing DB driver for Sqoop [hdadmin@localhost conf]$ su - Password: [root@localhost ~]# cp /01_hadoop_workshop/03_sqoop1.4.2/mysql-connector-java-5.0.8-bin.jar /usr/local/sqoop-1.4.2.bin__hadoop-0.20/lib/ [root@localhost ~]# chown hdadmin:hadoop /usr/local/sqoop-1.4.2.bin__hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar [root@localhost ~]# exit Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 99. Importing data from MySQL to Hive Table [hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://localhost/countrydb --username root -P --table country_tbl --hive-import --hive-table country_tbl -m 1 Warning: /usr/lib/hbase does not exist! HBase imports will fail. Please set $HBASE_HOME to the root of your HBase installation. Warning: $HADOOP_HOME is deprecated. Enter password: <enter here> 13/03/21 18:07:43 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override 13/03/21 18:07:43 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc. 13/03/21 18:07:43 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset. 13/03/21 18:07:43 INFO tool.CodeGenTool: Beginning code generation 13/03/21 18:07:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1 13/03/21 18:07:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1 13/03/21 18:07:44 INFO orm.CompilationManager: HADOOP_HOME is /usr/local/hadoop/libexec/.. Note: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 13/03/21 18:07:44 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.jar 13/03/21 18:07:44 WARN manager.MySQLManager: It looks like you are importing from mysql. 13/03/21 18:07:44 WARN manager.MySQLManager: This transfer can be faster! Use the --direct 13/03/21 18:07:44 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path. 13/03/21 18:07:44 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql) 13/03/21 18:07:44 INFO mapreduce.ImportJobBase: Beginning import of country_tbl 13/03/21 18:07:45 INFO mapred.JobClient: Running job: job_201303211744_0001 13/03/21 18:07:46 INFO mapred.JobClient: map 0% reduce 0% 13/03/21 18:08:02 INFO mapred.JobClient: map 100% reduce 0% 13/03/21 18:08:07 INFO mapred.JobClient: Job complete: job_201303211744_0001 13/03/21 18:08:07 INFO mapred.JobClient: Counters: 18 13/03/21 18:08:07 INFO mapred.JobClient: Job Counters 13/03/21 18:08:07 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12154 13/03/21 18:08:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 13/03/21 18:08:07 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 13/03/21 18:08:07 INFO mapred.JobClient: Launched map tasks=1 Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com, Big Data Hadoop on Amazon EMR – Hands On Workshop, Aug 2013
  • 100. Reviewing data from Hive Table [hdadmin@localhost ~]$ hive Logging initialized using configuration in file:/usr/local/hive-0.9.0- bin/conf/hive-log4j.properties Hive history file=/tmp/hdadmin/hive_job_log_hdadmin_201303211810_964909984.txt hive (default)> show tables; OK country_tbl test_tbl Time taken: 2.566 seconds hive (default)> select * from country_tbl; OK 93 Afghanistan 355 Albania ..... Time taken: 0.587 seconds hive (default)> quit; [hdadmin@localhost ~]$ Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 101. Reviewing HDFS Database Table files Start Web Browser to http://localhost:50070/ then navigate to /user/hive/warehouse Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 102. Lecture Understanding HBase Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 103. Introduction An open source, non-relational, distributed database HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 104. HBase Features ● Hadoop database modelled after Google's Bigtable ● Column-oriented data store, known as the Hadoop Database ● Supports random real-time CRUD operations (unlike HDFS) ● NoSQL database ● Open source, written in Java ● Runs on a cluster of commodity hardware Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 105. When to use HBase? ● When you need high-volume data to be stored ● Unstructured data ● Sparse data ● Column-oriented data ● Versioned data (same data template, captured at various times, time-lapse data) ● When you need high scalability Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 106. Which one to use? ● HDFS ● Append-only dataset (no random write) ● Read the whole dataset (no random read) ● HBase ● Need random write and/or read ● Has thousands of operations per second on TB+ of data ● RDBMS ● Data fits on one big node ● Need full transaction support ● Need real-time query capabilities Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 107. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 108. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 109. HBase Components Hive.apache.org ● Region ● Rows of a table are stored ● Region Server ● Hosts the tables ● Master ● Coordinates the Region Servers ● ZooKeeper ● HDFS ● API ● The Java Client API Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 110. HBase Shell Commands Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 111. Hands-On: Running HBase Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 112. Starting HBase shell [hdadmin@localhost ~]$ start-hbase.sh starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out [hdadmin@localhost ~]$ jps 3064 TaskTracker 2836 SecondaryNameNode 2588 NameNode 3513 Jps 3327 HMaster 2938 JobTracker 2707 DataNode [hdadmin@localhost ~]$ hbase shell HBase Shell; enter 'help<RETURN>' for list of supported commands. Type "exit<RETURN>" to leave the HBase Shell Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013 hbase(main):001:0> Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 113. Create a table and insert data in HBase hbase(main):009:0> create 'test', 'cf' 0 row(s) in 1.0830 seconds hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1' 0 row(s) in 0.0750 seconds hbase(main):011:0> scan 'test' ROW COLUMN+CELL row1 column=cf:a, timestamp=1375363287644, value=val1 1 row(s) in 0.0640 seconds hbase(main):002:0> get 'test', 'row1' COLUMN CELL cf:a timestamp=1375363287644, value=val1 1 row(s) in 0.0370 seconds Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
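To round out the CRUD operations mentioned in the HBase features slide, a short sketch of updating and removing data in the same 'test' table (values are illustrative; dropping a table requires disabling it first):

hbase(main):003:0> put 'test', 'row1', 'cf:a', 'val2'
hbase(main):004:0> delete 'test', 'row1', 'cf:a'
hbase(main):005:0> disable 'test'
hbase(main):006:0> drop 'test'

A second put to the same row and column simply writes a new version of the cell, which is how updates work in HBase.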
  • 114. Using Data Browsers in Hue for HBase Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 115. Using Data Browsers in Hue for HBase Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 116. Recommendation to Further Study Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 117. Thank you www.imcinstitute.com www.facebook.com/imcinstitute Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013