Hands on Lab
Crunch Big Data in the Cloud with
IBM BigInsights and Hadoop
IBD-3475
Leons Petrazickis, IBM Canada
@leonsp

Table of Contents
Overview ............................................................................................................................. 3
Objectives ........................................................................................................................... 3
Pre-requisites....................................................................................................................... 4
Getting started ..................................................................................................................... 4
Part 1: Exploring HDFS ...................................................................................................... 8
Using the command line interface .................................................................................. 8
Part 2: MapReduce ............................................................................................................ 14
Running the WordCount program ................................................................................ 14
Working with Pig .............................................................................................................. 16
Working with Hive ........................................................................................................... 20
Working with Jaql ............................................................................................................. 24
Big Data University .......................................................................................................... 26
Communities ..................................................................................................................... 27
Thank You! ....................................................................................................................... 27
Acknowledgements and Disclaimers ................................................................................ 28

Overview
The overwhelming trend towards digital services, combined with cheap storage, has
generated massive amounts of data that enterprises need to effectively gather, process,
and analyze. Techniques from the data warehousing and high-performance computing
communities are invaluable for many enterprises. However, oftentimes their cost or
complexity of scale-up discourages the accumulation of data without an immediate need.
As valuable knowledge may nevertheless be buried in this data, related scaled-up
technologies have been developed. Examples include Google’s MapReduce, and the
open-source implementation, Apache Hadoop.
Hadoop is an open-source project administered by the Apache Software Foundation.
Hadoop’s contributors work for some of the world’s biggest technology companies. That
diverse, motivated community has produced a collaborative platform for consolidating,
combining and understanding data.
Technically, Hadoop consists of two key services: data storage using the Hadoop
Distributed File System (HDFS) and large scale parallel data processing using a
technique called MapReduce.

Objectives
After completing this hands-on lab, you’ll be able to:

• Use Hadoop commands to explore HDFS on the Hadoop system
• Use Hadoop commands to run a sample MapReduce program on the Hadoop system
• Explore Pig, Hive, and Jaql

Pre-requisites
You should be familiar with working with the IM Demo Cloud as explained in Lab 0 –
Setup. IBM InfoSphere BigInsights has been configured on the IM Demo Cloud as
follows:

• Hadoop (using BigInsights Enterprise) is installed in pseudo-distributed mode
• Sample open-source MapReduce examples and code are included
• The Eclipse IDE is set up, though it may not be needed for this lab

Getting started
Request a lab environment for yourself: http://bit.ly/requestLab
Accept the email invite to IM Demo Cloud.
After starting the Hadoop cluster from the IM Demo Cloud, a process that takes about 7
to 10 minutes, let the instructor know that your cluster is ready. The instructor will preload
the data files used in this lab on your system.
Once the instructor has loaded the data files, click on the Launch Remote Desktop
Connection (VNC) icon on the IM Demo Cloud dashboard (cluster tab).

Click on the red icon in the Firefox address bar to enable Java.

Accept that you want to Run the application:

This will open an X11 window as shown in the figure below:

Login with the following credentials:
Username: biadmin
Password: passw0rd

Click on the terminal icon in the X11 window:

This should open up a terminal as follows:

1. Change to the bin directory under $BIGINSIGHTS_HOME (which by default is set to
/opt/ibm/biginsights):
cd $BIGINSIGHTS_HOME/bin
or
cd /opt/ibm/biginsights/bin
2. The BigInsights server in the IM Demo Cloud has most Hadoop components
(daemons) started for you. You can practice how to stop and start all components
with these commands. Please note they will take a few minutes to run, so if you
prefer, you can skip this step:

./stop-all.sh
./start-all.sh
3. The following figure shows the different Hadoop components starting.

NOTE: You may get an error that the server has not started. Please be
patient, as it takes some time for the server to finish starting.
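If you want to double-check which Hadoop daemons are actually running, one optional approach (a sketch that is not part of the official lab steps, and assumes the JDK's jps utility is on the biadmin user's PATH) is to list the running Java processes:

> jps

If Hadoop is up, daemons such as NameNode, DataNode, JobTracker, TaskTracker, and SecondaryNameNode should appear in the list.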

4. If you want to start or stop only a few components one at a time, use start.sh or
stop.sh respectively. For example, to start and stop Hadoop use:

./start.sh hadoop
./stop.sh hadoop
To start oozie, console, derby and sheets, use:
./start.sh oozie
./start.sh console
./start.sh derby
./start.sh sheets

Part 1: Exploring HDFS
The Hadoop Distributed File System (HDFS) allows user data to be organized in the form of
files and directories. It provides a command line interface, called the FS shell, that lets a user
interact with the data in HDFS that is accessible to Hadoop MapReduce programs.
There are two methods to interact with HDFS:
1. You can use the command-line approach and invoke the FileSystem (fs) shell using
the format: hadoop fs <args>. This is the method we will use in this lab.
2. You can also manipulate HDFS using the BigInsights Web Console. You will
explore the BigInsights Web Console in another lab.

Using the command line interface
In this part, we will explore some basic HDFS commands. All HDFS commands start
with hadoop followed by dfs (distributed file system) or fs (file system) followed by a
dash, and the command. Many HDFS commands are similar to UNIX commands. For
details, refer to the Hadoop Command Guide and Hadoop FS Shell Guide.
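As a quick reference, every FS shell invocation follows the same general shape. This is only a schematic; <command> and <args> stand in for the real values:

> hadoop fs -<command> <args>
> hadoop dfs -<command> <args>

All of the commands used below (-ls, -mkdir, -put, -cat, and so on) follow this pattern.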
We will start with the hadoop fs -ls command, which returns the list of files and
directories with permission information.
Usage: hadoop fs -ls <args>
For a file, it returns stat on the file in the following format:
permissions, number_of_replicas, userid, groupid, filesize, modification_date,
modification_time, filename
For a directory, it returns the list of its direct children, as in UNIX. A directory is listed as:
permissions, userid, groupid, modification_date, modification_time, dirname
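To make that format concrete, a single file entry printed by hadoop fs -ls looks roughly like the line below. This is an illustrative sketch only; the owner, size, and timestamp on your cluster will differ:

-rw-r--r--   3 biadmin biadmin   18 2013-10-01 12:00 /user/biadmin/README

Reading left to right: permissions, number of replicas, owner, group, file size in bytes, modification date and time, and the file name.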

Ensure the Hadoop components are all started, and from the same MindTerm terminal
window as before (and logged on as biadmin), follow these instructions:
5. To list the contents of the root directory, try:
> hadoop fs -ls /

6. To list the contents of the /user/biadmin directory, try:
> hadoop fs -ls
or
> hadoop fs -ls /user/biadmin

Note that the first command referenced no directory, but it is equivalent to the second
command, where /user/biadmin is explicitly specified. Each user gets their own home
directory under /user. For example, in the case of user biadmin, the home directory is
/user/biadmin. Any command that does not specify an explicit directory is relative to the
user’s home directory.

7. To create the directory myTestDir, you can issue the following command:
> hadoop fs -mkdir myTestDir

Where was this directory created? As mentioned in the previous step, any relative path
is resolved against the user’s home directory; therefore, if you issue either of the following
commands, you will see the subdirectory myTestDir:
> hadoop fs -ls
or
> hadoop fs -ls /user/biadmin

If you specify a relative path to hadoop fs commands, they will implicitly be
relative to your user directory in HDFS. For example when you created the
directory myTestDir, it was created in the /user/biadmin directory.

8. To use an HDFS command recursively, you generally add an “r” to the command
(in the Linux shell this is generally done with the “-R” argument). For example, to do
a recursive listing we’ll use the -lsr command rather than just -ls. Try this:
> hadoop fs -ls /user
> hadoop fs -lsr /user

9. You can pipe (using the | character) the output of any HDFS command to other Linux
shell commands. For example, you can easily use grep with HDFS by doing the following:
> hadoop fs -mkdir /user/biadmin/myTestDir2
> hadoop fs -ls /user/biadmin | grep Test

As you can see, the grep command only returned the lines containing Test
(thus removing the “Found x items” line and the .staging and oozie-biad directories
from the listing).
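The same piping idea works with any other Linux filter. For example, here is a small sketch (not part of the lab steps) that counts how many entries are listed under /user/biadmin, after stripping the "Found x items" header line:

> hadoop fs -ls /user/biadmin | grep -v "Found" | wc -l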
10. To move files between your regular Linux filesystem and HDFS, you can use the put
and get commands. For example, move the text file README to the Hadoop
filesystem:
> hadoop fs -put /home/biadmin/bootcamp/input/lab01_HadoopCore/HDFS/README README
> hadoop fs -ls /user/biadmin

You should now see a new file called /user/biadmin/README in the listing.
Note the ‘3’ in the listing for this file: it is the replication factor, so
blocks of this file are replicated in 3 places.
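The get command works in the opposite direction, copying a file from HDFS back to the local filesystem. As a sketch (the local target path here is just an example), you could verify the upload by pulling the file back down:

> hadoop fs -get README /tmp/README.local
> cat /tmp/README.local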
11. To view the contents of this file, use the -cat command as follows:
> hadoop fs -cat README

You should see the contents of the README file (as stored in HDFS). We can also
use the Linux diff command to check whether the file we put on HDFS is actually the same as
the original on the local filesystem. You can do this as follows:
> cd bootcamp/input/lab01_HadoopCore/HDFS

> diff <( hadoop fs -cat README ) README

Since the diff command produces no output we know that the files are the same (the
diff command prints all the lines in the files that differ).
12. To find the size of files, you need to use the -du or -dus commands. Keep in mind
that these commands return the file size in bytes. To find the size of the README
file, use the following command:
> hadoop fs -du README

In this example, the README file has 18 bytes.
13. To find the size of each file individually in the /user/biadmin directory, use the following
command:
> hadoop fs -du /user/biadmin

14. To find the total size of all files in the /user/biadmin directory, use the following
command:
> hadoop fs -dus /user/biadmin

15. If you would like to get more information about hadoop fs commands, invoke -help
as follows:
> hadoop fs -help

16. For specific help on a command, add the command name after help. For example, to
get help on the dus command, you’d do the following:
> hadoop fs -help dus

Part 2: MapReduce
Now that we’ve seen how the FileSystem (fs) shell can be used to execute Hadoop
commands to interact with HDFS, the same shell can be used to launch MapReduce
jobs. In this section, we will walk through the steps required to run a MapReduce
program. The code for a MapReduce program is packaged in a compiled .jar file.
Hadoop will load the JAR into HDFS and distribute it to the data nodes, where the
individual tasks of the MapReduce job will be executed. Hadoop ships with some
example MapReduce programs to run. One of these is a distributed WordCount program,
which reads text files and counts how often words occur.
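In general, any MapReduce program packaged this way is launched with the same command shape; the names in angle brackets below are placeholders, not literal values:

> hadoop jar <path-to-jar> <program-name> <input-path> <output-path>

The WordCount invocation in step 18 below follows exactly this pattern.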

Running the WordCount program
17. First we need to copy the data files from the local file system to HDFS:
> hadoop fs -mkdir /user/biadmin/input
> hadoop fs -put /home/biadmin/bootcamp/input/lab01_HadoopCore/MapReduce/*.csv /user/biadmin/input

Verify that the files have been copied with:
> hadoop fs -ls input

18. Now we can run the wordcount job with the following command, where
“/user/biadmin/input/” is where the input files are, and “output” is the directory where
the output of the job will be stored. The “output” directory will be created
automatically when executing the command below.
> hadoop jar /opt/ibm/biginsights/IHC/hadoop-examples-1.0.3.jar wordcount /user/biadmin/input/ output

...

19. Now review the output of the wordcount job:
> hadoop fs -ls output

In this case, the output was not split into multiple files. To view the contents of the
part-r-00000 file, issue:
> hadoop fs -cat output/*00

…
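If you would rather inspect the result on the local filesystem, one option (a sketch that is not part of the official lab steps; the local file name is just an example) is to merge the output part files into a single local file:

> hadoop fs -getmerge output /tmp/wordcount.txt
> head /tmp/wordcount.txt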

Note:
You can use the BigInsights Web Console to run applications such as WordCount.
This same application (though with different input files) will be run again in the lab
describing the BigInsights Web Console. More detail about the job will also be
described then.

Working with Pig
In this tutorial, we are going to use Apache Pig to process the 1988 subset of the Google
Books 1-gram records to produce a histogram of the frequencies of words of each
length. A subset of this database (0.5 million records) has been stored in the file
googlebooks-1988.csv under the
/home/biadmin/bootcamp/input/lab01_HadoopCore/PigHiveJaql directory.
1. Let us examine the format of the Google Books 1-gram records:

> cd /home/biadmin/bootcamp/input/lab01_HadoopCore/PigHiveJaql
> head -5 googlebooks-1988.csv

The columns these data represent are the word, the year, the number of occurrences
of that word in the corpus, the number of pages on which that word appeared, and
the number of books in which that word appeared.
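In other words, each line of the file holds five fields in this order (a schematic of the layout, not actual rows from the file; the fields appear to be tab-separated, which is why the default Pig loader below and the '\t' delimiter in the Hive section work despite the .csv extension):

word <TAB> year <TAB> wordcount <TAB> pagecount <TAB> bookcount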
2. Copy the data file into HDFS.
> hadoop fs -put googlebooks-1988.csv pighivejaql/googlebooks-1988.csv
Note that directory /user/biadmin/pighivejaql is created automatically for you when
the above command is executed.
3. Start pig. If it has not been added to the PATH, you can add it, or switch to the
$PIG_HOME/bin directory
> cd $PIG_HOME/bin
> ./pig
grunt>

4. We are going to use a Pig UDF to compute the length of each word. The UDF is located
inside the piggybank.jar file (This jar file was created from the source, following the
instructions in https://cwiki.apache.org/confluence/display/PIG/PiggyBank, and copied

to the piggybank directory). We use the REGISTER command to load this jar file:
grunt> REGISTER /opt/ibm/biginsights/pig/contrib/piggybank/java/piggybank.jar;
The first step in processing the data is to LOAD it.
grunt> records = LOAD 'pighivejaql/googlebooks-1988.csv' AS (word:chararray, year:int, wordcount:int, pagecount:int, bookcount:int);
This returns instantly. The processing is delayed until the data needs to be reported.
To produce a histogram, we want to group by the length of the word:
grunt> grouped = GROUP records by org.apache.pig.piggybank.evaluation.string.LENGTH(word);
Sum the word counts for each word length using the SUM function with the FOREACH
GENERATE command.
grunt> final = FOREACH grouped GENERATE group, SUM(records.wordcount);
Use the DUMP command to print the result to the console. This will cause all the
previous steps to be executed.
grunt> DUMP final;
This should produce output like the following:

5. Quit pig.
grunt> quit

Working with Hive
In this tutorial, we are going to use Hive to process the 1988 subset of the Google Books
1-gram records to produce a histogram of the frequencies of words of each length. A
subset of this database (0.5 million records) has been stored in the file googlebooks-1988.csv
under the /home/biadmin/bootcamp/input/lab01_HadoopCore/PigHiveJaql
directory.
1. Ensure the Apache Derby component is started. Apache Derby is the default
database used as the metastore in Hive. A quick way to verify whether it is started
is to try to start it:
> start.sh derby

2. Start hive interactively.
> cd $HIVE_HOME/bin
> ./hive
hive>

3. Create a table called wordlist.
hive> CREATE TABLE wordlist (word STRING, year INT, wordcount INT, pagecount INT, bookcount INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

4. Load the data from the googlebooks-1988.csv file into the wordlist table.
hive> LOAD DATA LOCAL INPATH '/home/biadmin/bootcamp/input/lab01_HadoopCore/PigHiveJaql/googlebooks-1988.csv' OVERWRITE INTO TABLE wordlist;

5. Create a table named wordlengths to store the counts for each word length for
our histogram.
hive> CREATE TABLE wordlengths (wordlength INT, wordcount INT);

6. Fill the wordlengths table with word length data from the wordlist table calculated
with the length function.
hive> INSERT OVERWRITE TABLE wordlengths SELECT length(word), wordcount FROM wordlist;

7. Produce the histogram by summing the word counts grouped by word length.
hive> SELECT wordlength, sum(wordcount) FROM wordlengths GROUP BY wordlength;

8. Quit hive.
hive> quit;

Working with Jaql
In this tutorial, we are going to use Jaql to process the 1988 subset of the Google Books
1-gram records to produce a histogram of the frequencies of words of each length. A
subset of this database (0.5 million records) has been stored in the file googlebooks-1988.del
under the /home/biadmin/bootcamp/input/lab01_HadoopCore/PigHiveJaql
directory.
1. Let us examine the format of the Google Books 1-gram records:
> cd /home/biadmin/bootcamp/input/lab01_HadoopCore/PigHiveJaql
> head -5 googlebooks-1988.del

The columns these data represent are the word, the year, the number of occurrences
of that word in the corpus, the number of pages on which that word appeared, and
the number of books in which that word appeared.
2. Copy the googlebooks-1988.del file to HDFS.
> hadoop fs -put googlebooks-1988.del googlebooks-1988.del

3. Start the Jaql shell using the -c option to use the running Hadoop cluster.
> cd $JAQL_HOME/bin
> ./jaqlshell -c
jaql>

4. Read the comma delimited file from HDFS.
jaql> $wordlist = read(del("googlebooks-1988.del", { schema: schema { word: string, year: long, wordcount: long, pagecount: long, bookcount: long } }));

5. Transform each word into its length by applying the strLen function.
jaql> $wordlengths = $wordlist -> transform { wordlength: strLen($.word), wordcount: $.wordcount };

6. Produce the histogram by summing the word counts grouped by word length.
jaql> $wordlengths -> group by $word = {$.wordlength} into { $word.wordlength, counts: sum($[*].wordcount) };
This should produce output like the following:

7. Quit Jaql.
jaql> quit;

Note:
Another lab in this Bootcamp will discuss Jaql in more detail.

Big Data University
Materials for this lab have been adapted from the Hadoop Essentials course on
http://BigDataUniversity.com. You are encouraged to follow the course for further study.

Communities
• On-line communities, User Groups, Technical Forums, Blogs, Social
networks, and more
o Find the community that interests you …
• Information Management bit.ly/InfoMgmtCommunity
• Business Analytics bit.ly/AnalyticsCommunity
• Enterprise Content Management bit.ly/ECMCommunity
• IBM Champions
o Recognizing individuals who have made the most outstanding
contributions to Information Management, Business Analytics, and
Enterprise Content Management communities
• ibm.com/champion

Thank You!

Your Feedback is Important!
• Access the Conference Agenda Builder to complete your session
surveys
o Any web or mobile browser at
http://iod13surveys.com/surveys.html
o Any Agenda Builder kiosk onsite

Acknowledgements and Disclaimers:
Availability: References in this presentation to IBM products, programs, or services do not
imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and
reflect their own views. They are provided for informational purposes only, and are neither
intended to, nor shall have the effect of being, legal or other guidance or advice to any
participant. While efforts were made to verify the completeness and accuracy of the
information contained in this presentation, it is provided AS-IS without warranty of any kind,
express or implied. IBM shall not be responsible for any damages arising out of the use of, or
otherwise related to, this presentation or any other materials. Nothing contained in this
presentation is intended to, nor shall have the effect of, creating any warranties or
representations from IBM or its suppliers or licensors, or altering the terms and conditions of
the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have
used IBM products and the results they may have achieved. Actual environmental costs and
performance characteristics may vary by customer. Nothing contained in these materials is
intended to, nor shall have the effect of, stating or implying that any activities undertaken by
you will result in any specific sales, revenue growth or other results.
© Copyright IBM Corporation 2013. All rights reserved.
• U.S. Government Users Restricted Rights - Use, duplication or disclosure
restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, InfoSphere and BigInsights are trademarks or
registered trademarks of International Business Machines Corporation in the
United States, other countries, or both. If these and other IBM trademarked terms
are marked on their first occurrence in this information with a trademark symbol
(® or ™), these symbols indicate U.S. registered or common law trademarks
owned by IBM at the time this information was published. Such trademarks may
also be registered or common law trademarks in other countries. A current list of
IBM trademarks is available on the Web at “Copyright and trademark
information” at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of
others.

28

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jkEdureka!
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeAdam Kawa
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answersKalyan Hadoop
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Adam Kawa
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)mundlapudi
 
Architecture of Hadoop
Architecture of HadoopArchitecture of Hadoop
Architecture of HadoopKnoldus Inc.
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxAlex Moundalexis
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemVaibhav Jain
 
Hadoop installation by santosh nage
Hadoop installation by santosh nageHadoop installation by santosh nage
Hadoop installation by santosh nageSantosh Nage
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoopShashwat Shriparv
 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationAdam Kawa
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 

Was ist angesagt? (20)

Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jk
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)
 
Architecture of Hadoop
Architecture of HadoopArchitecture of Hadoop
Architecture of Hadoop
 
Upgrading hadoop
Upgrading hadoopUpgrading hadoop
Upgrading hadoop
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via Linux
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop installation by santosh nage
Hadoop installation by santosh nageHadoop installation by santosh nage
Hadoop installation by santosh nage
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoop
 
Hadoop administration
Hadoop administrationHadoop administration
Hadoop administration
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS Federation
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
 

Ähnlich wie IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab steps

Big Data: Explore Hadoop and BigInsights self-study lab
Big Data:  Explore Hadoop and BigInsights self-study labBig Data:  Explore Hadoop and BigInsights self-study lab
Big Data: Explore Hadoop and BigInsights self-study labCynthia Saracco
 
UserGuideHDFS_FinalDocument
UserGuideHDFS_FinalDocumentUserGuideHDFS_FinalDocument
UserGuideHDFS_FinalDocumentAnna Ellis
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentFarzad Nozarian
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorialvinayiqbusiness
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Simplilearn
 
Rh202 q&amp;a-demo-cert magic
Rh202 q&amp;a-demo-cert magicRh202 q&amp;a-demo-cert magic
Rh202 q&amp;a-demo-cert magicEllina Beckman
 
250hadoopinterviewquestions
250hadoopinterviewquestions250hadoopinterviewquestions
250hadoopinterviewquestionsRamana Swamy
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorialvinayiqbusiness
 
Configure h base hadoop and hbase client
Configure h base hadoop and hbase clientConfigure h base hadoop and hbase client
Configure h base hadoop and hbase clientShashwat Shriparv
 
Design and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryDesign and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryIJRESJOURNAL
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapakapa rohit
 
Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1Giovanna Roda
 
RH-302 Exam-Red Hat Certified Engineer on Redhat Enterprise Linux 4 (Labs)
RH-302 Exam-Red Hat Certified Engineer on Redhat Enterprise Linux 4 (Labs)RH-302 Exam-Red Hat Certified Engineer on Redhat Enterprise Linux 4 (Labs)
RH-302 Exam-Red Hat Certified Engineer on Redhat Enterprise Linux 4 (Labs)Isabella789
 
Hadoop Installation presentation
Hadoop Installation presentationHadoop Installation presentation
Hadoop Installation presentationpuneet yadav
 
Configuring and manipulating HDFS files
Configuring and manipulating HDFS filesConfiguring and manipulating HDFS files
Configuring and manipulating HDFS filesRupak Roy
 
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudBest Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudLeons Petražickis
 
Webinar: Top 5 Hadoop Admin Tasks
Webinar: Top 5 Hadoop Admin TasksWebinar: Top 5 Hadoop Admin Tasks
Webinar: Top 5 Hadoop Admin TasksEdureka!
 
Top 10 Hadoop Shell Commands
Top 10 Hadoop Shell Commands Top 10 Hadoop Shell Commands
Top 10 Hadoop Shell Commands SimoniShah6
 
RH302 Exam-Red Hat Linux Certification
RH302 Exam-Red Hat Linux CertificationRH302 Exam-Red Hat Linux Certification
RH302 Exam-Red Hat Linux CertificationIsabella789
 

Ähnlich wie IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab steps (20)

Big Data: Explore Hadoop and BigInsights self-study lab
Big Data:  Explore Hadoop and BigInsights self-study labBig Data:  Explore Hadoop and BigInsights self-study lab
Big Data: Explore Hadoop and BigInsights self-study lab
 
BIGDATA ANALYTICS LAB MANUAL final.pdf
BIGDATA  ANALYTICS LAB MANUAL final.pdfBIGDATA  ANALYTICS LAB MANUAL final.pdf
BIGDATA ANALYTICS LAB MANUAL final.pdf
 
UserGuideHDFS_FinalDocument
UserGuideHDFS_FinalDocumentUserGuideHDFS_FinalDocument
UserGuideHDFS_FinalDocument
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab Assignment
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
Rh202 q&amp;a-demo-cert magic
Rh202 q&amp;a-demo-cert magicRh202 q&amp;a-demo-cert magic
Rh202 q&amp;a-demo-cert magic
 
250hadoopinterviewquestions
250hadoopinterviewquestions250hadoopinterviewquestions
250hadoopinterviewquestions
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Configure h base hadoop and hbase client
Configure h base hadoop and hbase clientConfigure h base hadoop and hbase client
Configure h base hadoop and hbase client
 
Design and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryDesign and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on Raspberry
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1
 
RH-302 Exam-Red Hat Certified Engineer on Redhat Enterprise Linux 4 (Labs)
RH-302 Exam-Red Hat Certified Engineer on Redhat Enterprise Linux 4 (Labs)RH-302 Exam-Red Hat Certified Engineer on Redhat Enterprise Linux 4 (Labs)
RH-302 Exam-Red Hat Certified Engineer on Redhat Enterprise Linux 4 (Labs)
 
Hadoop Installation presentation
Hadoop Installation presentationHadoop Installation presentation
Hadoop Installation presentation
 
Configuring and manipulating HDFS files
Configuring and manipulating HDFS filesConfiguring and manipulating HDFS files
Configuring and manipulating HDFS files
 
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudBest Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
 
Webinar: Top 5 Hadoop Admin Tasks
Webinar: Top 5 Hadoop Admin TasksWebinar: Top 5 Hadoop Admin Tasks
Webinar: Top 5 Hadoop Admin Tasks
 
Top 10 Hadoop Shell Commands
Top 10 Hadoop Shell Commands Top 10 Hadoop Shell Commands
Top 10 Hadoop Shell Commands
 
RH302 Exam-Red Hat Linux Certification
RH302 Exam-Red Hat Linux CertificationRH302 Exam-Red Hat Linux Certification
RH302 Exam-Red Hat Linux Certification
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Kürzlich hochgeladen (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab steps

  • 1. Hands on Lab Crunch Big Data in the Cloud with IBM BigInsights and Hadoop IBD-3475 Leons Petrazickis, IBM Canada @leonsp 1
  • 2. Table of Contents Overview ............................................................................................................................. 3 Objectives ........................................................................................................................... 3 Pre-requisites....................................................................................................................... 4 Getting started ..................................................................................................................... 4 Part 1: Exploring HDFS ...................................................................................................... 8 Using the command line interface .................................................................................. 8 Part 2: MapReduce ............................................................................................................ 14 Running the WordCount program ................................................................................ 14 Working with Pig .............................................................................................................. 16 Working with Hive ........................................................................................................... 20 Working with Jaql ............................................................................................................. 24 Big Data University .......................................................................................................... 26 Communities ..................................................................................................................... 27 Thank You! ....................................................................................................................... 27 Acknowledgements and Disclaimers ................................................................................ 28 2
  • 3. Overview The overwhelming trend towards digital services, combined with cheap storage, has generated massive amounts of data that enterprises need to effectively gather, process, and analyze. Techniques from the data warehousing and high-performance computing communities are invaluable for many enterprises. However, often times their cost or complexity of scale-up discourages the accumulation of data without an immediate need. As valuable knowledge may nevertheless be buried in this data, related scaled-up technologies have been developed. Examples include Google’s MapReduce, and the open-source implementation, Apache Hadoop. Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop’s contributors work for some of the world’s biggest technology companies. That diverse, motivated community has produced a collaborative platform for consolidating, combining and understanding data. Technically, Hadoop consists of two key services: data storage using the Hadoop Distributed File System (HDFS) and large scale parallel data processing using a technique called MapReduce. Objectives After completing this hands-on lab, you’ll be able to:  Use Hadoop commands to explore the HDFS on the Hadoop system  Use Hadoop commands to run a sample MapReduce program on the Hadoop system  Explore Pig, Hive and Jaql 3
  • 4. Pre-requisites You should be familiar with working with the IM Demo Cloud as explained in Lab 0 – Setup. IBM InfoSphere BigInsights has been configured at the IM Demo Cloud as follows:  Hadoop (using BigInsights Enterprise) is installed in pseudo-distributed mode  Sample open source MapReduce examples and codes  Eclipse IDE is setup, though it may not be needed for this lab. Getting started Request a lab environment for yourself: http://bit.ly/requestLab Accept the email invite to IM Demo Cloud. After starting the Hadoop cluster from the IM Demo Cloud, a process that takes about 7 to 10 minutes, let the instructor know that your cluster is read. The instructor will preload the data files used in this lab on your system. Once the instructor has loaded the data files, click on the Launch Remote Desktop Connection (VNC) icon on the IM Demo Cloud dashboard (cluster tab). Click on the red icon in the Firefox address bar to enable Java. 4
  • 5. Accept that you want to Run the application: This will open an x11 window as shown in the figure below: 5
  • 6. Login with the following credentials: Username: Password: biadmin passw0rd Click on the terminal icon in the x11 window: This should open up a terminal as follows: 1. Change to the $BIGINSIGHTS_HOME (which by default is set to /opt/ibm/biginsights). cd $BIGINSIGHTS_HOME/bin or cd /opt/ibm/biginsights/bin 2. The BigInsights server in the IM Demo Cloud has most Hadoop components (daemons) started for you. You can practice how to stop and start all components with these commands. Please note they will take a few minutes to run, so if you prefer, you can skip this step: 6
  • 7. ./stop-all.sh ./start-all.sh 3. The following figure shows the different Hadoop components starting. NOTE: You may get an error that the server has not started, please be patient as it does take some time for the server to complete start. 4. If you want to start or stop only a few components one at a time, use start.sh or stop.sh respectively. For example, to start and stop Hadoop use: 7
  • 8. ./start.sh hadoop ./stop.sh hadoop To start oozie, console, derby and sheets, use: ./start.sh ./start.sh ./start.sh ./start.sh oozie console derby sheets Part 1: Exploring HDFS Hadoop Distributed File System, HDFS, allows user data to be organized in the form of files and directories. It provides a command line interface called FS shell that lets a user interact with the data in HDFS accessible to Hadoop MapReduce programs. There are two methods to interact with HDSF: 1. You an use the command-line approach and invoke the FileSystem (fs) shell using the format: hadoop fs <args>. This is the method we will use in this lab. 2. You can also manipulate HDFS using the BigInsights Web Console. You will explore the BigInsights Web Console on another lab. Using the command line interface In this part, we will explore some basic HDFS commands. All HDFS commands start with hadoop followed by dfs (distributed file system) or fs (file system) followed by a dash, and the command. Many HDFS commands are similar to UNIX commands. For details, refer to the Hadoop Command Guide and Hadoop FS Shell Guide. We will start with the hadoop fs –ls command which returns the list of files and directories with permission information. Usage: hadoop fs –ls <args> For a file returns stat on the file with the following format: permissions, number_of_replicas, userid, groupid, filesize, modification_date, modification_time, filename For a directory it returns list of its directory children as in UNIX. A directory is listed as permissions, userid, groupid, modification_date, modification_time, dirname Ensure the Hadoop components are all started, and from the same MindTerm terminal window as before (and logged on as biadmin), follow these instructions: 5. To list the contents of the root directory, try: > hadoop fs –ls / 8
  • 9. 6. To list the contents of the /user/biadmin directory, try: > hadoop fs –ls or > hadoop fs –ls /user/biadmin Note that in the first command there was no directory referenced, but it is equivalent to the second command where /user/biadmin is explicitly specified. Each user will get its own home directory under /user. For example, in the case of user biadmin, his home directory is /user/biadmin. Any command where there is no explicit directory specified will be relative to the user’s home directory. 7. To create the directory myTestDir you can issue the following command: > hadoop fs –mkdir myTestDir Where was this directory created? As mentioned in the previous step, any relative paths will be using the user’s home directory; therefore if you issue the following command, you will see subdirectory myTestDir > hadoop fs –ls or 9
  • 10. > hadoop fs –ls /user/biadmin If you specify a relative path to hadoop fs commands, they will implicitly be relative to your user directory in HDFS. For example when you created the directory myTestDir, it was created in the /user/biadmin directory. 8. To use HDFS commands recursively generally you add an “r” to the HDFS command (In the Linux shell this is generally done with the “-R” argument). For example, to do a recursive listing we’ll use the –lsr command rather than just –ls. Try this: > > hadoop fs –ls /user hadoop fs –lsr /user 9. You can pipe (using the | character) any HDFS command to be used with the Linux shell. For example, you can easily use grep with HDFS by doing the following: > > hadoop fs –mkdir /user/biadmin/myTestDir2 hadoop fs –ls /user/biadmin | grep Test 10
  • 11. As you can see the grep command only returned the lines which had test in them (thus removing the “Found x items” line and the .staging and oozie-biad directories from the listing 10. To move files between your regular Linux filesystem and HDFS you can use the put and get commands. For example, move the text file README to the hadoop filesystem. > hadoop fs –put /home/biadmin/bootcamp/input/lab01_HadoopCore/HDFS/README README > hadoop fs –ls /user/biadmin You should now see a new file called /user/biadmin/README listed as shown above. Note there is a ‘3’ highlighted in the figure. This represents the replication factor so blocks of this file are replicated in 3 places. 11. In order to view the contents of this file use the –cat command as follows: > hadoop fs –cat README You should see the output of the README file (that is stored in HDFS). We can also use the linux diff command to see if the file we put on HDFS is actually the same as the original on the local filesystem. You can do this as follows: > cd bootcamp/input/lab01_HadoopCore/HDFS 11
  • 12. > diff <( hadoop fs -cat README ) README Since the diff command produces no output we know that the files are the same (the diff command prints all the lines in the files that differ). 12. To find the size of files you need to use the –du or –dus commands. Keep in mind that these commands return the file size in bytes. To find the size of the README file use the following command: > hadoop fs –du README In this example, the README file has 18 bytes. 13. To find the size of all files individually in the /user/biadmin directory use the following command: > hadoop fs –du /user/biadmin 14. To find the size of all files in total of the /user/biadmin directory use the following command: > hadoop fs –dus /user/biadminf 12
  • 13. 15. If you would like to get more information about hadoop fs commands, invoke –help as follows: > hadoop fs –help 16. For specific help on a command, add the command name after help. For example, to get help on the dus command you’d do the following: > hadoop fs –help dus 13
  • 14. Part 2: MapReduce Now that we’ve seen how the FileSystem (fs) shell can be used to execute Hadoop commands to interact with HDFS, the same fs shell can be used to launch MapReduce jobs. In this section, we will walk through the steps required to run a MapReduce program. The source code for a MapReduce program is contained in a compiled .jar file. Hadoop will load the JAR into HDFS and distribute it to the data nodes, where the individual tasks of the MapReduce job will be executed. Hadoop ships with some example MapReduce programs to run. One of these is a distributed WordCount program which reads text files and counts how often words occur. Running the WordCount program 17. First we need to copy the data files from the local file system to HDFS: > hadoop fs –mkdir /user/biadmin/input > hadoop fs -put /home/biadmin/bootcamp/input/lab01_HadoopCore/MapReduce/*.csv /user/biadmin/input Review the files have been copied with > hadoop fs –ls input 18. Now we can run the wordcount job with the following command, where “/user/biadmin/input/” is where the input files are, and “output” is the directory where the output of the job will be stored. The “output” directory will be created automatically when executing the command below. > hadoop jar /opt/ibm/biginsights/IHC/hadoop-examples1.0.3.jar wordcount /user/biadmin/input/ output 14
19. Now review the output of the previous step:

> hadoop fs -ls output

In this case, the output was not split into multiple files. To view the contents of the part-r-00000 file, issue:

> hadoop fs -cat output/*00
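If a job produces several part-r-* files, you do not have to cat them one by one. As an illustrative extra step (the local path /tmp/wordcount.txt is our own choice, not part of the lab setup), -getmerge concatenates every file in an HDFS directory into a single local file:

> hadoop fs -getmerge output /tmp/wordcount.txt
> head /tmp/wordcount.txt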
Note: You can use the BigInsights Web Console to run applications such as WordCount. This same application (though with different input files) will be run again in the lab describing the BigInsights Web Console, and more detail about the job will be given there.

Working with Pig
In this tutorial, we are going to use Apache Pig to process the 1988 subset of the Google Books 1-gram records and produce a histogram of word frequencies by word length. A subset of this dataset (0.5 million records) has been stored in the file googlebooks-1988.csv under the /home/biadmin/bootcamp/input/lab01_HadoopCore/PigHiveJaql directory.

1. Let us examine the format of the Google Books 1-gram records:
> cd /home/biadmin/bootcamp/input/lab01_HadoopCore/PigHiveJaql
> head -5 googlebooks-1988.csv

The columns represent the word, the year, the number of occurrences of that word in the corpus, the number of pages on which that word appeared, and the number of books in which that word appeared.

2. Copy the data file into HDFS:

> hadoop fs -put googlebooks-1988.csv pighivejaql/googlebooks-1988.csv

Note that the directory /user/biadmin/pighivejaql is created automatically for you when the above command is executed.

3. Start Pig. If it has not been added to the PATH, you can add it, or switch to the $PIG_HOME/bin directory:

> cd $PIG_HOME/bin
> ./pig
grunt>

4. We are going to use a Pig UDF to compute the length of each word. The UDF is located inside the piggybank.jar file. (This jar file was created from the source, following the instructions at https://cwiki.apache.org/confluence/display/PIG/PiggyBank, and copied
to the piggybank directory.) We use the REGISTER command to load this jar file:

grunt> REGISTER /opt/ibm/biginsights/pig/contrib/piggybank/java/piggybank.jar;

The first step in processing the data is to LOAD it:

grunt> records = LOAD 'pighivejaql/googlebooks-1988.csv' AS (word:chararray, year:int, wordcount:int, pagecount:int, bookcount:int);

This returns instantly; the processing is delayed until the data needs to be reported. To produce a histogram, we want to group by the length of the word:

grunt> grouped = GROUP records BY org.apache.pig.piggybank.evaluation.string.LENGTH(word);

Sum the word counts for each word length using the SUM function with the FOREACH ... GENERATE command:

grunt> final = FOREACH grouped GENERATE group, SUM(records.wordcount);

Use the DUMP command to print the result to the console. This will cause all the previous steps to be executed:

grunt> DUMP final;

This should produce a list of word lengths and their summed word counts.
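DUMP is convenient for a quick look, but for larger results you would normally write them back to HDFS instead. The following is only a sketch: the relative output path pighivejaql/histogram is our own choice and must not already exist. When you are done, leave the Grunt shell with quit before moving on:

grunt> STORE final INTO 'pighivejaql/histogram';
grunt> quit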
Working with Hive
In this tutorial, we are going to use Hive to process the 1988 subset of the Google Books 1-gram records and produce a histogram of word frequencies by word length. A subset of this dataset (0.5 million records) has been stored in the file googlebooks-1988.csv under the /home/biadmin/bootcamp/input/lab01_HadoopCore/PigHiveJaql directory.

1. Ensure the Apache Derby component is started. Apache Derby is the default database used as the metastore in Hive. A quick way to verify that it is started is to try to start it:

> start.sh derby

2. Start Hive interactively:

> cd $HIVE_HOME/bin
> ./hive
hive>

3. Create a table called wordlist:

hive> CREATE TABLE wordlist (word STRING, year INT, wordcount INT, pagecount INT, bookcount INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
4. Load the data from the googlebooks-1988.csv file into the wordlist table:

hive> LOAD DATA LOCAL INPATH '/home/biadmin/bootcamp/input/lab01_HadoopCore/PigHiveJaql/googlebooks-1988.csv' OVERWRITE INTO TABLE wordlist;

5. Create a table named wordlengths to store the count for each word length in our histogram:

hive> CREATE TABLE wordlengths (wordlength INT, wordcount INT);

6. Fill the wordlengths table with word-length data from the wordlist table, calculated with the length function:

hive> INSERT OVERWRITE TABLE wordlengths SELECT length(word), wordcount FROM wordlist;
7. Produce the histogram by summing the word counts grouped by word length:

hive> SELECT wordlength, sum(wordcount) FROM wordlengths GROUP BY wordlength;
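As an optional refinement (not part of the original lab steps), you can sort the histogram by word length with an ORDER BY clause, and then leave the Hive shell with quit:

hive> SELECT wordlength, sum(wordcount) FROM wordlengths GROUP BY wordlength ORDER BY wordlength;
hive> quit;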
Working with Jaql
In this tutorial, we are going to use Jaql to process the 1988 subset of the Google Books 1-gram records and produce a histogram of word frequencies by word length. A subset of this dataset (0.5 million records) has been stored in the file googlebooks-1988.del under the /home/biadmin/bootcamp/input/lab01_HadoopCore/PigHiveJaql directory.

1. Let us examine the format of the Google Books 1-gram records:

> cd /home/biadmin/bootcamp/input/lab01_HadoopCore/PigHiveJaql
> head -5 googlebooks-1988.del

The columns represent the word, the year, the number of occurrences of that word in the corpus, the number of pages on which that word appeared, and the number of books in which that word appeared.

2. Copy the googlebooks-1988.del file to HDFS:

> hadoop fs -put googlebooks-1988.del googlebooks-1988.del

3. Start the Jaql shell using the -c option so that it uses the running Hadoop cluster:

> cd $JAQL_HOME/bin
> ./jaqlshell -c
jaql>
4. Read the comma-delimited file from HDFS:

jaql> $wordlist = read(del("googlebooks-1988.del", { schema: schema { word: string, year: long, wordcount: long, pagecount: long, bookcount: long } }));

5. Transform each word into its length by applying the strLen function:

jaql> $wordlengths = $wordlist -> transform { wordlength: strLen($.word), wordcount: $.wordcount };

6. Produce the histogram by summing the word counts grouped by word length:

jaql> $wordlengths -> group by $word = {$.wordlength} into { $word.wordlength, counts: sum($[*].wordcount) };

This should produce a list of word lengths and their summed word counts.
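If you want to keep the result rather than just print it, Jaql can also write data back to HDFS. The following is only a sketch under the assumption that the del() I/O descriptor used for reading above can likewise be used for writing; the output name histogram.del is our own choice:

jaql> $histogram = $wordlengths -> group by $word = {$.wordlength} into { $word.wordlength, counts: sum($[*].wordcount) };
jaql> $histogram -> write(del("histogram.del"));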
7. Quit Jaql:

jaql> quit;

Note: Another lab in this Bootcamp will discuss Jaql in more detail.

Big Data University
Materials for this lab have been adapted from the Hadoop Essentials course on http://BigDataUniversity.com. You are encouraged to follow the course for further study.
Communities
• On-line communities, user groups, technical forums, blogs, social networks, and more
  o Find the community that interests you:
    - Information Management: bit.ly/InfoMgmtCommunity
    - Business Analytics: bit.ly/AnalyticsCommunity
    - Enterprise Content Management: bit.ly/ECMCommunity
• IBM Champions
  o Recognizing individuals who have made the most outstanding contributions to the Information Management, Business Analytics, and Enterprise Content Management communities
  o ibm.com/champion

Thank You!
Your feedback is important!
• Access the Conference Agenda Builder to complete your session surveys
  o Any web or mobile browser at http://iod13surveys.com/surveys.html
  o Any Agenda Builder kiosk onsite
Acknowledgements and Disclaimers:
Availability: References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2013. All rights reserved.
• U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, InfoSphere and BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.