RHive tutorial - HDFS functions
Hive stores its data in HDFS, Hadoop’s distributed file system.
Thus, to use Hive and RHive effectively,
you must be able to put, get, and remove large data sets in HDFS.
RHive provides functions that correspond to what the “hadoop fs”
command supports.
Using these functions, a user can work with HDFS from the R environment without
using the Hadoop CLI (command line interface) or the Hadoop HDFS library.
If you are more comfortable with the “hadoop” CLI or the Hadoop
library, it is also fine to use them.
But if you are not familiar with RStudio Server or with working from a terminal,
the RHive HDFS functions should prove an easy way for R users to work with
HDFS.
Before Running the Examples
The rhive.hdfs.* functions work once RHive has been installed and
library(RHive) and rhive.connect() have run successfully.
Do not forget to run the following before trying the examples.
# Open R
library(RHive)
rhive.connect()
rhive.hdfs.connect
To use the RHive HDFS functions, a connection to HDFS must be
established.
If the Hadoop configuration for HDFS is set properly and the rhive.connect
function has been executed, this function is called automatically,
so there is no need to run it separately.
If you need to connect to a different HDFS, you can do it like this:
rhive.hdfs.connect("hdfs://10.1.1.1:9000")
[1] "Java-Object{DFS[DFSClient[clientName=DFSClient_630489789, ugi=root]]}"
The connection will fail if you do not give the exact
hostname and port number of the HDFS service.
Ask the system administrator if you do not have this information.
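Because a wrong hostname or port makes the connection fail, it can be convenient to wrap the call so that a script reports the failure instead of aborting. The following is a minimal sketch and not part of RHive: try_hdfs_connect is a hypothetical helper, and it assumes that rhive.hdfs.connect signals an ordinary R error when the connection fails. The connect_fn parameter exists only so the helper can be tried without a cluster.

```r
# Hypothetical helper (not an RHive function): attempt an HDFS connection
# and return TRUE or FALSE instead of stopping the script.
# Assumption: rhive.hdfs.connect raises an R error on failure.
try_hdfs_connect <- function(uri, connect_fn = rhive.hdfs.connect) {
  tryCatch({
    connect_fn(uri)
    TRUE
  }, error = function(e) {
    # Report the reason and let the caller decide what to do next.
    message("Could not connect to ", uri, ": ", conditionMessage(e))
    FALSE
  })
}
```

With RHive loaded, try_hdfs_connect("hdfs://10.1.1.1:9000") would use the real rhive.hdfs.connect by default.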
rhive.hdfs.ls
This does the same thing as "hadoop fs -ls" and is used like this.
rhive.hdfs.ls("/")
  permission owner group      length   modify-time      file
1 rwxr-xr-x  root  supergroup 0        2011-12-07 14:27 /airline
2 rwxr-xr-x  root  supergroup 0        2011-12-07 13:16 /benchmarks
3 rw-r--r--  root  supergroup 11186419 2011-12-06 03:59 /messages
4 rwxr-xr-x  root  supergroup 0        2011-12-07 22:05 /mnt
5 rwxr-xr-x  root  supergroup 0        2011-12-13 20:24 /rhive
6 rwxr-xr-x  root  supergroup 0        2011-12-07 20:19 /tmp
7 rwxr-xr-x  root  supergroup 0        2011-12-14 01:14 /user
This is equivalent to the following Hadoop CLI command.
hadoop fs -ls /
rhive.hdfs.get
The rhive.hdfs.get function copies data from HDFS to the local file system.
It works in the same way as "hadoop fs -get".
The next example copies the messages data in HDFS to
/tmp/messages on the local system, then checks the number of records.
rhive.hdfs.get("/messages", "/tmp/messages")
[1] TRUE
system("wc -l /tmp/messages")
145889 /tmp/messages
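The record count can also be taken inside R rather than by shelling out to wc. The following is a minimal sketch using only base R. count_records is a hypothetical helper name, and it reads the whole file into memory, so it suits files of moderate size.

```r
# Hypothetical helper: count the lines (records) of a local text file.
count_records <- function(path) {
  # Fail early with a clear error if the download did not produce the file.
  stopifnot(file.exists(path))
  length(readLines(path, warn = FALSE))
}
```

After the rhive.hdfs.get call above, count_records("/tmp/messages") would return the number of lines in the downloaded copy.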
rhive.hdfs.put
The rhive.hdfs.put function uploads local data to HDFS.
It works like "hadoop fs -put" and is the opposite of rhive.hdfs.get.
The following example uploads “/tmp/messages” on the local system to
“/messages_new” in HDFS.
rhive.hdfs.put("/tmp/messages", "/messages_new")
rhive.hdfs.ls("/")
  permission owner group      length   modify-time      file
1 rwxr-xr-x  root  supergroup 0        2011-12-07 14:27 /airline
2 rwxr-xr-x  root  supergroup 0        2011-12-07 13:16 /benchmarks
3 rw-r--r--  root  supergroup 11186419 2011-12-06 03:59 /messages
4 rw-r--r--  root  supergroup 11186419 2011-12-14 02:02 /messages_new
5 rwxr-xr-x  root  supergroup 0        2011-12-07 22:05 /mnt
6 rwxr-xr-x  root  supergroup 0        2011-12-13 20:24 /rhive
7 rwxr-xr-x  root  supergroup 0        2011-12-14 01:14 /user
You can see a new file, "/messages_new", now appears in HDFS.
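Instead of reading the listing by eye, the upload can be checked programmatically. The following is a minimal sketch, not part of RHive: put_and_verify is a hypothetical helper that combines rhive.hdfs.put with rhive.hdfs.exists (described later in this tutorial). The put_fn and exists_fn parameters are there only so the helper can be exercised without a cluster.

```r
# Hypothetical helper (not an RHive function): upload a local file and
# confirm that the target path now exists in HDFS.
put_and_verify <- function(local, remote,
                           put_fn = rhive.hdfs.put,
                           exists_fn = rhive.hdfs.exists) {
  # Catch a mistyped local path before contacting HDFS.
  if (!file.exists(local)) {
    stop("Local file not found: ", local)
  }
  put_fn(local, remote)
  # TRUE only if the remote path is reported as existing after the upload.
  isTRUE(exists_fn(remote))
}
```

With RHive loaded and connected, put_and_verify("/tmp/messages", "/messages_new") would perform the same upload as the example above.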
rhive.hdfs.rm
This does the same thing as "hadoop fs -rm", deleting files in HDFS.
rhive.hdfs.rm("/messages_new")
rhive.hdfs.ls("/")
  permission owner group      length   modify-time      file
1 rwxr-xr-x  root  supergroup 0        2011-12-07 14:27 /airline
2 rwxr-xr-x  root  supergroup 0        2011-12-07 13:16 /benchmarks
3 rw-r--r--  root  supergroup 11186419 2011-12-06 03:59 /messages
4 rwxr-xr-x  root  supergroup 0        2011-12-07 22:05 /mnt
5 rwxr-xr-x  root  supergroup 0        2011-12-13 20:24 /rhive
6 rwxr-xr-x  root  supergroup 0        2011-12-14 01:14 /user
You can see the "/messages_new" file has been deleted from within HDFS.
rhive.hdfs.rename
This does the same thing as "hadoop fs -mv".
That is, it renames files in HDFS or moves directories.
rhive.hdfs.rename("/messages", "/messages_renamed")
[1] TRUE
rhive.hdfs.ls("/")
  permission owner group      length   modify-time      file
1 rwxr-xr-x  root  supergroup 0        2011-12-07 14:27 /airline
2 rwxr-xr-x  root  supergroup 0        2011-12-07 13:16 /benchmarks
3 rw-r--r--  root  supergroup 11186419 2011-12-06 03:59 /messages_renamed
4 rwxr-xr-x  root  supergroup 0        2011-12-07 22:05 /mnt
5 rwxr-xr-x  root  supergroup 0        2011-12-13 20:24 /rhive
6 rwxr-xr-x  root  supergroup 0        2011-12-14 01:14 /user
rhive.hdfs.exists
This checks whether a file exists within HDFS. The closest Hadoop CLI
counterpart is "hadoop fs -test -e", which reports the result through its
exit status rather than printing it.
rhive.hdfs.exists("/messages_renamed")
[1] TRUE
rhive.hdfs.exists("/foobar")
[1] FALSE
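A common use of rhive.hdfs.exists is to guard another operation. The following is a minimal sketch, not part of RHive: remove_if_exists is a hypothetical helper that calls rhive.hdfs.rm only when the path is present. The exists_fn and rm_fn parameters are there only so the helper can be tried without a cluster.

```r
# Hypothetical helper (not an RHive function): delete an HDFS path only
# if it exists. Returns TRUE when a delete was issued, FALSE otherwise.
remove_if_exists <- function(path,
                             exists_fn = rhive.hdfs.exists,
                             rm_fn = rhive.hdfs.rm) {
  if (!isTRUE(exists_fn(path))) {
    return(FALSE)  # nothing to delete
  }
  rm_fn(path)
  TRUE
}
```

With RHive loaded and connected, remove_if_exists("/foobar") would skip the delete for the missing path used in the example above.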
rhive.hdfs.mkdirs
This does the same thing as "hadoop fs -mkdir".
It creates a directory in HDFS, including any missing parent directories.
rhive.hdfs.mkdirs("/newdir/newsubdir")
[1] TRUE
rhive.hdfs.ls("/")
  permission owner group      length   modify-time      file
1 rwxr-xr-x  root  supergroup 0        2011-12-07 14:27 /airline
2 rwxr-xr-x  root  supergroup 0        2011-12-07 13:16 /benchmarks
3 rw-r--r--  root  supergroup 11186419 2011-12-06 03:59 /messages_renamed
4 rwxr-xr-x  root  supergroup 0        2011-12-07 22:05 /mnt
5 rwxr-xr-x  root  supergroup 0        2011-12-14 02:13 /newdir
6 rwxr-xr-x  root  supergroup 0        2011-12-13 20:24 /rhive
7 rwxr-xr-x  root  supergroup 0        2011-12-14 01:14 /user
rhive.hdfs.ls("/newdir")
  permission owner group      length modify-time      file
1 rwxr-xr-x  root  supergroup 0      2011-12-14 02:13 /newdir/newsubdir
rhive.hdfs.close
This closes the HDFS connection once you have finished
working with HDFS.
rhive.hdfs.close()
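To make sure the connection is closed even when something fails midway, the close call can be registered with base R's on.exit. The following is a minimal sketch, not part of RHive: with_hdfs is a hypothetical wrapper, and it assumes that rhive.connect() also sets up the HDFS connection, as described earlier in this tutorial. The connect_fn and close_fn parameters are there only so the wrapper can be tried without a cluster.

```r
# Hypothetical wrapper (not an RHive function): connect, run `work`,
# and always close the HDFS connection afterwards.
with_hdfs <- function(work,
                      connect_fn = rhive.connect,
                      close_fn = rhive.hdfs.close) {
  connect_fn()
  # Registered only after a successful connect; runs on normal return
  # and also when `work` raises an error.
  on.exit(close_fn(), add = TRUE)
  work()
}
```

With RHive loaded, with_hdfs(function() rhive.hdfs.ls("/")) would list the root directory and then close the connection.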