RHive tutorial supplement 3 - RStudio-server setup for RHive
RHive tutorial – RStudio-server setup for RHive
This tutorial explains how to set up RStudio for using RHive more conveniently.
You can see a detailed how-to document about setting RStudio up at
http://rstudio.org.
A how-to for installing and using RStudio for RHive users is introduced here.
RHive is an R package that uses Hadoop and Hive to process massive data.
Much R code written with RHive finishes and returns results quickly, but
code that processes extremely large data may take a long time to finish its
analysis.
Depending on the size of the data and the complexity of the calculations, a
run can take anywhere from a few minutes to a couple of weeks.
The problem here is that the R session must be kept alive until the task
started by the user reaches completion.
If the user ran the code from a laptop, the laptop must stay on and keep its
session until the code finishes. A desktop has the same limitation: it cannot
be rebooted or otherwise interrupted until the task is completed.
Having to keep the session alive causes many other inconveniences as well.
This problem is not specific to RHive: it also occurs when using Hadoop or
Hive alone.
One way to solve this problem is to connect to a Hadoop client machine
through a terminal and run the code there in the background.
But this is not very convenient for R users, because it gives up the
convenience of the user's IDE and of the usual R working environment.
Also, a user who is not familiar with the terminal has the added
inconvenience of having to learn it.
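For comparison, the terminal approach described above can be sketched as follows. This is a minimal sketch, not part of the original tutorial: the helper name `run_detached` and the file names `analysis.R` and `analysis.log` are hypothetical, and it assumes `Rscript` is installed on the Hadoop client.

```shell
# Run a command detached from the terminal, sending all output to a log file.
# Prints the PID of the background process so it can be checked later.
run_detached() {
  log="$1"
  shift
  nohup "$@" > "$log" 2>&1 &
  echo "$!"
}

# Example (hypothetical script and log names): start an analysis, then log out.
# pid=$(run_detached analysis.log Rscript analysis.R)
# kill -0 "$pid" && echo "still running"
```

The task survives the end of the terminal session, but the user loses the interactive R environment, which is the inconvenience the rest of this tutorial addresses.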
RStudio is a good solution for this.
RStudio comes in desktop and server versions. The desktop version is a very
good IDE for R.
RStudio-server is accessed through a web browser, lets many people share
common resources, and has the advantage of keeping the user's session
alive.
If the Hadoop, Hive, and RHive installation is located in a restricted
network that must be reached through a firewall, the RStudio port can be
opened for that purpose.
You can use RHive more conveniently if you use RStudio-server with RHive.
Lastly, since RStudio facilitates connecting to the server’s R environment, it
enables sharing of RHive, Hadoop, and Hive between multiple people.
This tutorial will demonstrate how to install, connect to, and use
RStudio-server.
Installing RStudio-server
RStudio can be downloaded from its official site.
http://rstudio.org/
RStudio’s official site, rstudio.org, provides documents detailing how to easily
install and use RStudio.
The page below gives a guide on the installation so it is equally fine to peruse
that instead of this tutorial.
http://rstudio.org/download/server
This tutorial explains how to install RStudio-server on CentOS 5.
The majority of this installation guide is cited from the aforementioned site,
with partial changes.
Of course, you must install R before installing RStudio-server.
If you have read the previous RHive tutorials and installed RHive accordingly,
then the installation of R should already be complete.
The steps are nevertheless repeated here.
To install the newest version of R, first add the EPEL repository as follows.
$ sudo rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm
Now install R.
$ sudo yum install R R-devel
When installing RHive, remember to install not only R but also R-devel.
Before installing RStudio-server, you must first know whether your server is of
a 32bit architecture or a 64bit architecture.
Recent servers would most likely be 64bit and you can confirm this via the
uname command.
$ uname -m
x86_64
The output above shows that this server has a 64bit architecture.
Now download the appropriate RStudio version for your architecture.
Installing for 32-bit:
$ wget http://download2.rstudio.org/rstudio-server-0.94.110-i686.rpm
$ sudo rpm -Uvh rstudio-server-0.94.110-i686.rpm
Installing for 64-bit:
$ wget http://download2.rstudio.org/rstudio-server-0.94.110-x86_64.rpm
$ sudo rpm -Uvh rstudio-server-0.94.110-x86_64.rpm
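The choice between the two packages can also be made by a small script. The following is only a sketch: the function name `rstudio_arch` is hypothetical, and the package file names are the ones used in this tutorial, which may no longer be available for download.

```shell
# Map the output of `uname -m` to the architecture suffix used in the
# RStudio-server package names of this tutorial (i686 or x86_64).
rstudio_arch() {
  case "$1" in
    x86_64) echo "x86_64" ;;
    i386|i486|i586|i686) echo "i686" ;;
    *) return 1 ;;   # unknown architecture: no matching package
  esac
}

if arch=$(rstudio_arch "$(uname -m)"); then
  echo "package to download: rstudio-server-0.94.110-${arch}.rpm"
else
  echo "no RStudio-server package listed in this tutorial for $(uname -m)"
fi
```

The script only prints the file name. Downloading and installing it is still done with wget and rpm as shown above.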
Making a User Account
In order to connect to RStudio-server, a user account must exist in the server
where RStudio-server is installed.
RStudio-server does not allow connecting via the root account, so accounts
for normal users are needed.
Connect to the server to create accounts for would-be users of RStudio-server
and set their passwords.
$ ssh root@10.1.1.1
$ adduser user1
$ passwd user1
The user1 above is an arbitrarily named account, so name one to your liking.
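When several people will share the server, the account creation step can be repeated in a loop. This is a sketch under assumptions: the user names and the `DRY_RUN` switch are hypothetical, and by default the script only prints the commands instead of running them.

```shell
# Hypothetical list of accounts; replace with the real user names.
USERS="user1 user2 user3"

# Print the commands by default; set DRY_RUN=0 and run as root to apply them.
DRY_RUN="${DRY_RUN:-1}"

create_account() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "adduser $1"
  else
    adduser "$1"
  fi
}

for u in $USERS; do
  # Skip root, since RStudio-server does not accept root logins.
  if [ "$u" != "root" ]; then
    create_account "$u"
  fi
done
```

Each account still needs a password set with passwd before its owner can log in to RStudio-server.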
Starting RStudio-server
RStudio-server must run as a background process (daemon mode).
Connect to the server and start it as shown below.
$ ssh root@10.1.1.1
$ /etc/init.d/rstudio-server start
You can easily run it like above.
Connecting to RStudio-server
You can use a web browser to connect to the RStudio-server.
Run your web browser and connect to the RStudio-server’s URL.
http://10.1.1.1:8787
RStudio-server listens on port 8787 by default.
You can change this to another port as needed.
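Before opening a browser, it can be handy to check from a terminal that the server answers on the expected port. The sketch below assumes curl is installed. The function names `rstudio_url` and `rstudio_reachable` are hypothetical, and 10.1.1.1 is the example address used in this tutorial.

```shell
# Build the RStudio-server URL for a host; the port defaults to 8787.
rstudio_url() {
  host="$1"
  port="${2:-8787}"
  echo "http://${host}:${port}"
}

# Succeed if something answers HTTP on that URL within 5 seconds.
rstudio_reachable() {
  curl --silent --output /dev/null --max-time 5 "$(rstudio_url "$@")"
}

# Example:
# rstudio_reachable 10.1.1.1 && echo "RStudio-server is answering"
# rstudio_reachable 10.1.1.1 8080   # if the port was changed
```

If the check fails from outside the network but succeeds on the server itself, the port is probably blocked by a firewall.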
Now you can connect to RStudio-server and perform massive data analysis
with R and RHive.
Tips for using RHive in RStudio
While working in RStudio-server, RHive may fail to load because the
environment variables are not set properly.
You can solve this by adding code that assigns values to the environment
variables.
Sys.setenv(HADOOP_HOME="/mnt/srv/hadoop-0.20.203.0")
Sys.setenv(HIVE_HOME="/mnt/srv/hive-0.7.1")
Sys.setenv(RHIVE_DATA="/mnt/srv/rhive_data")
library(RHive)
HADOOP_HOME and HIVE_HOME must be set to the home directories of Hadoop
and Hive on the server where RStudio is installed.
RHIVE_DATA refers to a temporary directory which RHive will use; it is
created on each Hadoop node.
The environment variables should be set before loading RHive with the
library function.
If you have loaded RHive without setting the environment variables, then you
can set them and then use the rhive.init() function to initialize RHive.
library(RHive)
Sys.setenv(HADOOP_HOME="/mnt/srv/hadoop-0.20.203.0")
Sys.setenv(HIVE_HOME="/mnt/srv/hive-0.7.1")
Sys.setenv(RHIVE_DATA="/mnt/srv/rhive_data")
rhive.init()
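To avoid repeating the Sys.setenv calls in every session, the same values can be stored in the user's `~/.Renviron` file, which R reads at startup. The following is a sketch, not part of the original setup: the function name `append_rhive_env` is hypothetical, the paths are the example paths from this tutorial, and it assumes that RStudio-server sessions on your system honor `~/.Renviron` like a regular R session.

```shell
# Append the RHive-related variables to an Renviron-style file.
# The paths are the example locations from this tutorial; adjust them.
append_rhive_env() {
  target="$1"
  cat >> "$target" <<'EOF'
HADOOP_HOME=/mnt/srv/hadoop-0.20.203.0
HIVE_HOME=/mnt/srv/hive-0.7.1
RHIVE_DATA=/mnt/srv/rhive_data
EOF
}

# Example: run once as the RStudio-server user, then start a new R session.
# append_rhive_env "$HOME/.Renviron"
```

Because the function appends, running it twice leaves duplicate lines in the file, so check the file first if you are unsure.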
You have now finished setting up an environment in which you can write R
code in RStudio and use RHive to work with Hive and Hadoop.