An overview of working with Big Data in RStudio: import and export, R-Hadoop, and the difference between base functions and big data functions such as read.csv.ffdf that offer optimized memory management.
Let me know if anything is required. Ping @ bobrupakroy
2. Working with Big Data
R provides two ways to work with Big Data: one uses
R-Hadoop functions, and the other uses R's built-in base packages
and functions, which rely on the system's RAM.
The problem with R's built-in base functions is that the amount of
data they can handle is limited by the system's available RAM.
Therefore, more system memory allows larger data sets and better performance.
A common memory-related error in R reads "cannot
allocate vector of size ...", which is caused by this memory limitation.
So R developers created special packages and functions to handle
big data in R through better memory management.
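As a rough illustration of the memory limit described above, the sketch below uses only base R. A numeric value takes 8 bytes, so the size of a purely numeric table can be estimated before loading it. The row and column counts are made-up figures for illustration, not taken from any real file.

```r
# Estimate the RAM needed by a purely numeric table (8 bytes per value).
# n_rows and n_cols are hypothetical figures chosen for illustration.
n_rows <- 5e6
n_cols <- 20
est_gb <- n_rows * n_cols * 8 / 1024^3
cat(sprintf("Estimated size: %.2f GB\n", est_gb))

# object.size() reports the memory used by an object already held in R.
x <- numeric(1e6)
print(object.size(x), units = "MB")
```

If the estimate comes close to the available RAM, base functions such as read.table are likely to fail, and one of the big data functions is the safer choice.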
Rupak Roy
3. R-Hadoop
R-Hadoop is another way to work with big data: it integrates the R programming
language with Hadoop.
Because base R can only handle as much data as the system's RAM
allows, R uses special packages and functions to send user instructions
to the Hadoop framework for processing and to get the results
back.
The reasons why R-Hadoop is a good fit for big data analytics:
R is an interactive language.
It is also useful for advanced data visualizations.
It can easily implement statistical programming features like
predictive analysis.
#To know more about integrating R and Hadoop, follow
our Big Data Analytics module.
4. fread()
R's special packages and functions to read big data:
1. fread(): similar to read.table in terms of functionality but faster and
more efficient, with more parameters.
Controls such as sep, colClasses and nrows are detected automatically.
Integer data types are also detected and read directly. Depending on the
data.table version, dates may be read as character; they can be converted
afterwards using the fasttime package or standard R base functions.
>bigdata <- fread(input, sep = "auto", header = "auto", nrows = -1L,
stringsAsFactors = FALSE, ...)
where
input = name of the file to read
nrows = the number of rows to read; -1 means all rows.
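The automatic detection described above can be tried on a small throwaway file. This is a minimal sketch that assumes the data.table package is installed; the column names and values are invented for illustration.

```r
# Sketch only: runs when the data.table package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  # Write a small temporary CSV so the example is self-contained.
  tmp <- tempfile(fileext = ".csv")
  write.csv(data.frame(id = 1:5,
                       city = c("A", "B", "C", "D", "E"),
                       sales = c(10.5, 20, 7.25, 3, 9)),
            tmp, row.names = FALSE)

  # sep and header are detected automatically; nrows limits the rows read.
  dt <- data.table::fread(tmp, nrows = 3, stringsAsFactors = FALSE)
  print(dt)
  print(sapply(dt, class))  # column types detected by fread
  unlink(tmp)
}
```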
5. Base functions Vs fread()
Using the standard R base function:
>system.time(store <- read.table("store.csv", header = TRUE, sep = ",", fill = TRUE,
nrows = 28000))
user system elapsed
0.50 0.00 0.52 #output
where fill = TRUE means that if rows have unequal length, blank fields are
implicitly added.
-------------------------------------------------------------
>install.packages("data.table") #if the package is not installed
>library(data.table) #load the data.table package, which provides fread()
>system.time(store <- fread("store.csv", header = "auto", sep = "auto", nrows
= 28000))
user system elapsed
0.05 0.00 0.05 #output
system.time(): returns the CPU time (user and system) and the elapsed time taken to execute the code.
?data.table::fread - fread() is similar to read.table but is a separate, faster implementation
for reading big data efficiently. To know more about the features of fread() use
>?data.table::fread
6. read.csv.sql()
2. read.csv.sql(): reads a file into R, filtering it with an SQL
statement so that only the filtered rows are loaded, which lets it handle large files in R.
>bigdata <- read.csv.sql(file, sql = "…", header = TRUE, sep = ",", nrows,
row.names, skip, ...)
where
file = name of the file to read
sql = SQL statement used to filter the data
header, sep = as in read.csv
nrows, row.names, skip = as in read.csv
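The filtering behaviour can be sketched on a small temporary file. This assumes the sqldf package (and its SQLite backend) is installed; the State and Assault columns and their values are invented for illustration. The file is written without quotes because read.csv.sql does not strip quotation marks from fields on its own.

```r
# Sketch only: runs when the sqldf package is available.
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)

  # Small throwaway CSV, written without quotes around the text fields.
  tmp <- tempfile(fileext = ".csv")
  write.csv(data.frame(State = c("X", "Y", "Z"),
                       Assault = c(5, 12, 30)),
            tmp, row.names = FALSE, quote = FALSE)

  # Inside the SQL statement the input is always referred to as "file".
  big <- read.csv.sql(tmp,
                      sql = "select * from file where Assault >= 10",
                      header = TRUE, sep = ",")
  print(big)
  unlink(tmp)
}
```

Only the rows that satisfy the WHERE clause reach R, which is what keeps memory use low on large files.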
7. Base functions Vs read.csv.sql()
Using the standard R base function:
>system.time(crimedata <- read.table("crime_data.csv", header = TRUE,
sep = ","))
user system elapsed
0.00 0.00 0.08 #output
-------------------------------------------------------------
>install.packages("sqldf") #if the package is not installed
>library(sqldf) #load the sqldf package, which provides read.csv.sql()
>system.time(crimedata <- read.csv.sql("crime_data.csv",
sql = "select * from file where Assault >= 10", header = TRUE, sep = ","))
user system elapsed
0.05 0.00 0.05 #output
?sqldf::read.csv.sql - read.csv.sql() reads the file into a temporary database and uses
the rich features of structured query language (SQL) to filter the
data, so that large files can be handled. To know more about the features of
read.csv.sql use >?sqldf::read.csv.sql
8. read.csv.ffdf()
3. read.csv.ffdf(): reads input file data into ffdf (ff data frame) objects,
very much like (and using) read.csv and read.table but with more
effective memory management than standard functions.
>bigdata <- read.csv.ffdf(file = "file.csv", header = FALSE, VERBOSE = TRUE,
first.rows = 30000, next.rows = 30000)
where
file = the name of the file from which the data are to be read
VERBOSE = show timings for each processed chunk (default FALSE)
first.rows = number of rows to be read in the first chunk
next.rows = number of rows to be read in further chunks
9. Base functions Vs read.csv.ffdf()
>install.packages("ff") #if the package is not installed
>library(ff) #load the ff package, which provides read.csv.ffdf()
>system.time(bigdata <- read.csv.ffdf(file = "store.csv", header = TRUE, VERBOSE = TRUE,
first.rows = 40000, next.rows = 9000,
colClasses = c("factor","factor","factor","numeric","factor")))
The verbose output shows that the first chunk, rows 1 to 40,000, took
0.47 sec, the next chunk of 9,000 rows, 40,001 to 49,000, took 0.19 sec, and so on.
?ff::read.table.ffdf - it works with convenience wrappers such as read.csv
and reads large files in row chunks. By default the first chunk is read with
1,000 rows; for subsequent chunks it adjusts to RAM availability. To know more
about the features of read.table.ffdf use >?ff::read.table.ffdf
10. Exporting Big Data
We can also use base R functions such as write.csv() and write.table()
to export big data.
In addition,
write.csv.ffdf() exports ffdf objects (ff data frames) to text files.
11. Troubleshoot errors
Important points to remember:
1. Error in scan…… lines did not have 5 elements.
If the rows have unequal length, importing the file throws this error.
The solution is to use fill = TRUE, which pads rows of unequal
length with blank fields.
So the correct code will be
store <- read.table("store.csv", header = TRUE, sep = ",", nrows = 661,
blank.lines.skip = TRUE, fill = TRUE)
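The error and its fix can be reproduced with base R alone. The sketch below writes a tiny file in which one row is missing its last field; the file contents are invented for illustration.

```r
# Small temporary file where the second data row is one field short.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,city,sales",
             "1,Pune,10",
             "2,Delhi",
             "3,Mumbai,7"), tmp)

# Without fill = TRUE the import stops with a scan() error.
res <- try(read.table(tmp, header = TRUE, sep = ","), silent = TRUE)
print(inherits(res, "try-error"))

# With fill = TRUE the short row is padded (NA in a numeric column).
store <- read.table(tmp, header = TRUE, sep = ",", fill = TRUE)
print(store)
unlink(tmp)
```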
12. Troubleshoot errors
2. Error in ff…… vmode character not implemented.
This is because ff does not support character vectors, so they need to be stored as
factors. The disadvantage is that the factor levels are kept in RAM, so a
large number of levels might cause memory problems.
In this example integer did not work either, so the numeric column is read as "numeric".
So the correct code will be
bigdata <- read.csv.ffdf(file = "store.csv", header = TRUE, VERBOSE = TRUE, first.rows
= 40000, next.rows = 9000, colClasses = c("factor","factor","factor",
"numeric","factor"))
13. Next:
We will learn how to import, export and directly read
the worksheets of an Excel file.
Import and export Big Data