1. R SEMINAR
Antony Karanja N.
Research Methods Group, ICRAF
2nd April, 15
Data Management and Analysis
2. AIM
• Recap of the steps and tips for learning to
code in R
• Introduction to dplyr package
• How to use the dplyr package for data
manipulation and basic statistics
• Ultimate: dplyr and ggplot2
3. RECAP
• Set working directory (creating project, setwd)
• Installing and calling library packages
• Reading/loading data (read.???)
• What is the R object type (class)
• Variables within data frames
• Checking the data type of each variable
• Viewing the head and tail of the data
4. RECAP
###################
# IMPORT datasets #
###################
tree<-read.csv(file="datavis.csv",header=T)
#-------------------------
# Inspect data with head()
#-------------------------
names(tree);colnames(tree)
head(tree)
tail(tree)
#-------------------------
# Inspect R object type
#-------------------------
class(tree)
#-------------------------
# Inspect Internal structure of R object type
#-------------------------
str(tree)
glimpse(tree) # glimpse() comes from dplyr; run library(dplyr) first
#-------------------------
# Inspect data types
#-------------------------
sapply(tree,class) #-horizontal view
lapply(tree,class) #-Vertical view
##############################
# LOOK FOR DUPLICATE RECORDS #
##############################
duplicates<-tree[duplicated(tree[c("Country","Site","PosTopoSeq")]),] #Base function; duplicated() flags every repeated row, anyDuplicated() only returns the index of the first one
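The difference between anyDuplicated() and duplicated() can be sketched on a small example. The toy data frame below is invented for illustration; only the column names are taken from the slides.

```r
# Toy data invented for illustration; only the column names come from the slides.
toy <- data.frame(
  Country    = c("Nicaragua", "Nicaragua", "South Africa", "Nicaragua"),
  Site       = c("A", "A", "B", "A"),
  PosTopoSeq = c("Upper", "Lower", "Upper", "Upper"),
  pH         = c(5.2, 6.1, 4.8, 5.3)
)
keys <- c("Country", "Site", "PosTopoSeq")

# anyDuplicated() returns a single index: the first repeated row (0 if none).
first_dup <- anyDuplicated(toy[keys])

# duplicated() returns one logical per row, so it can subset every repeat.
dup_rows <- toy[duplicated(toy[keys]), ]

# To see the first occurrence as well as its repeats, also scan from the end.
all_copies <- toy[duplicated(toy[keys]) | duplicated(toy[keys], fromLast = TRUE), ]
```

Here row 4 repeats the key combination of row 1, so dup_rows holds row 4 only and all_copies holds rows 1 and 4.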
6. filter()
• filter() allows you to select a subset of the rows of a
data frame.
• filter() works similarly to subset()
• filter(df, condition(s)), where df is the data frame
#1.0 #### filter - combine conditions with AND (comma or &) or OR (|)
table(tree$Country)
Nicaragua<-filter(tree, Country == "Nicaragua")
SA<-filter(tree, Country == "South Africa")
#1.1 #### slice - select rows by position
Nicaragua2<-slice(tree, 1:16)
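The AND and OR forms mentioned in the comment above are not shown on the slide. A minimal sketch, assuming a toy data frame whose values are invented for illustration:

```r
library(dplyr)

# Toy data invented for illustration.
toy <- data.frame(
  Country = c("Nicaragua", "Nicaragua", "South Africa", "Kenya"),
  pH      = c(5.2, 6.1, 4.8, 6.5)
)

# AND: a comma between conditions behaves like &
nic_acid <- filter(toy, Country == "Nicaragua", pH < 6)

# OR: use | between conditions
nic_or_sa <- filter(toy, Country == "Nicaragua" | Country == "South Africa")

# %in% is a shorter way to write several ORs on the same column
nic_or_sa2 <- filter(toy, Country %in% c("Nicaragua", "South Africa"))
```

The last two calls should return the same three rows; %in% tends to read better once more than two values are involved.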
7. arrange()
• arrange() works similarly to filter() except that
instead of filtering or selecting rows, it reorders
them.
#2.0 #### arrange
arrange(tree, Site,PosTopoSeq,VegStructure)
tree_arr<-arrange(tree, Site,PosTopoSeq,VegStructure)
tree_arr<-arrange(tree, desc(Site),PosTopoSeq,VegStructure)
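How the sort keys interact is easier to see on a few rows. A small sketch with invented values: the first column listed is the main sort key, later columns break ties, and desc() reverses the order of one key.

```r
library(dplyr)

# Toy data invented for illustration.
toy <- data.frame(
  Site = c("B", "A", "B", "A"),
  pH   = c(4.8, 6.1, 5.5, 5.2)
)

# Sort by Site (ascending), then by pH within each Site.
asc <- arrange(toy, Site, pH)

# Sort by Site in descending order, pH still ascending within each Site.
desc_site <- arrange(toy, desc(Site), pH)
```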
8. select()
• Very helpful when working with dataset with many
columns/variables
• Helper function within select() include starts_with(),
ends_with(), matches() and contains()
#2.0 #### select
tree_select<-select(tree,Country,SEVEREERO,avSlope,avTreeDen,Carbon,pH,Clay)
tree_select<-select(tree,Country,SEVEREERO,avSlope,avTreeDen,Carbon,pH>=5,Clay)
# Error!
# What is happening here? select() only picks columns by name; a row condition such as pH>=5 belongs in filter()
tree_select<-select(tree,-c(Site,PosTopoSeq,VegStructure))
tree_select<-select(tree,-(Site:VegStructure))
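The select() call with pH>=5 above fails because a row condition was passed where a column name is expected. One way to get the intended result is to filter the rows first and then select the columns. This is a sketch on invented toy values, not the seminar data:

```r
library(dplyr)

# Toy data invented for illustration; only the column names come from the slides.
toy <- data.frame(
  Country = c("Nicaragua", "South Africa", "Nicaragua"),
  Site    = c("A", "B", "C"),
  pH      = c(5.2, 4.8, 6.1),
  Clay    = c(30, 45, 25)
)

# Step 1: filter() keeps the rows where the condition holds.
toy_ph <- filter(toy, pH >= 5)

# Step 2: select() keeps the named columns.
toy_ph_select <- select(toy_ph, Country, pH, Clay)
```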
9. select()
#2.0.1 select and helper functions
# Keep the matching variables, or drop them with a negative sign (-)
select(tree, starts_with("av",ignore.case=T),starts_with("C"))
select(tree, ends_with("e"))
select(tree, contains("p"))
select(tree, matches("av"))
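To check what each helper picks up, it helps to look only at the resulting column names. A sketch on a one-row toy data frame (values invented; column names taken from the slides). Note that the helpers ignore case by default and that matches() takes a regular expression.

```r
library(dplyr)

# One-row toy data frame; values are invented for illustration.
toy <- data.frame(Site = "A", avSlope = 12, avTreeDen = 40,
                  Carbon = 1.8, pH = 5.2, Clay = 30)

sw <- names(select(toy, starts_with("av")))  # names beginning with "av"
ew <- names(select(toy, ends_with("e")))     # names ending in "e"
ct <- names(select(toy, contains("p")))      # names containing "p" or "P"
mt <- names(select(toy, matches("^av")))     # regular expression: starts with "av"
dr <- names(select(toy, -starts_with("av"))) # drop instead of keep
```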
10. rename()
• Assigns a new name to an existing
variable (new_name = old_name)
#2.1 #### rename
tree_rename<-rename(tree,Slope=avSlope)
tree_rename<-rename(tree,Slope=avSlope,TreeDen=avTreeDen)
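A quick way to confirm the new_name = old_name order is to compare the column names before and after. A sketch on invented toy values:

```r
library(dplyr)

# Toy data invented for illustration; only the column names come from the slides.
toy <- data.frame(Country = "Nicaragua", avSlope = 12, avTreeDen = 40)

# rename() keeps every column and its position; only the names change.
toy_rename <- rename(toy, Slope = avSlope, TreeDen = avTreeDen)
new_names <- names(toy_rename)
```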
12. mutate()
• Adds new columns that are functions of
existing columns.
#4.0 ### Mutate
tree_mute<-mutate(tree,Acidbase = 7-pH,clay.cover = Clay / avTreeDen)
#4.0.1 ### transmute - like mutate(), but keeps only the new columns
tree_mute<-transmute(tree,Acidbase = 7-pH,clay.cover = Clay / avTreeDen)
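The two calls above look identical apart from the function name, so the difference is easiest to see from the columns they return. A sketch on invented toy values:

```r
library(dplyr)

# Toy data invented for illustration; only the column names come from the slides.
toy <- data.frame(pH = c(5.2, 4.8), Clay = c(30, 45), avTreeDen = c(10, 15))

# mutate() keeps the original columns and appends the new ones.
with_new <- mutate(toy, Acidbase = 7 - pH, clay.cover = Clay / avTreeDen)

# transmute() returns only the newly computed columns.
only_new <- transmute(toy, Acidbase = 7 - pH, clay.cover = Clay / avTreeDen)

names(with_new)
names(only_new)
```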
13. sample_n()
• Use sample_n() and sample_frac() to take a
random sample of rows
#5.0 ### sample_n()
sample_n(tree, 10,replace=F)
#5.0.1 ### sample_frac()
sample_frac(tbl=tree, size=0.1)
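Random samples change on every run unless the random seed is fixed. A sketch, assuming an invented toy data frame and an arbitrary seed value of 42; in dplyr 1.0.0 and later, slice_sample(n = ...) and slice_sample(prop = ...) are the newer equivalents of these two functions.

```r
library(dplyr)

# Toy data invented for illustration.
toy <- data.frame(id = 1:20, pH = seq(4.1, 6.0, length.out = 20))

# Fixing the seed (42 is an arbitrary choice) makes the sample reproducible.
set.seed(42)
s1 <- sample_n(toy, 5)
set.seed(42)
s2 <- sample_n(toy, 5)
same_sample <- identical(s1, s2)

# sample_frac() takes a proportion: 10% of 20 rows gives 2 rows.
tenth <- sample_frac(toy, 0.1)
```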
14. summarise()
• Generates summary statistics from existing columns/variables.
Can also generate statistics by grouping variable(s) when combined with group_by()
summarise(tree,
count = n(),
MeanCarb = mean(Carbon, na.rm = TRUE),
MeanClay = mean(Clay, na.rm = TRUE),
MedPh=median(pH,na.rm=T))
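The call above summarises the whole data frame. Statistics by group can be sketched by passing the data through group_by() first. The toy values below are invented for illustration; the column names follow the slides.

```r
library(dplyr)

# Toy data invented for illustration; only the column names come from the slides.
toy <- data.frame(
  Country = c("Nicaragua", "Nicaragua", "South Africa", "South Africa"),
  Carbon  = c(1.0, 3.0, 2.0, NA),
  pH      = c(5.0, 6.0, 4.5, 5.5)
)

# group_by() marks the grouping variable; summarise() then returns one row per group.
by_country <- group_by(toy, Country)
country_stats <- summarise(by_country,
  count    = n(),
  MeanCarb = mean(Carbon, na.rm = TRUE),
  MedPh    = median(pH, na.rm = TRUE)
)
```

n() counts all rows in the group, including those with a missing Carbon value, while na.rm = TRUE drops the missing value before the mean is taken.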
18. Update R
For Windows OS
# installing/loading the package:
if(!require(installr)) {
  install.packages("installr")
  require(installr)
} # load / install+load installr
# using the package:
updateR() # this will start the updating process of your R installation.
Note: It will check for newer versions, and if one is available, will guide you
through the decisions you'd need to make.
19. Exercise
Use the data you are working on and:
1. Manipulate it using the functions above
2. Explore more dplyr functions, e.g. how to combine data
row-wise and column-wise, etc.
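As a possible starting point for the second task, dplyr provides bind_rows() and bind_cols(). A sketch on invented toy data frames; the object names are hypothetical.

```r
library(dplyr)

# Toy data invented for illustration.
site_a <- data.frame(Site = c("A", "A"), pH = c(5.2, 6.1))
site_b <- data.frame(Site = "B", pH = 4.8)

# Row-wise: stack data frames; columns are matched by name.
stacked <- bind_rows(site_a, site_b)

# Column-wise: place data frames side by side; they must have the same number of rows.
extra   <- data.frame(Clay = c(30, 25, 45))
widened <- bind_cols(stacked, extra)
```

bind_cols() matches rows by position only, so both inputs need to be in the same row order.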