R seminar dplyr package

R SEMINAR
Antony Karanja N.
Research Methods Group, ICRAF
2nd April, 15
Data Management and Analysis

AIM
• Recap of the steps and tips for learning to code in R
• Introduction to the dplyr package
• How to use the dplyr package for data
manipulation* and basic statistics
• Ultimate: dplyr and ggplot2

RECAP
• Set working directory (creating project, setwd)
• Installing and calling library packages
• Reading/loading data (read.???)
• What is the R object type (class)
• Variables within data frames
• Knowing the data type of each variable
• Viewing the head and tail of the data

RECAP
###################
# IMPORT datasets #
###################
tree<-read.csv(file="datavis.csv",header=T)
#-------------------------
# Inspect data with head()
#-------------------------
names(tree);colnames(tree)
head(tree)
tail(tree)
#-------------------------
# Inspect R object type
#-------------------------
class(tree)
#-------------------------
# Inspect Internal structure of R object type
#-------------------------
str(tree)
glimpse(tree) # glimpse() comes from dplyr, so run library(dplyr) first
#-------------------------
# Inspect data types
#-------------------------
sapply(tree,class) #-horizontal view
lapply(tree,class) #-Vertical view
##############################
# LOOK FOR DUPLICATE RECORDS #
##############################
# duplicated() flags every repeated row; anyDuplicated() returns only the index of the first one
duplicates<-tree[duplicated(tree[c("Country","Site","PosTopoSeq")]),] #Base function
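
A small base-R sketch of how anyDuplicated() and duplicated() differ when looking for repeated records. The data frame tree_toy and its values are made up for illustration; only the column names follow the seminar dataset.

```r
# Toy stand-in for `tree`; the values are invented for illustration only.
tree_toy <- data.frame(
  Country    = c("Nicaragua", "Nicaragua", "Nicaragua", "South Africa", "Nicaragua"),
  Site       = c("A", "A", "B", "C", "A"),
  PosTopoSeq = c(1, 1, 2, 1, 1),
  Carbon     = c(1.2, 1.2, 0.8, 2.1, 1.5),
  stringsAsFactors = FALSE
)
key <- tree_toy[c("Country", "Site", "PosTopoSeq")]

# anyDuplicated() returns a single number: the index of the first repeat (0 if none)
first_dup <- anyDuplicated(key)

# duplicated() returns one TRUE/FALSE per row, so it can pick out every repeat
dup_rows <- tree_toy[duplicated(key), ]

# To list all copies, including the first occurrence, also scan from the end
all_copies <- tree_toy[duplicated(key) | duplicated(key, fromLast = TRUE), ]
```

Subsetting with the single index from anyDuplicated() would return at most one row, which is why duplicated() is the safer choice for listing duplicate records.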

dplyr
• #install.packages("dplyr")
• library(dplyr)
• A grammar of data manipulation
– filter() (and slice())
– arrange()
– select() (and rename())
– distinct()
– mutate() (and transmute())
– summarise()
– sample_n() and sample_frac()

filter()
• filter() allows you to select a subset of the rows of a
data frame.
• filter() works similarly to subset()
• Usage: filter(df, condition(s)), where df is a data frame
#1.0 #### filter - combine conditions with AND (comma or &) or with OR (|)
table(tree$Country)
Nicaragua<-filter(tree, Country == "Nicaragua")
SA<-filter(tree, Country == "South Africa")
#1.1 #### slice
Nicaragua2<-slice(tree, 1:16)
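
A minimal sketch of combining conditions in filter(), assuming dplyr is installed. tree_toy is a made-up data frame that borrows a few column names from the seminar data.

```r
library(dplyr)

# Toy stand-in for `tree`; the values are invented for illustration only.
tree_toy <- data.frame(
  Country = c("Nicaragua", "Nicaragua", "South Africa", "South Africa"),
  pH      = c(5.5, 4.8, 6.2, 4.9),
  Clay    = c(30, 45, 20, 50)
)

# A comma (or &) means AND: both conditions must hold
nic_acid <- filter(tree_toy, Country == "Nicaragua", pH < 5)

# | means OR: at least one condition must hold
acid_or_sandy <- filter(tree_toy, pH < 5 | Clay < 25)

# slice() picks rows by position rather than by condition
first_two <- slice(tree_toy, 1:2)
```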

arrange()
• arrange() works similarly to filter() except that
instead of filtering or selecting rows, it reorders
them.
#2.0 #### arrange
arrange(tree, Site,PosTopoSeq,VegStructure)
tree_arr<-arrange(tree, Site,PosTopoSeq,VegStructure)
tree_arr<-arrange(tree, desc(Site),PosTopoSeq,VegStructure)
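
A short sketch of how desc() changes the sort order of one column while the remaining columns still sort ascending. tree_toy and its values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for `tree`; the values are invented for illustration only.
tree_toy <- data.frame(
  Site       = c("B", "A", "C", "A"),
  PosTopoSeq = c(2, 3, 1, 1)
)

# Ascending by Site, ties broken by PosTopoSeq
asc <- arrange(tree_toy, Site, PosTopoSeq)

# Descending by Site only; PosTopoSeq still sorts ascending within each Site
mixed <- arrange(tree_toy, desc(Site), PosTopoSeq)
```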

select()
• Very helpful when working with dataset with many
columns/variables
• Helper functions within select() include starts_with(),
ends_with(), matches() and contains()
#2.0 #### select
tree_select<-select(tree,Country,SEVEREERO,avSlope,avTreeDen,Carbon,pH,Clay)
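tree_select<-select(tree,Country,SEVEREERO,avSlope,avTreeDen,Carbon,pH>=5,Clay)
# Error!
# What is happening here? select() only chooses columns, so a row
# condition such as pH>=5 is not allowed; use filter() for row conditions.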
tree_select<-select(tree,-c(Site,PosTopoSeq,VegStructure))
tree_select<-select(tree,-(Site:VegStructure))
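
One way to get both the row condition and the column subset is to chain filter() and select(). This is a sketch on a made-up data frame tree_toy, not the seminar data.

```r
library(dplyr)

# Toy stand-in for `tree`; the values are invented for illustration only.
tree_toy <- data.frame(
  Country = c("Nicaragua", "Nicaragua", "South Africa"),
  Site    = c("A", "B", "C"),
  Carbon  = c(1.2, 0.8, 2.1),
  pH      = c(5.5, 4.8, 6.2),
  Clay    = c(30, 45, 20)
)

# filter() handles the row condition, select() then keeps the wanted columns
tree_ph5 <- tree_toy %>%
  filter(pH >= 5) %>%
  select(Country, Carbon, pH, Clay)
```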

select()
#2.0.1 select and helper functions
# Keep the matching variables, or drop them with a minus sign (-)
select(tree, starts_with("av",ignore.case=T),starts_with("C"))
select(tree, ends_with("e"))
select(tree, contains("p"))
select(tree, matches("av"))
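
A sketch of what each helper matches, using a made-up data frame whose column names follow the seminar data. Note that the helpers ignore case by default, and that matches() takes a regular expression.

```r
library(dplyr)

# Toy stand-in for `tree`; the values are invented for illustration only.
tree_toy <- data.frame(
  Country   = c("Nicaragua", "South Africa"),
  avSlope   = c(12, 8),
  avTreeDen = c(150, 90),
  Carbon    = c(1.2, 2.1),
  pH        = c(5.5, 6.2),
  Clay      = c(30, 20)
)

av_cols   <- names(select(tree_toy, starts_with("av")))   # prefix match
e_cols    <- names(select(tree_toy, ends_with("e")))      # suffix match
p_cols    <- names(select(tree_toy, contains("p")))       # literal substring, case-insensitive
re_cols   <- names(select(tree_toy, matches("^av")))      # regular expression
not_av    <- names(select(tree_toy, -starts_with("av")))  # minus sign drops the matches
```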

rename()
• Assigns a new name to an existing variable
(new_name = old_name)
#2.1 #### rename
tree_rename<-rename(tree,Slope=avSlope)
tree_rename<-rename(tree,Slope=avSlope,TreeDen=avTreeDen)

distinct()
• Extract distinct (unique) rows
#3.0 ### distinct
tree_distinct<-distinct(tree)
tree_distinct<-distinct(select(tree,Country,Site,PosTopoSeq))
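
distinct() can also take the column names directly, which avoids the nested select(). A sketch on a made-up data frame tree_toy; the .keep_all argument keeps the other columns from the first row of each combination.

```r
library(dplyr)

# Toy stand-in for `tree`; the values are invented for illustration only.
tree_toy <- data.frame(
  Country = c("Nicaragua", "Nicaragua", "Nicaragua", "South Africa"),
  Site    = c("A", "A", "B", "C"),
  Carbon  = c(1.2, 1.5, 0.8, 2.1)
)

# Unique Country/Site combinations; only those two columns are returned
combos <- distinct(tree_toy, Country, Site)

# Same combinations, but keep all columns from the first row of each one
combos_full <- distinct(tree_toy, Country, Site, .keep_all = TRUE)
```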

mutate()
• Adds new columns that are functions of
existing columns.
#4.0 ### Mutate
tree_mute<-mutate(tree,Acidbase = 7-pH,clay.cover = Clay / avTreeDen)
#4.0.1 ### transmute (keeps only the new columns)
tree_mute<-transmute(tree,Acidbase = 7-pH,clay.cover = Clay / avTreeDen)
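
A sketch of the difference between mutate() and transmute(), on a made-up data frame tree_toy: mutate() appends the new columns, while transmute() returns only them.

```r
library(dplyr)

# Toy stand-in for `tree`; the values are invented for illustration only.
tree_toy <- data.frame(
  pH        = c(5.5, 7, 8),
  Clay      = c(30, 45, 20),
  avTreeDen = c(150, 90, 100)
)

# mutate(): original columns plus the two new ones
with_new <- mutate(tree_toy, Acidbase = 7 - pH, clay.cover = Clay / avTreeDen)

# transmute(): only the two new columns
only_new <- transmute(tree_toy, Acidbase = 7 - pH, clay.cover = Clay / avTreeDen)
```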

sample_n()
• use sample_n() and sample_frac() to take a
random sample of rows
#5.0 ### sample_n()
sample_n(tree, 10,replace=F)
#5.0.1 ### sample_frac()
sample_frac(tbl=tree, size=0.1)
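
Random samples change on every run unless the seed is fixed. A sketch with set.seed() on a made-up data frame; the seed value 2015 is arbitrary. In dplyr 1.0.0 and later, slice_sample() is the suggested replacement for both functions, though sample_n() and sample_frac() still work.

```r
library(dplyr)

# Toy stand-in for `tree`; the values are invented for illustration only.
tree_toy <- data.frame(id = 1:10, Carbon = seq(0.5, 5, by = 0.5))

set.seed(2015)                       # arbitrary seed, makes the draw repeatable
three_rows <- sample_n(tree_toy, 3, replace = FALSE)

set.seed(2015)
same_three <- sample_n(tree_toy, 3, replace = FALSE)   # same seed, same rows

half <- sample_frac(tree_toy, size = 0.5)              # 50% of 10 rows
```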

summarise()
• Generates statistics from the existing columns/variables.
Also generates statistics by grouping variable(s)
summarise(tree,
  count = n(),
  MeanCarb = mean(Carbon, na.rm = TRUE),
  MeanClay = mean(Clay, na.rm = TRUE),
  MedPh = median(pH, na.rm = TRUE))

summarise()
• Stats by grouping variable(s)
tree.summary <- tree %>%
  group_by(Country,Site,SEVEREERO) %>%
  summarise(count = n(),
    meanC = mean(Carbon,na.rm=T),
    meanClay = mean(Clay,na.rm=T),
    sdC=sd(Carbon,na.rm=T),
    sdClay=sd(Clay,na.rm=T),
    medPh=median(pH,na.rm=T))
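
A sketch of a grouped summary on a made-up data frame tree_toy, to show what the result looks like with a missing value and with a one-row group. The trailing ungroup() is an addition here, so that later steps do not silently run per group.

```r
library(dplyr)

# Toy stand-in for `tree`; the values are invented for illustration only.
tree_toy <- data.frame(
  Country = c("Nicaragua", "Nicaragua", "Nicaragua", "South Africa", "South Africa"),
  Site    = c("A", "A", "B", "C", "C"),
  Carbon  = c(1, 3, 2, 4, NA)
)

toy_summary <- tree_toy %>%
  group_by(Country, Site) %>%
  summarise(count = n(),                        # rows per group, NAs included
            meanC = mean(Carbon, na.rm = TRUE), # NA dropped before averaging
            sdC   = sd(Carbon, na.rm = TRUE)) %>%
  ungroup()
```

n() counts rows, including those with missing Carbon, and sd() returns NA for a group with a single non-missing value.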

BONUS: R Version
R.Version()$version.string
OR
R.version.string

Update R
For Windows OS
# installing/loading the package:
if(!require(installr)) {
install.packages("installr")
require(installr)} #load / install+load installr
# using the package:
updateR() # this will start the updating process of your R installation.
Note: It will check for newer versions, and if one is available, will guide you
through the decisions you'd need to make.

Exercise
Using the data you are working on:
1. Manipulate it using the functions above
2. Explore more dplyr functions, e.g. how to combine data row-wise
or column-wise, etc.
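
As a possible starting point for the second exercise, dplyr provides bind_rows() and bind_cols(). This is only a sketch; the data frames plots_a, plots_b and extra are made up for illustration.

```r
library(dplyr)

# Made-up data frames for illustration only.
plots_a <- data.frame(Site = c("A", "B"), Carbon = c(1.2, 0.8))
plots_b <- data.frame(Site = "C", Carbon = 2.1, pH = 6.2)
extra   <- data.frame(Clay = c(30, 45))

# Row-wise: columns are matched by name, missing ones are filled with NA
stacked <- bind_rows(plots_a, plots_b)

# Column-wise: rows are matched by position, so both inputs need the same row count
wider <- bind_cols(plots_a, extra)
```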
