Introduction to dplyr and base R functions for data manipulation
Kamal Gupta Roy
Last Edited on 3rd Nov 2021
Instructions/Agenda and Learnings
1. Use of functions like ls(), getwd(), setwd(), rm()
2. Install packages (dslabs, dplyr)
3. Load packages (dslabs, dplyr) with library()
4. Read the murders dataset
5. Functions: nrow, ncol, head, tail, summary, class (try it on both the data frame and a single variable), str, names, levels, nlevels
6. Accessing a data frame element by position
7. Reading a vector from data frame and doing basic arithmetic functions
8. Order/Arrange - Sorting the data
9. Selecting a column
10. Filtering rows
11. Creating a new variable
12. Summarizing data
13. Summarizing while grouping
14. Chaining Method
15. Exercise
dplyr functionality
• Five basic verbs: filter, select, arrange, mutate, summarise (plus group_by)
Basic Codes
Directory Details
#### workspace
ls()
## character(0)
# Show the current working directory
getwd()
## [1] "C:/Users/Debzitt/Dropbox (Erasmus Universiteit Rotterdam)/Kamal Gupta/AMSOM-Teaching/a. TOD531 -
# Set the working directory with setwd(); the path must be a quoted string
# setwd("C:/Users/Admin/")
getwd()
## [1] "C:/Users/Debzitt/Dropbox (Erasmus Universiteit Rotterdam)/Kamal Gupta/AMSOM-Teaching/a. TOD531 -
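The agenda lists rm() alongside ls(), but the handout does not show it in use. The following is a minimal sketch, assuming two throwaway objects (tmp_values and tmp_label are hypothetical names created only for this illustration):

```r
# Create two throwaway objects (hypothetical names, for illustration only)
tmp_values <- c(10, 20, 30)
tmp_label <- "scratch"

# ls() lists the names of the objects in the current workspace
print(ls())

# rm() removes objects by name
rm(tmp_values)

# rm() also accepts a character vector of names through `list`
rm(list = "tmp_label")

# Check that both objects are gone from the current environment
still_there <- c(
  tmp_values = exists("tmp_values", inherits = FALSE),
  tmp_label  = exists("tmp_label", inherits = FALSE)
)
print(still_there)
```

Calling rm(list = ls()) would clear every object in the workspace, so it is best used deliberately.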
Install packages
install.packages("dslabs")
install.packages("dplyr")
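Running install.packages() on every run re-downloads packages that are already present. One possible way to avoid that is sketched below; missing_packages and install_if_missing are hypothetical helper names, not part of the handout or of any package:

```r
# Return the subset of `pkgs` that is not installed yet
# (hypothetical helper, for illustration only)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only what is missing (hypothetical helper)
install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# "stats" and "utils" ship with R, so nothing is reported as missing here
base_check <- missing_packages(c("stats", "utils"))
print(base_check)

# For this handout the call would be:
# install_if_missing(c("dslabs", "dplyr"))
```

Note that requireNamespace() loads a package's namespace as a side effect when the package is available; it does not attach it, so library() is still needed afterwards.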
Load packages
library(dslabs)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Read dataframe
murder <- data.frame(murders)
Basic check on data
nrow(murder)
## [1] 51
ncol(murder)
## [1] 5
head(murder)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
murder[1,1]
## [1] "Alabama"
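murder[1,1] picks one cell by row and column position. The same bracket notation also selects whole rows, whole columns, or blocks. The sketch below assumes a toy data frame (murder_toy, a hypothetical name) rebuilt from the first three rows printed by head(murder), so that it runs without dslabs:

```r
# Toy subset: the first three rows shown by head(murder) above
murder_toy <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona"),
  abb        = c("AL", "AK", "AZ"),
  population = c(4779736, 710231, 6392017),
  total      = c(135, 19, 232),
  stringsAsFactors = FALSE
)

first_cell <- murder_toy[1, 1]           # row 1, column 1: a single value
second_row <- murder_toy[2, ]            # all columns of row 2: a one-row data frame
fourth_col <- murder_toy[, 4]            # all rows of column 4: a plain vector
top_block  <- murder_toy[1:2, c(1, 4)]   # rows 1 to 2, columns 1 and 4

print(first_cell)
print(second_row)
print(fourth_col)
print(top_block)
```

Leaving the row or column position empty means "all rows" or "all columns" respectively.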
tail(murder)
## state abb region population total
## 46 Vermont VT Northeast 625741 2
## 47 Virginia VA South 8001024 250
## 48 Washington WA West 6724540 93
## 49 West Virginia WV South 1852994 27
## 50 Wisconsin WI North Central 5686986 97
## 51 Wyoming WY West 563626 5
summary(murder)
##     state               abb                      region     population
##  Length:51          Length:51          Northeast    : 9   Min.   :  563626
##  Class :character   Class :character   South        :17   1st Qu.: 1696962
##  Mode  :character   Mode  :character   North Central:12   Median : 4339367
##                                        West         :13   Mean   : 6075769
##                                                           3rd Qu.: 6636084
##                                                           Max.   :37253956
##      total
##  Min.   :   2.0
##  1st Qu.:  24.5
##  Median :  97.0
##  Mean   : 184.4
##  3rd Qu.: 268.0
##  Max.   :1257.0
class(murder)
## [1] "data.frame"
class(murder$state)
## [1] "character"
str(murder)
## 'data.frame': 51 obs. of 5 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ abb : chr "AL" "AK" "AZ" "AR" ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## $ population: num 4779736 710231 6392017 2915918 37253956 ...
## $ total : num 135 19 232 93 1257 ...
names(murder)
## [1] "state" "abb" "region" "population" "total"
levels(murder$region)
## [1] "Northeast" "South" "North Central" "West"
nlevels(murder$region)
## [1] 4
Read a vector from data frame
mdr <- murder$total
sum(mdr)
## [1] 9403
mean(mdr)
## [1] 184.3725
max(mdr)
## [1] 1257
min(mdr)
## [1] 2
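Beyond sum, mean, max and min, arithmetic on vectors works element by element, which is what makes a data frame column convenient to compute with. The sketch below uses values copied from the first three rows of head(murder); the "rate per 100,000 residents" definition is an assumption chosen for illustration and does not come from the handout:

```r
# Values copied from the first three rows of head(murder)
state      <- c("Alabama", "Alaska", "Arizona")
population <- c(4779736, 710231, 6392017)
total      <- c(135, 19, 232)

# Element-wise division and multiplication: one rate per state
# (rate per 100,000 residents is an assumed definition)
rate_per_100k <- total / population * 100000
names(rate_per_100k) <- state

print(round(rate_per_100k, 2))

# The result is an ordinary vector, so summary functions apply to it too
highest <- names(which.max(rate_per_100k))
print(highest)
```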
dplyr functions
Sorting data
Simple R
rway <- murder[order(murder$total),]
head(rway)
## state abb region population total
## 46 Vermont VT Northeast 625741 2
## 35 North Dakota ND North Central 672591 4
## 30 New Hampshire NH Northeast 1316470 5
## 51 Wyoming WY West 563626 5
## 12 Hawaii HI West 1360301 7
## 42 South Dakota SD North Central 814180 8
dplyr
dpway <- arrange(murder, total)
head(dpway)
## state abb region population total
## 1 Vermont VT Northeast 625741 2
## 2 North Dakota ND North Central 672591 4
## 3 New Hampshire NH Northeast 1316470 5
## 4 Wyoming WY West 563626 5
## 5 Hawaii HI West 1360301 7
## 6 South Dakota SD North Central 814180 8
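Both calls above sort in ascending order. A descending sort could look like the sketch below, which assumes dplyr is installed and uses a toy data frame (murder_toy, a hypothetical name) built from the six rows printed by head(murder):

```r
library(dplyr)

# Toy subset: state and total from the six rows shown by head(murder)
murder_toy <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  total = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

# Base R: order() sorts ascending by default; decreasing = TRUE reverses it
rway_desc <- murder_toy[order(murder_toy$total, decreasing = TRUE), ]

# dplyr: wrap the column in desc()
dpway_desc <- arrange(murder_toy, desc(total))

print(head(rway_desc, 3))
print(head(dpway_desc, 3))
```

As in the ascending example, the base R result keeps the original row names while arrange() renumbers the rows.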
Selecting a column
Simple R
rway <- murder[,"state"]
head(rway)
## [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
## [6] "Colorado"
class(rway)
## [1] "character"
rway <- murder[,c("state","total")]
head(rway)
## state total
## 1 Alabama 135
## 2 Alaska 19
## 3 Arizona 232
## 4 Arkansas 93
## 5 California 1257
## 6 Colorado 65
dplyr
dpway <- select(murder,state)
head(dpway)
## state
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
class(dpway)
## [1] "data.frame"
dpway <- select(murder,state,total)
head(dpway)
## state total
## 1 Alabama 135
## 2 Alaska 19
## 3 Arizona 232
## 4 Arkansas 93
## 5 California 1257
## 6 Colorado 65
Filtering rows
Simple R
rway <- murder[murder$state=='California',]
head(rway)
## state abb region population total
## 5 California CA West 37253956 1257
dplyr
dpway <- filter(murder,state=='California')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
dpway <- filter(murder,state=='California' & abb=='CA')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
dpway <- filter(murder,state=='California', abb=='CA')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
dpway <- filter(murder,state=='California' | abb=='WI')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
## 2 Wisconsin WI North Central 5686986 97
dpway <- filter(murder,abb %in% c('CA','WI','NY'))
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
## 2 New York NY Northeast 19378102 517
## 3 Wisconsin WI North Central 5686986 97
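The handout shows only one base R filter, while the dplyr examples also cover AND, OR and %in%. The base R counterparts can be sketched as follows, assuming a toy data frame (murder_toy, a hypothetical name) built from rows that appear in the outputs above:

```r
# Toy subset built from rows printed elsewhere in this handout
murder_toy <- data.frame(
  state = c("Alabama", "California", "New York", "Wisconsin"),
  abb   = c("AL", "CA", "NY", "WI"),
  total = c(135, 1257, 517, 97),
  stringsAsFactors = FALSE
)

# AND: both conditions must hold
r_and <- murder_toy[murder_toy$state == "California" & murder_toy$abb == "CA", ]

# OR: either condition may hold
r_or <- murder_toy[murder_toy$state == "California" | murder_toy$abb == "WI", ]

# Membership in a set of values
r_in <- murder_toy[murder_toy$abb %in% c("CA", "WI", "NY"), ]

# subset() evaluates the condition inside the data frame, much like filter()
r_subset <- subset(murder_toy, abb %in% c("CA", "WI", "NY"))

print(r_and)
print(r_or)
print(r_in)
print(r_subset)
```

In the bracket form the trailing comma is required; without it R would treat the logical vector as a column selector.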
Creating a new variable
Simple R
murder$newpop <- murder$population / 1000
head(murder)
## state abb region population total newpop
## 1 Alabama AL South 4779736 135 4779.736
## 2 Alaska AK West 710231 19 710.231
## 3 Arizona AZ West 6392017 232 6392.017
## 4 Arkansas AR South 2915918 93 2915.918
## 5 California CA West 37253956 1257 37253.956
## 6 Colorado CO West 5029196 65 5029.196
dplyr
dpway <- mutate(murder,newpop=population/1000)
head(dpway)
## state abb region population total newpop
## 1 Alabama AL South 4779736 135 4779.736
## 2 Alaska AK West 710231 19 710.231
## 3 Arizona AZ West 6392017 232 6392.017
## 4 Arkansas AR South 2915918 93 2915.918
## 5 California CA West 37253956 1257 37253.956
## 6 Colorado CO West 5029196 65 5029.196
summarise: Reduce variables to values
• Primarily useful with data that has been grouped by one or more variables
• group_by creates the groups that will be operated on
• summarise uses the provided aggregation function to summarise each group
dplyr way - summarize
summarise(murder,summurder=sum(total,na.rm=TRUE))
## summurder
## 1 9403
summarise(murder,avgmurder=mean(total,na.rm=TRUE))
## avgmurder
## 1 184.3725
summarise(murder,countrows=n())
## countrows
## 1 51
summarise(murder, summurder=sum(total,na.rm=TRUE),
          avgmurder=mean(total,na.rm=TRUE), countrows=n())
## summurder avgmurder countrows
## 1 9403 184.3725 51
dplyr way - group by
m1 <- group_by(murder,region)
ab <- summarise(m1, md = sum(total, na.rm=TRUE),
                pop = mean(population, na.rm=TRUE),
                cn = n())
ab <- data.frame(ab)
ab
## region md pop cn
## 1 Northeast 1469 6146360 9
## 2 South 4195 6804378 17
## 3 North Central 1828 5577250 12
## 4 West 1911 5534273 13
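The earlier sections pair each dplyr verb with a "Simple R" version, but grouped summaries are shown with dplyr only. A base R counterpart might look like the sketch below. It assumes a toy data frame (murder_toy, a hypothetical name) built from the six rows printed by head(murder), so the figures differ from the full-data output above:

```r
# Toy subset: region, population and total from the six rows of head(murder)
murder_toy <- data.frame(
  region     = c("South", "West", "West", "South", "West", "West"),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196),
  total      = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

# tapply() applies a function to one vector, split by the groups of another
md  <- tapply(murder_toy$total, murder_toy$region, sum)
pop <- tapply(murder_toy$population, murder_toy$region, mean)

# table() counts the rows in each group
cn <- table(murder_toy$region)

# Assemble the pieces into one data frame with the same columns as `ab`
ab_base <- data.frame(
  region = names(md),
  md     = as.vector(md),
  pop    = as.vector(pop),
  cn     = as.vector(cn),
  stringsAsFactors = FALSE
)
print(ab_base)
```

Each statistic needs its own call here, whereas summarise() computes all three in one step.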
Chaining Method
ab <- murder %>%
  group_by(region) %>%
  summarise(md = sum(total, na.rm=TRUE),
            pop = sum(population, na.rm=TRUE),
            cn = n())
ab <- data.frame(ab)
ab
## region md pop cn
## 1 Northeast 1469 55317240 9
## 2 South 4195 115674434 17
## 3 North Central 1828 66927001 12
## 4 West 1911 71945553 13
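Chaining is not limited to group_by and summarise; any of the verbs can be piped in sequence. The sketch below assumes dplyr is installed and uses a toy data frame (murder_toy, a hypothetical name) built from the six rows printed by head(murder). The rate column and its per-100,000 scaling are assumptions made for illustration:

```r
library(dplyr)

# Toy subset: the six rows shown by head(murder)
murder_toy <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  region     = c("South", "West", "West", "South", "West", "West"),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196),
  total      = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

# Each step receives the result of the previous one as its first argument
west_rates <- murder_toy %>%
  filter(region == "West") %>%                      # keep the West rows only
  mutate(rate = total / population * 100000) %>%    # assumed rate definition
  arrange(desc(rate)) %>%                           # highest rate first
  select(state, rate)                               # keep two columns

print(west_rates)
```

Read top to bottom, the chain states the steps in the order they happen, which the nested form select(arrange(mutate(filter(...)))) does not.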
Exercises
Exercise 1
Do the following for the murders dataset:
i. Get the murders dataset (as was done in class).
ii. Do a basic exploration of the data (number of rows, number of columns, structure, and names of the data).
iii. Which three states have the highest population?
iv. How many states have a population above the average?
v. What is the total population of the US (as an actual number and in millions)?
vi. What is the total number of murders across the US?
vii. What is the average number of murders?
viii. What is the total number of murders in the South region?
ix. How many states are there in each region?
x. What is the murder rate in each region?
xi. Which is the most dangerous state?
Exercise 2
Do the following for the mtcars dataset:
i. Get the mtcars dataset.
ii. Do a basic exploration of the data (number of rows, number of columns, structure, and names of the data).
iii. How many different types of gears are there?
iv. Which type of transmission is more common: automatic or manual?
v. What is the average hp by number of cylinders?
vi. What is the average hp by number of gears?
vii. Does mpg depend on the number of gears?
viii. Does the weight of a car depend on the number of cylinders?
