Group Cases
group_by(.data, ..., add = FALSE)
Returns copy of table grouped by …
g_iris <- group_by(iris, Species)
ungroup(x, ...)
Returns ungrouped copy of table.
ungroup(g_iris)
Use group_by() to create a "grouped" copy of a table. dplyr
functions will manipulate each "group" separately and then
combine the results.
mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(mpg))
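The grouped pipeline above can be extended into a small sketch. The names by_cyl, cars, overall and avg_sepal are illustrative and are not part of the cheat sheet.

```r
library(dplyr)

# Group mtcars by number of cylinders, then summarise each group separately.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg), cars = n())

# by_cyl has one row per distinct value of cyl.
print(by_cyl)

# ungroup() drops the grouping, so later verbs act on the whole table again.
g_iris <- group_by(iris, Species)
overall <- g_iris %>%
  ungroup() %>%
  summarise(avg_sepal = mean(Sepal.Length))
print(overall)
```

Without the ungroup() call, the second summarise() would return one row per species instead of a single overall row.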
Summarise Cases
These apply summary functions to
columns to create a new table.
Summary functions take vectors as
input and return one value (see back).
Variations
• summarise_all() - Apply funs to every column.
• summarise_at() - Apply funs to specific columns.
• summarise_if() - Apply funs to all cols of one type.
summarise(.data, …)
Compute table of summaries. Also
summarise_().
summarise(mtcars, avg = mean(mpg))
count(x, ..., wt = NULL, sort = FALSE)
Count number of rows in each group defined
by the variables in … Also tally().
count(iris, Species)
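As a rough illustration of how count() relates to the grouping verbs, the sketch below builds the same table two ways. The variable names counted, by_hand and by_cyl are assumptions made for this example.

```r
library(dplyr)

# count() is shorthand for group_by() followed by summarise(n = n()).
counted <- count(iris, Species)
by_hand <- iris %>%
  group_by(Species) %>%
  summarise(n = n())

# sort = TRUE orders the result from the largest group to the smallest.
by_cyl <- count(mtcars, cyl, sort = TRUE)
print(by_cyl)
```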
Manipulate Cases
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.5.0 • tibble 1.2.0 • Updated: 2017-01
Extract Cases
Add Cases
Arrange Cases
filter(.data, …)
Extract rows that meet logical criteria. Also filter_().
filter(iris, Sepal.Length > 7)
distinct(.data, ..., .keep_all = FALSE)
Remove rows with duplicate values. Also distinct_().
distinct(iris, Species)
sample_frac(tbl, size = 1, replace = FALSE,
weight = NULL, .env = parent.frame())
Randomly select fraction of rows.
sample_frac(iris, 0.5, replace = TRUE)
sample_n(tbl, size, replace = FALSE,
weight = NULL, .env = parent.frame())
Randomly select size rows.
sample_n(iris, 10, replace = TRUE)
slice(.data, …)
Select rows by position. Also slice_().
slice(iris, 10:15)
top_n(x, n, wt)
Select the top n entries ranked by wt (by group if
grouped data). Rows keep their original order, and ties can
return more than n rows.
top_n(iris, 5, Sepal.Width)
Row functions return a subset of rows as a new table. Use a variant
that ends in _ (the standard-evaluation version) when programming with dplyr.
Logical and boolean operators to use with filter()
See ?base::Logic and ?Comparison for help.
> >= !is.na() ! &
< <= is.na() %in% | xor()
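A minimal sketch of these operators inside filter(). The airquality data set (from base R, with missing values in Ozone) and all variable names are choices made for this example, not part of the cheat sheet.

```r
library(dplyr)

# Combine conditions with & (and), | (or) and ! (not).
big_setosa <- filter(iris, Species == "setosa" & Sepal.Length > 5.5)

# %in% tests membership in a set of values.
two_species <- filter(iris, Species %in% c("setosa", "virginica"))

# !is.na() keeps only the rows where a value is present.
has_ozone <- filter(airquality, !is.na(Ozone))

# xor() is TRUE when exactly one of the two conditions holds.
one_long <- filter(iris, xor(Sepal.Length > 7, Petal.Length > 6))
```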
arrange(.data, ...)
Order rows by values of a column (low to high);
use with desc() to order from high to low.
arrange(mtcars, mpg)
arrange(mtcars, desc(mpg))
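arrange() also accepts several columns, with later columns breaking ties in earlier ones. The sketch below combines it with slice(); the names sorted and top3 are illustrative.

```r
library(dplyr)

# Sort by cyl ascending, breaking ties by mpg descending.
sorted <- arrange(mtcars, cyl, desc(mpg))

# slice() then picks rows by position, here the first three rows.
top3 <- slice(sorted, 1:3)
print(top3)
```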
add_row(.data, ..., .before = NULL,
.after = NULL)
Add one or more rows to a table.
add_row(faithful, eruptions = 1, waiting = 1)
Manipulate Variables
Extract Variables
Make New Variables
Column functions return a set of columns as a new table. Use a
variant that ends in _ (the standard-evaluation version) when programming with dplyr.
These apply vectorized functions to
columns. Vectorized funs take vectors
as input and return vectors of the
same length as output (see back).
Data Transformation
with dplyr Cheat Sheet
mutate(.data, …)
Compute new column(s).
mutate(mtcars, gpm = 1/mpg)
transmute(.data, …)
Compute new column(s), drop others.
transmute(mtcars, gpm = 1/mpg)
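The difference between mutate() and transmute() is easiest to see side by side. This is a small sketch; the column gp100m and the variable names are made up for the example.

```r
library(dplyr)

# mutate() keeps every existing column and appends the new ones.
with_gpm <- mutate(mtcars, gpm = 1 / mpg)

# transmute() keeps only the columns it creates.
only_gpm <- transmute(mtcars, gpm = 1 / mpg)

# A new column can be used later in the same mutate() call.
chained <- mutate(mtcars, gpm = 1 / mpg, gp100m = 100 * gpm)
```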
mutate_all(.tbl, .funs, ...)
Apply funs to every column. Use with funs().
mutate_all(faithful, funs(log(.), log2(.)))
mutate_at(.tbl, .cols, .funs, ...)
Apply funs to specific columns. Use with
funs(), vars() and the helper functions for
select().
mutate_at(iris, vars(-Species), funs(log(.)))
mutate_if(.tbl, .predicate, .funs, ...)
Apply funs to all columns of one type. Use
with funs().
mutate_if(iris, is.numeric, funs(log(.)))
add_column(.data, ..., .before =
NULL, .after = NULL)
Add new column(s).
add_column(mtcars, new = 1:32)
rename(.data, …)
Rename columns.
rename(iris, Length = Sepal.Length)
Use these helpers with select(),
e.g. select(iris, starts_with("Sepal"))
contains(match)
ends_with(match)
matches(match)
num_range(prefix, range)
one_of(…)
starts_with(match)
:, e.g. mpg:cyl
-, e.g. -Species
dplyr functions work with pipes and expect tidy data. In tidy data:
Each variable is in its own column
&
Each observation, or case, is in its own row
Pipes: x %>% f(y) becomes f(x, y)
select(.data, …)
Extract columns by name. Also select_if().
select(iris, Sepal.Length, Species)
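A short sketch of the select() helpers on the built-in iris and mtcars tables. The variable names on the left are illustrative.

```r
library(dplyr)

sepal_cols <- select(iris, starts_with("Sepal"))  # Sepal.Length, Sepal.Width
width_cols <- select(iris, ends_with("Width"))    # Sepal.Width, Petal.Width
petal_cols <- select(iris, matches("^Petal"))     # names matching a regular expression
no_species <- select(iris, -Species)              # every column except Species
range_cols <- select(mtcars, mpg:disp)            # mpg, cyl, disp
```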
Summary Functions (to use with summarise())
Counts
dplyr::n() - number of values/rows
dplyr::n_distinct() - # of uniques
sum(!is.na()) - # of non-NA’s
Location
mean() - mean, also mean(!is.na())
median() - median
Logicals
mean() - Proportion of TRUE’s
sum() - # of TRUE’s
Position/Order
dplyr::first() - first value
dplyr::last() - last value
dplyr::nth() - value in nth location of vector
Rank
quantile() - nth quantile
min() - minimum value
max() - maximum value
Spread
IQR() - Inter-Quartile Range
mad() - median absolute deviation
sd() - standard deviation
var() - variance
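Several of these summary functions can be combined in one summarise() call. The sketch below is only an illustration; the output column names are invented for the example.

```r
library(dplyr)

stats <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    cars       = n(),               # rows in the group
    gears      = n_distinct(gear),  # distinct gear values
    median_mpg = median(mpg),
    share_fast = mean(hp > 150),    # proportion of TRUEs
    first_mpg  = first(mpg),        # first value in the group's row order
    spread_mpg = IQR(mpg)
  )
print(stats)
```

Each function collapses a column to a single value per group, so the result has one row per value of cyl.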
Vectorized Functions (to use with mutate())
Offsets
dplyr::lag() - Offset elements by 1
dplyr::lead() - Offset elements by -1
Cumulative Aggregates
dplyr::cumall() - Cumulative all()
dplyr::cumany() - Cumulative any()
cummax() - Cumulative max()
dplyr::cummean() - Cumulative mean()
cummin() - Cumulative min()
cumprod() - Cumulative prod()
cumsum() - Cumulative sum()
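The offset and cumulative functions are typically used inside mutate(). A minimal sketch on a made-up table; sales and its columns are hypothetical data for the example.

```r
library(dplyr)

sales <- data.frame(day = 1:5, units = c(3, 5, 2, 8, 4))

out <- mutate(
  sales,
  previous  = lag(units),          # value from the row before; NA in the first row
  following = lead(units),         # value from the row after; NA in the last row
  change    = units - lag(units),  # difference from the previous row
  running   = cumsum(units),       # running total
  peak      = cummax(units)        # largest value seen so far
)
print(out)
```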
Rankings
dplyr::cume_dist() - Proportion of all values <=
dplyr::dense_rank() - rank with ties = min, no gaps
dplyr::min_rank() - rank with ties = min
dplyr::ntile() - bins into n bins
dplyr::percent_rank() - min_rank scaled to [0,1]
dplyr::row_number() - rank with ties = "first"
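The ranking functions differ mainly in how they treat ties. The sketch below applies them to a small vector with one tie; scores is an invented example vector.

```r
library(dplyr)

scores <- c(10, 20, 20, 30)

row_number(scores)    # 1 2 3 4: ties broken by position
min_rank(scores)      # 1 2 2 4: ties share the lowest rank, leaving a gap
dense_rank(scores)    # 1 2 2 3: ties share a rank, no gap
percent_rank(scores)  # 0, 1/3, 1/3, 1: (min_rank - 1) / (n - 1)
cume_dist(scores)     # 0.25 0.75 0.75 1: proportion of values <= each value
```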
Math
+, - , *, /, ^, %/%, %% - arithmetic ops
log(), log2(), log10() - logs
<, <=, >, >=, !=, == - logical comparisons
Misc
dplyr::between() - x >= left & x <= right
dplyr::case_when() - multi-case if_else()
dplyr::coalesce() - first non-NA values by element across a set of vectors
dplyr::if_else() - element-wise if() + else()
dplyr::na_if() - replace specific values with NA
pmax() - element-wise max()
pmin() - element-wise min()
dplyr::recode() - Vectorized switch()
dplyr::recode_factor() - Vectorized switch() for factors
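A brief sketch of a few of the miscellaneous helpers on a vector that contains an NA. The vector x and the labels used in case_when() are made up for the example.

```r
library(dplyr)

x <- c(-5, 0, 12, NA, 40)

in_range <- between(x, 0, 20)             # FALSE TRUE TRUE NA FALSE
level    <- if_else(x > 10, "high", "low") # element-wise; NA stays NA
filled   <- coalesce(x, 0)                # replaces NA with 0
blanked  <- na_if(x, 0)                   # turns 0 into NA
capped   <- pmin(x, 10)                   # caps each element at 10

# case_when() returns the value for the first condition that is TRUE.
size <- case_when(
  is.na(x) ~ "missing",
  x < 0    ~ "negative",
  x <= 20  ~ "small",
  TRUE     ~ "large"
)
```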
Combine Tables
Combine Variables
bind_cols(…)
Returns tables placed side by
side as a single table.
BE SURE THAT ROWS ALIGN.
left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …)
Join matching values from y to x.
right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …)
Join matching values from x to y.
inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …)
Join data. Retain only rows with matches.
full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …)
Join data. Retain all values, all rows.
Use a "Mutating Join" to join one table to columns
from another, matching values with the rows that
they correspond to. Each join retains a different
combination of values from the tables.
Use by = c("col1", "col2") to
specify the column(s) to match
on.
left_join(x, y, by = "A")
Use a named vector, by =
c("col1" = "col2"), to match on
columns with different names in
each data set.
left_join(x, y, by = c("C" = "D"))
Use suffix to specify the suffixes to give
to duplicate column names.
left_join(x, y, by = c("C" = "D"),
suffix = c("1", "2"))
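The four mutating joins can be compared on two small tables. This sketch rebuilds the cheat sheet's example tables x and y as data frames; the result names (left, right, inner, full, suffixed) are illustrative.

```r
library(dplyr)

# The small example tables x and y used on the cheat sheet.
x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3,
                stringsAsFactors = FALSE)
y <- data.frame(A = c("a", "b", "d"), B = c("t", "u", "w"), D = 3:1,
                stringsAsFactors = FALSE)

left  <- left_join(x, y, by = c("A", "B"))   # all rows of x; D is NA for "c"
right <- right_join(x, y, by = c("A", "B"))  # all rows of y; C is NA for "d"
inner <- inner_join(x, y, by = c("A", "B"))  # only "a" and "b"
full  <- full_join(x, y, by = c("A", "B"))   # "a", "b", "c" and "d"

# Joining on A alone leaves B in both tables, so it receives the suffixes.
suffixed <- left_join(x, y, by = "A", suffix = c("1", "2"))
names(suffixed)  # "A" "B1" "C" "B2" "D"
```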
Use bind_cols() to paste tables beside each other
as they are.
x         y
A B C     A B D
a t 1     a t 3
b u 2     b u 2
c v 3     d w 1
Combine Cases
Use bind_rows() to paste tables below each other as
they are.
bind_rows(…, .id = NULL)
Returns tables one on top of the other
as a single table. Set .id to a column
name to add a column of the original
table names (as pictured)
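A minimal sketch of bind_rows() with .id, rebuilding the cheat sheet's tables x and z as data frames. The column name "DF" follows the pictured output; the variable name stacked is illustrative.

```r
library(dplyr)

x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3,
                stringsAsFactors = FALSE)
z <- data.frame(A = c("c", "d"), B = c("v", "w"), C = 3:4,
                stringsAsFactors = FALSE)

# Naming the inputs lets .id record which table each row came from.
stacked <- bind_rows(x = x, z = z, .id = "DF")
print(stacked)
```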
intersect(x, y, …)
Rows that appear in both x and y.
setdiff(x, y, …)
Rows that appear in x but not y.
union(x, y, …)
Rows that appear in x or y. (Duplicates
removed). union_all() retains
duplicates.
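The set operations can be sketched with the cheat sheet's tables x and z passed as the two inputs. The result names are illustrative.

```r
library(dplyr)

x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3,
                stringsAsFactors = FALSE)
z <- data.frame(A = c("c", "d"), B = c("v", "w"), C = 3:4,
                stringsAsFactors = FALSE)

both     <- intersect(x, z)   # the shared row: c v 3
only_x   <- setdiff(x, z)     # rows a t 1 and b u 2
either   <- union(x, z)       # four distinct rows
all_rows <- union_all(x, z)   # five rows; c v 3 appears twice

# setequal() ignores row order when comparing two tables.
same <- setequal(x, x[3:1, ])
```

These functions compare whole rows, so both tables need the same columns.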
Extract Rows
Use a "Filtering Join" to filter one table against the
rows of another.
semi_join(x, y, by = NULL, …)
Return rows of x that have a match in y.
USEFUL TO SEE WHAT WILL BE JOINED.
anti_join(x, y, by = NULL, …)
Return rows of x that do not have a
match in y. USEFUL TO SEE WHAT WILL
NOT BE JOINED.
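A short sketch of the two filtering joins, again using the cheat sheet's tables x and y rebuilt as data frames. The names matched and unmatched are illustrative.

```r
library(dplyr)

x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3,
                stringsAsFactors = FALSE)
y <- data.frame(A = c("a", "b", "d"), B = c("t", "u", "w"), D = 3:1,
                stringsAsFactors = FALSE)

# semi_join() keeps the rows of x with a match in y; no columns of y are added.
matched <- semi_join(x, y, by = c("A", "B"))

# anti_join() keeps the rows of x without a match in y.
unmatched <- anti_join(x, y, by = c("A", "B"))
```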
x         z
A B C     A B C
a t 1     c v 3
b u 2     d w 4
c v 3
Use setequal() to test whether two data sets contain
the exact same rows (in any order).
mutate() and transmute() apply vectorized
functions to columns to create new columns.
Vectorized functions take vectors as input and
return vectors of the same length as output.
summarise() applies summary functions to
columns to create a new table. Summary
functions take vectors as input and return single
values as output.
Row names
Tidy data does not use rownames, which store
a variable outside of the columns. To work with
the rownames, first move them into a column.
rownames_to_column()
Move row names into a column.
a <- rownames_to_column(iris,
var = "C")
column_to_rownames()
Move a column into row names.
column_to_rownames(a,
var = "C")
Also has_rownames(), remove_rownames()
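The row-name helpers come from the tibble package. This sketch uses mtcars, which stores the car model in its row names, instead of the iris table shown above; the names cars, back and the column name "model" are choices made for the example.

```r
library(tibble)

# Move the row names of mtcars into a regular column called "model".
cars <- rownames_to_column(mtcars, var = "model")

has_rownames(mtcars)  # TRUE
has_rownames(cars)    # FALSE: the names now live in a column

# Reverse the step: turn the "model" column back into row names.
back <- column_to_rownames(cars, var = "model")
```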