Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
Outline (개요)
• Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
• Variation (다양성, 분산): Univariate distributions (일변량 분포)
  • Categorical (범주형) variable (변수)
  • Continuous (연속형) variable (변수)
• Covariation (공분산): Bivariate distributions (이변량 분포)
  • Continuous (연속형) & Categorical (범주형)
  • Categorical (범주형) & Categorical (범주형)
  • Continuous (연속형) & Continuous (연속형)
Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
• Iterative cycles (반복 순환) of Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
  1. Generate questions (질문) about your data
     1) What type of variation (다양성, 분산) occurs within my variables?
     2) What type of covariation (공분산) occurs between my variables?
  2. Search for answers (답): Transform (변환하다) & Visualize (시각화하다) & Model (모형을 만들다)
  3. Use what you learn to refine (개선하다) your questions and/or generate (생성하다) new questions.
Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
1. Generate questions (질문) about your data
   1) What type of variation (다양성, 분산) occurs within my variables?
      • Univariate distributions (일변량 분포)
   2) What type of covariation (공분산) occurs between my variables?
      • Bivariate distributions (이변량 분포)
Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
Variation (다양성, 분산):
Univariate distributions (일변량 분포)
diamonds data
library(tidyverse)  # loads ggplot2 (which provides diamonds), dplyr, and tidyr
diamonds
# # A tibble: 53,940 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
# 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
# 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
# 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
# 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
# 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
# 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
# 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# # ... with 53,930 more rows
help(diamonds)
diamonds {ggplot2} R Documentation
Prices of over 50,000 round cut diamonds
Description
A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
Usage
diamonds
Format
A data frame with 53940 rows and 10 variables:
• price: price in US dollars ($326–$18,823)
• carat: weight of the diamond (0.2–5.01)
• cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
• color: diamond colour, from J (worst) to D (best)
• clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
• x: length in mm (0–10.74)
• y: width in mm (0–58.9)
• z: depth in mm (0–31.8)
• depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
• table: width of top of diamond relative to widest point (43–95)
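The depth formula in the help page can be checked by hand. The sketch below uses the x, y, and z values of the first row of the printed tibble; the variable names are illustrative. It uses the `2 * z / (x + y)` form because, written literally in R, `mean(x, y)` does not average two numbers: the second argument of `mean()` is `trim`.

```r
# Dimensions (mm) of the first diamond in the printed tibble
x <- 3.95
y <- 3.98
z <- 2.43

# Total depth percentage: z divided by the average of x and y, times 100
depth_pct <- 100 * 2 * z / (x + y)
round(depth_pct, 2)

# mean(x, y) is NOT the average of x and y: y is matched to `trim`
mean(x, y)      # returns x unchanged
mean(c(x, y))   # the intended average
```

The computed value is close to, but not identical to, the recorded depth of 61.5 for that row, which may reflect rounding in the recorded dimensions.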
Categorical (범주형) variable (변수) vs.
Continuous (연속형) variable (변수)
diamonds %>%
  count(cut)
#> # A tibble: 5 x 2
#> cut n
#> <ord> <int>
#> 1 Fair 1610
#> 2 Good 4906
#> 3 Very Good 12082
#> 4 Premium 13791
#> 5 Ideal 21551
diamonds %>%
  count(cut_width(carat, 0.5))
#> # A tibble: 11 x 2
#> `cut_width(carat, 0.5)` n
#> <fct> <int>
#> 1 [-0.25,0.25] 785
#> 2 (0.25,0.75] 29498
#> 3 (0.75,1.25] 15977
#> 4 (1.25,1.75] 5313
#> 5 (1.75,2.25] 2002
#> 6 (2.25,2.75] 322
#> # ... with 5 more rows
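To see how `cut_width()` turns a continuous variable into bins that can be counted, here is a minimal sketch on a few made-up carat values (the vector is an assumption for illustration, not taken from the data set). With the default settings the bin edges fall at 0.25, 0.75, 1.25, and so on, which is why the first bin in the output above starts at -0.25.

```r
library(ggplot2)  # provides cut_width()

# A few illustrative carat values (not taken from the data set)
carat <- c(0.20, 0.30, 0.70, 1.00)

# Bins of width 0.5; each value is assigned to the interval that contains it
bins <- cut_width(carat, width = 0.5)

# Tabulating the resulting factor gives the same kind of result as count()
table(bins)
```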
Visualizing univariate distributions (일변량 분포의 시각화)
Categorical (범주형) variable (변수)
geom_bar
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
Visualizing univariate distributions (일변량 분포의 시각화)
Continuous (연속형) variable (변수)
geom_histogram
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
Visualizing univariate distributions (일변량 분포의 시각화)
Continuous (연속형) variable (변수)
geom_histogram
ggplot(data = diamonds %>% filter(carat < 3)) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
Visualizing univariate distributions (일변량 분포의 시각화)
Continuous (연속형) variable (변수)
geom_freqpoly
ggplot(data = diamonds %>% filter(carat < 3)) +
  geom_freqpoly(mapping = aes(x = carat), binwidth = 0.1)
[Figure: frequency polygon of carat for diamonds under 3 carats; x-axis: carat (0 to 3), y-axis: count (0 to 10,000)]
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_freqpoly & color
ggplot(data = diamonds %>% filter(carat < 3)) +
  geom_freqpoly(mapping = aes(x = carat, color = cut), binwidth = 0.1)
Visualizing univariate distributions (일변량 분포의 시각화)
• Questions to ask
• Which values are the most common? Why?
• Which values are rare? Why? Does that match your expectations?
• Can you see any unusual patterns? What might explain them?
Why are there more diamonds at whole carats and common fractions of carats?
Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
Why are there no diamonds bigger than 3 carats?
ggplot(data = diamonds %>% filter(carat < 3)) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.01)
Visualizing univariate distributions (일변량 분포의 시각화)
Unusual values (outliers, 특이값, 드문 값)
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)
Visualizing univariate distributions (일변량 분포의 시각화)
Unusual values (outliers, 특이값, 드문 값)
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
Unusual values (outliers, 특이값, 드문 값)
unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)
unusual
#> # A tibble: 9 x 4
#> price x y z
#> <int> <dbl> <dbl> <dbl>
#> 1 5139 0 0 0
#> 2 6381 0 0 0
#> 3 12800 0 0 0
#> 4 15686 0 0 0
#> 5 18034 0 0 0
#> 6 2130 0 0 0
#> 7 2130 0 0 0
#> 8 2075 5.15 31.8 5.12
#> 9 12210 8.09 58.9 8.06
#@ replacing the unusual values with missing values
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
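The `ifelse()` call inside `mutate()` works element by element. A minimal sketch on a made-up vector (the values are assumptions chosen to include two suspicious widths) shows what the replacement does and how the resulting missing values can be handled afterwards.

```r
# Illustrative y values: two plausible widths and two suspicious ones
y <- c(0, 3.98, 58.9, 4.05)

# Values below 3 or above 20 become NA; everything else is kept
y_clean <- ifelse(y < 3 | y > 20, NA, y)
y_clean

# Missing values can then be counted or skipped explicitly
sum(is.na(y_clean))
mean(y_clean, na.rm = TRUE)
```

Replacing the values with `NA` rather than dropping the rows keeps the other measurements of those diamonds available for analysis.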
Visualizing univariate distributions (일변량 분포의 시각화)
Missing values vs. non-missing values
nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot(mapping = aes(sched_dep_time)) +
  geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
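The `mutate()` step converts departure times stored as HHMM integers into decimal hours, so that a bin width of 1/4 corresponds to 15 minutes. A small sketch of the arithmetic, using hypothetical times rather than rows from `flights`:

```r
# Hypothetical scheduled departure times in HHMM form
sched_dep_time <- c(515, 1545, 2359)

sched_hour <- sched_dep_time %/% 100   # integer division: 5, 15, 23
sched_min  <- sched_dep_time %% 100    # remainder: 15, 45, 59

# Decimal hours, e.g. 15:45 becomes 15.75
decimal_hours <- sched_hour + sched_min / 60
decimal_hours
```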
More on visualizing univariate distribution
Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
Covariation (공분산):
Bivariate distributions (이변량 분포)
Continuous (연속형) & Categorical (범주형)
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_freqpoly & color
ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
Visualizing univariate distributions (일변량 분포의 시각화)
Categorical (범주형) variable (변수)
geom_bar
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_freqpoly & color
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
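Mapping `y` to density rescales each group so that the area under its polygon is one, which makes the shapes comparable even when the groups differ greatly in size. The sketch below illustrates the idea with base R `hist()` on simulated prices (the data and group sizes are assumptions); it is not how ggplot2 computes the statistic internally. In recent ggplot2 versions, `after_stat(density)` is the preferred spelling of `..density..`.

```r
set.seed(1)
# Simulated prices for a small and a large group (illustrative only)
small <- rexp(100,  rate = 1 / 4000)
large <- rexp(5000, rate = 1 / 4000)

# Common bins of width 500 that cover every simulated value
binwidth <- 500
breaks <- seq(0, binwidth * ceiling(max(small, large) / binwidth), by = binwidth)
h_small <- hist(small, breaks = breaks, plot = FALSE)
h_large <- hist(large, breaks = breaks, plot = FALSE)

# The raw counts differ by a factor of 50 ...
sum(h_small$counts)
sum(h_large$counts)

# ... but density = count / (n * binwidth), so each group integrates to 1
sum(h_small$density * binwidth)
sum(h_large$density * binwidth)
```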
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
boxplot
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_boxplot
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_boxplot
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_boxplot
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_boxplot
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()
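`reorder(class, hwy, FUN = median)` returns a factor whose levels are sorted by the median of `hwy` within each class, which is what puts the boxplots in increasing order. A minimal sketch on toy data (the classes and mileage values are made up for illustration):

```r
# Toy data: three classes with two highway-mileage values each (illustrative)
car_class <- c("suv", "suv", "compact", "compact", "pickup", "pickup")
hwy       <- c(20, 22, 30, 32, 16, 18)

# Levels are ordered by median(hwy) per class: pickup (17), suv (21), compact (31)
class_by_hwy <- reorder(car_class, hwy, FUN = median)
levels(class_by_hwy)
```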
Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
Covariation (공분산):
Bivariate distributions (이변량 분포)
Categorical (범주형) & Categorical (범주형)
bivariate distributions (이변량 분포)
Categorical (범주형) & Categorical (범주형)
"long format": unique identifier (고유 식별자) = composite key (복합키)
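#@ count in a "long format": the unique identifier is the composite key consisting of color & cut
#@ the two calls below give the same counts (the count column is named `count` in the first and `n` in the second)
diamonds %>%
  group_by(color, cut) %>% summarize(count = n()) %>% print(n = Inf)
diamonds %>%
  count(color, cut) %>% print(n = Inf)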
# # A tibble: 35 x 3
# color cut n
# <ord> <ord> <int>
# 1 D Fair 163
# 2 D Good 662
# 3 D Very Good 1513
# 4 D Premium 1603
# 5 D Ideal 2834
# 6 E Fair 224
# 7 E Good 933
# 8 E Very Good 2400
# 9 E Premium 2337
# 10 E Ideal 3903
# 11 F Fair 312
# 12 F Good 909
# 13 F Very Good 2164
# 14 F Premium 2331
# 15 F Ideal 3826
# 16 G Fair 314
# 17 G Good 871
# 18 G Very Good 2299
# 19 G Premium 2924
# 20 G Ideal 4884
# 21 H Fair 303
# 22 H Good 702
# 23 H Very Good 1824
# 24 H Premium 2360
# 25 H Ideal 3115
# 26 I Fair 175
# 27 I Good 522
# 28 I Very Good 1204
# 29 I Premium 1428
# 30 I Ideal 2093
# 31 J Fair 119
# 32 J Good 307
# 33 J Very Good 678
# 34 J Premium 808
# 35 J Ideal 896
bivariate distributions (이변량 분포)
Categorical (범주형) & Categorical (범주형)
"wide format": one variable -> multiple columns
#@ "wide format": one variable -> multiple columns (spread)
diamonds %>%
  select(color, cut) %>%
  table %>% addmargins
#        cut
# color   Fair  Good Very Good Premium Ideal   Sum
#   D      163   662      1513    1603  2834  6775
#   E      224   933      2400    2337  3903  9797
#   F      312   909      2164    2331  3826  9542
#   G      314   871      2299    2924  4884 11292
#   H      303   702      1824    2360  3115  8304
#   I      175   522      1204    1428  2093  5422
#   J      119   307       678     808   896  2808
#   Sum   1610  4906     12082   13791 21551 53940
#@ spread(): "long format" -> "wide format": one variable -> multiple columns
diamonds %>%
  count(color, cut) %>%
  spread(key = cut, value = n)
# # A tibble: 7 x 6
# color Fair Good `Very Good` Premium Ideal
# <ord> <int> <int> <int> <int> <int>
# 1 D 163 662 1513 1603 2834
# 2 E 224 933 2400 2337 3903
# 3 F 312 909 2164 2331 3826
# 4 G 314 871 2299 2924 4884
# 5 H 303 702 1824 2360 3115
# 6 I 175 522 1204 1428 2093
# 7 J 119 307 678 808 896
Visualizing bivariate distributions (이변량 분포의 시각화)
Categorical (범주형) & Categorical (범주형)
geom_count
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
Visualizing bivariate distributions (이변량 분포의 시각화)
Categorical (범주형) & Categorical (범주형)
geom_tile
diamonds %>% count(color, cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))
bivariate distributions (이변량 분포)
Categorical (범주형) & Categorical (범주형)
"long format": unique identifier (고유 식별자) = composite key (복합키)
#@ gather(): "wide format" -> "long format": multiple columns -> one variable (key)
#@ "wide format": one variable -> multiple columns (spread)
#@ the three calls below are equivalent; they differ only in how the columns to gather are selected
diamonds %>%
  count(color, cut) %>%
  spread(key = cut, value = n) %>%
  gather(key = cut, value = n, Fair, Good, `Very Good`, Premium, Ideal)
diamonds %>%
  count(color, cut) %>%
  spread(key = cut, value = n) %>%
  gather(key = cut, value = n, Fair:Ideal)
diamonds %>%
  count(color, cut) %>%
  spread(key = cut, value = n) %>%
  gather(key = cut, value = n, -color)
# # A tibble: 35 x 3
# color cut n
# <ord> <chr> <int>
# 1 D Fair 163
# 2 E Fair 224
# 3 F Fair 312
# 4 G Fair 314
# 5 H Fair 303
# 6 I Fair 175
# 7 J Fair 119
# 8 D Good 662
# 9 E Good 933
# 10 F Good 909
# # ... with 25 more rows
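`spread()` and `gather()` still work, but tidyr documents them as superseded by `pivot_wider()` and `pivot_longer()`. The sketch below repeats the same round trip with the newer functions on a small hand-made table holding four of the color/cut counts shown earlier; the object names are illustrative.

```r
library(tidyr)

# Four of the color/cut counts, in long format
long <- data.frame(
  color = c("D", "D", "E", "E"),
  cut   = c("Fair", "Good", "Fair", "Good"),
  n     = c(163L, 662L, 224L, 933L)
)

# Long -> wide: one column per cut level (counterpart of spread())
wide <- pivot_wider(long, names_from = cut, values_from = n)
wide

# Wide -> long: every column except color is stacked (counterpart of gather())
long_again <- pivot_longer(wide, cols = -color, names_to = "cut", values_to = "n")
long_again
```

The row order returned by `pivot_longer()` may differ from the `gather()` output shown above, so sort explicitly if the order matters.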
Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
Covariation (공분산):
Bivariate distributions (이변량 분포)
Continuous (연속형) & Continuous (연속형)
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Continuous (연속형)
geom_point
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Continuous (연속형)
geom_point
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Continuous (연속형)
geom_bin2d
ggplot(data = diamonds %>% filter(carat < 3)) +
  geom_bin2d(mapping = aes(x = carat, y = price))
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Continuous (연속형)
geom_hex
# install.packages("hexbin")
ggplot(data = diamonds %>% filter(carat < 3)) +
  geom_hex(mapping = aes(x = carat, y = price))
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Continuous (연속형)
-> Continuous (연속형) & Categorical (범주형)
ggplot(diamonds %>% filter(carat < 3), mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
More on visualizing bivariate distribution
More on visualizing trivariate distribution
REFERENCES
#1. RStudio official documentation (Help & Cheat Sheets)
    Free webpage: https://www.rstudio.com/resources/cheatsheets/
#2. Wickham, H. and Grolemund, G., 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly.
    Free webpage: https://r4ds.had.co.nz/
Cf) Tidyverse syntax (www.tidyverse.org), rather than base R syntax
Cf) Hadley Wickham: Chief Scientist at RStudio. Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University
r for data science 4. exploratory data analysis clean -rev -ref

  • 1. Exploratory Data Analysis (EDA, 탐색적 데이터 분석) Outline (개요) • Exploratory Data Analysis (EDA, 탐색적 데이터 분석) • Variation (다양성, 분산): Univariate distributions (일변량 분포) • Categorical (범주형) variable (변수) • Continuous (연속형) variable (변수) • Covariation (공분산): Bivariate distributions (이변량 분포) • Continuous (연속형) & Categorical (범주형) • Categorical (범주형) & Categorical (범주형) • Continuous (연속형) & Continuous (연속형)
  • 2. Exploratory Data Analysis (EDA, 탐색적 데이터 분석) • Iterative cycles (반복 순환) of Exploratory Data Analysis (EDA, 탐색적 데이터 분석) 1. Generate questions (질문) about your data 1. What type of variation (다양성, 분산) occurs within my variables? 2. What type of covariation (공분산) occurs between my variables? 2. Search for answers (답): Transform (변환하다) & Visualize (시각화하다) & Model(모형을 만들다) 3. Use what you learn to refine (개선하다) your questions and/or generate (생성하다) new questions.
  • 3. Exploratory Data Analysis (EDA, 탐색적 데이터 분석) 1. Generate questions (질문) about your data 1) What type of variation (다양성, 분산) occurs within my variables? • Univariate distributions (일변량 분포) 2) What type of covariation (공분산) occurs between my variables? • Bivariate distributions (이변량 분포)
  • 4. Exploratory Data Analysis (EDA, 탐색적 데이터 분석) Variation (다양성, 분산): Univariate distributions (일변량 분포)
  • 5. diamonds data
      diamonds
      # # A tibble: 53,940 x 10
      #    carat cut       color clarity depth table price     x     y     z
      #    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
      #  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
      #  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
      #  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
      #  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
      #  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
      #  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
      #  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
      #  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
      #  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
      # 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
      # # ... with 53,930 more rows
      help(diamonds)
  • 6. diamonds {ggplot2} R Documentation
      Prices of 50,000 round cut diamonds
      Description
        A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
      Usage
        diamonds
      Format
        A data frame with 53940 rows and 10 variables:
        price:   price in US dollars ($326–$18,823)
        carat:   weight of the diamond (0.2–5.01)
        cut:     quality of the cut (Fair, Good, Very Good, Premium, Ideal)
        color:   diamond colour, from J (worst) to D (best)
        clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
        x:       length in mm (0–10.74)
        y:       width in mm (0–58.9)
        z:       depth in mm (0–31.8)
        depth:   total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
        table:   width of top of diamond relative to widest point (43–95)
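The depth formula in the help page can be checked against the data itself. The following is a minimal sketch, assuming the ggplot2 and dplyr packages are installed. The names `depth_check` and `depth_calc` are illustrative and not part of the original slides, and the factor of 100 assumes that `depth` is stored as a percentage (consistent with the 43–79 range quoted in the help page). Rows where `x` or `y` is 0 are dropped to avoid dividing by zero.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

# Recompute depth from the dimensions: 2 * z / (x + y), expressed in percent
depth_check <- diamonds %>%
  filter(x > 0, y > 0) %>%
  mutate(depth_calc = 100 * 2 * z / (x + y)) %>%
  select(x, y, z, depth, depth_calc)

head(depth_check)

# Typical gap between the recorded and the recomputed depth
median(abs(depth_check$depth - depth_check$depth_calc))
```

Small gaps are expected because the recorded dimensions are rounded; large gaps would point to unusual values worth a closer look.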
  • 7. Categorical (범주형) variable (변수) vs. Continuous (연속형) variable (변수)
      diamonds
      # # A tibble: 53,940 x 10
      #    carat cut       color clarity depth table price     x     y     z
      #    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
      #  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
      #  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
      #  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
      #  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
      #  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
      #  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
      #  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
      #  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
      #  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
      # 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
      # # ... with 53,930 more rows
      diamonds %>%
        count(cut)
      #> # A tibble: 5 x 2
      #>   cut           n
      #>   <ord>     <int>
      #> 1 Fair       1610
      #> 2 Good       4906
      #> 3 Very Good 12082
      #> 4 Premium   13791
      #> 5 Ideal     21551
      diamonds %>%
        count(cut_width(carat, 0.5))
      #> # A tibble: 11 x 2
      #>   `cut_width(carat, 0.5)`     n
      #>   <fct>                   <int>
      #> 1 [-0.25,0.25]              785
      #> 2 (0.25,0.75]             29498
      #> 3 (0.75,1.25]             15977
      #> 4 (1.25,1.75]              5313
      #> 5 (1.75,2.25]              2002
      #> 6 (2.25,2.75]               322
      #> # ... with 5 more rows
  • 8. Visualizing univariate distributions (일변량 분포의 시각화)
      Categorical (범주형) variable (변수): geom_bar
      ggplot(data = diamonds) +
        geom_bar(mapping = aes(x = cut))
  • 9. Visualizing univariate distributions (일변량 분포의 시각화)
      Continuous (연속형) variable (변수): geom_histogram
      ggplot(data = diamonds) +
        geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
  • 10. Visualizing univariate distributions (일변량 분포의 시각화)
      Continuous (연속형) variable (변수): geom_histogram
      ggplot(data = diamonds %>% filter(carat < 3)) +
        geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
  • 11. Visualizing univariate distributions (일변량 분포의 시각화)
      Continuous (연속형) variable (변수): geom_freqpoly
      ggplot(data = diamonds %>% filter(carat < 3)) +
        geom_freqpoly(mapping = aes(x = carat), binwidth = 0.1)
      [Figure: frequency polygon; x-axis: carat, y-axis: count]
  • 12. Visualizing bivariate distributions (이변량 분포의 시각화)
      Continuous (연속형) & Categorical (범주형): geom_freqpoly & color
      ggplot(data = diamonds %>% filter(carat < 3)) +
        geom_freqpoly(mapping = aes(x = carat, color = cut), binwidth = 0.1)
  • 13. Visualizing univariate distributions (일변량 분포의 시각화)
      • Questions to ask
          • Which values are the most common? Why?
          • Which values are rare? Why? Does that match your expectations?
          • Can you see any unusual patterns? What might explain them?
  • 14. Questions raised by the carat histogram
      • Why are there more diamonds at whole carats and common fractions of carats?
      • Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
      • Why are there no diamonds bigger than 3 carats?
      ggplot(data = diamonds %>% filter(carat < 3)) +
        geom_histogram(mapping = aes(x = carat), binwidth = 0.01)
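One way to start on the question about peaks is to count diamonds just below and just above a whole carat. This is a minimal sketch, assuming ggplot2 and dplyr are installed; `near_one` is an illustrative name, and the 0.95 to 1.05 window is an arbitrary choice made for this example.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

# Count diamonds at each recorded weight in a narrow window around 1 carat
near_one <- diamonds %>%
  filter(between(carat, 0.95, 1.05)) %>%
  count(carat)

near_one
```

If the counts jump sharply at 1.00 compared with 0.99, that matches the asymmetry seen in the histogram. The data alone do not establish why this happens, so the result is a prompt for a further question rather than an answer.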
  • 15. Visualizing univariate distributions (일변량 분포의 시각화)
      Unusual values (outliers, 특이값, 드문 값)
      ggplot(data = diamonds) +
        geom_histogram(mapping = aes(x = y), binwidth = 0.5)
  • 16. Visualizing univariate distributions (일변량 분포의 시각화)
      Unusual values (outliers, 특이값, 드문 값)
      ggplot(data = diamonds) +
        geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
        coord_cartesian(ylim = c(0, 50))
  • 17. Unusual values (outliers, 특이값, 드문 값)
      unusual <- diamonds %>%
        filter(y < 3 | y > 20) %>%
        select(price, x, y, z) %>%
        arrange(y)
      unusual
      #> # A tibble: 9 x 4
      #>   price     x     y     z
      #>   <int> <dbl> <dbl> <dbl>
      #> 1  5139  0      0    0
      #> 2  6381  0      0    0
      #> 3 12800  0      0    0
      #> 4 15686  0      0    0
      #> 5 18034  0      0    0
      #> 6  2130  0      0    0
      #> 7  2130  0      0    0
      #> 8  2075  5.15  31.8  5.12
      #> 9 12210  8.09  58.9  8.06
      #@ replacing the unusual values with missing values
      diamonds2 <- diamonds %>%
        mutate(y = ifelse(y < 3 | y > 20, NA, y))
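A quick check that the replacement behaves as intended. This is a sketch assuming ggplot2 and dplyr are installed. It uses `if_else()` with `NA_real_` as a stricter-typed variant of the `ifelse()` call on the slide; `n_missing` and `p` are illustrative names.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

# Replace implausible widths with missing values, keeping the column a double
diamonds2 <- diamonds %>%
  mutate(y = if_else(y < 3 | y > 20, NA_real_, y))

# Number of values replaced; this should equal the number of rows in `unusual`
n_missing <- sum(is.na(diamonds2$y))
n_missing

# na.rm = TRUE drops the missing rows without the warning ggplot2 would otherwise emit
p <- ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)
p
```

Replacing the values with `NA` keeps the other measurements of those nine diamonds in the data, which dropping the rows would not.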
  • 18. Visualizing univariate distributions (일변량 분포의 시각화)
      Missing values vs. non-missing values
      nycflights13::flights %>%
        mutate(
          cancelled = is.na(dep_time),
          sched_hour = sched_dep_time %/% 100,
          sched_min = sched_dep_time %% 100,
          sched_dep_time = sched_hour + sched_min / 60
        ) %>%
        ggplot(mapping = aes(sched_dep_time)) +
          geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
  • 19. More on visualizing univariate distribution
  • 20. Exploratory Data Analysis (EDA, 탐색적 데이터 분석) Covariation (공분산): Bivariate distributions (이변량 분포) Continuous (연속형) & Categorical (범주형)
  • 21. Visualizing bivariate distributions (이변량 분포의 시각화)
      Continuous (연속형) & Categorical (범주형): geom_freqpoly & color
      ggplot(data = diamonds, mapping = aes(x = price)) +
        geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
  • 22. Visualizing univariate distributions (일변량 분포의 시각화)
      Categorical (범주형) variable (변수): geom_bar
      ggplot(data = diamonds) +
        geom_bar(mapping = aes(x = cut))
  • 23. Visualizing bivariate distributions (이변량 분포의 시각화)
      Continuous (연속형) & Categorical (범주형): geom_freqpoly & color
      ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
        geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
  • 24. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Categorical (범주형) boxplot
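A boxplot summarises each group by its median, its quartiles, and whiskers. The sketch below computes those numbers by hand for `price` within each `cut`, assuming ggplot2 and dplyr are installed and assuming the default whisker rule of `geom_boxplot()`, where whiskers reach the most extreme points within 1.5 × IQR of the box. `box_stats` and its column names are illustrative and do not come from the slides.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

box_stats <- diamonds %>%
  group_by(cut) %>%
  summarize(
    q1 = quantile(price, 0.25),         # lower edge of the box
    med = median(price),                # line inside the box
    q3 = quantile(price, 0.75),         # upper edge of the box
    iqr = q3 - q1,                      # interquartile range (box height)
    lower_fence = q1 - 1.5 * iqr,       # points below this are drawn individually
    upper_fence = q3 + 1.5 * iqr,       # points above this are drawn individually
    n_outliers = sum(price < lower_fence | price > upper_fence)
  )

box_stats
```

Reading the table next to the plot makes it easier to see which part of each box corresponds to which statistic.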
  • 25. Visualizing bivariate distributions (이변량 분포의 시각화)
      Continuous (연속형) & Categorical (범주형): geom_boxplot
      ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
        geom_boxplot()
  • 26. Visualizing bivariate distributions (이변량 분포의 시각화)
      Continuous (연속형) & Categorical (범주형): geom_boxplot
      ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
        geom_boxplot()
  • 27. Visualizing bivariate distributions (이변량 분포의 시각화)
      Continuous (연속형) & Categorical (범주형): geom_boxplot
      ggplot(data = mpg) +
        geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
  • 28. Visualizing bivariate distributions (이변량 분포의 시각화)
      Continuous (연속형) & Categorical (범주형): geom_boxplot
      ggplot(data = mpg) +
        geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
        coord_flip()
  • 29. Exploratory Data Analysis (EDA, 탐색적 데이터 분석) Covariation (공분산): Bivariate distributions (이변량 분포) Categorical (범주형) & Categorical (범주형)
  • 30. Bivariate distributions (이변량 분포)
      Categorical (범주형) & Categorical (범주형)
      "long format": unique identifier (고유 식별자) = composite key (복합키)
      #@ count in a "long format": the unique identifier is the composite key consisting of color & cut
      diamonds %>%
        group_by(color, cut) %>%
        summarize(count = n()) %>%
        print(n = Inf)
      diamonds %>%
        count(color, cut) %>%
        print(n = Inf)
      # # A tibble: 35 x 3
      #    color cut           n
      #    <ord> <ord>     <int>
      #  1 D     Fair        163
      #  2 D     Good        662
      #  3 D     Very Good  1513
      #  4 D     Premium    1603
      #  5 D     Ideal      2834
      #  6 E     Fair        224
      #  7 E     Good        933
      #  8 E     Very Good  2400
      #  9 E     Premium    2337
      # 10 E     Ideal      3903
      # 11 F     Fair        312
      # 12 F     Good        909
      # 13 F     Very Good  2164
      # 14 F     Premium    2331
      # 15 F     Ideal      3826
      # 16 G     Fair        314
      # 17 G     Good        871
      # 18 G     Very Good  2299
      # 19 G     Premium    2924
      # 20 G     Ideal      4884
      # 21 H     Fair        303
      # 22 H     Good        702
      # 23 H     Very Good  1824
      # 24 H     Premium    2360
      # 25 H     Ideal      3115
      # 26 I     Fair        175
      # 27 I     Good        522
      # 28 I     Very Good  1204
      # 29 I     Premium    1428
      # 30 I     Ideal      2093
      # 31 J     Fair        119
      # 32 J     Good        307
      # 33 J     Very Good   678
      # 34 J     Premium     808
      # 35 J     Ideal       896
  • 31. Bivariate distributions (이변량 분포)
      Categorical (범주형) & Categorical (범주형)
      "wide format": one variable -> multiple columns
      #@ "wide format": one variable -> multiple columns (spread)
      diamonds %>% select(color, cut) %>% table %>% addmargins
      #      cut
      # color  Fair  Good Very Good Premium Ideal   Sum
      #   D     163   662      1513    1603  2834  6775
      #   E     224   933      2400    2337  3903  9797
      #   F     312   909      2164    2331  3826  9542
      #   G     314   871      2299    2924  4884 11292
      #   H     303   702      1824    2360  3115  8304
      #   I     175   522      1204    1428  2093  5422
      #   J     119   307       678     808   896  2808
      #   Sum  1610  4906     12082   13791 21551 53940
      #@ spread(): "long format" -> "wide format": one variable -> multiple columns
      diamonds %>%
        count(color, cut) %>%
        spread(key = cut, value = n)
      # # A tibble: 7 x 6
      #   color  Fair  Good `Very Good` Premium Ideal
      #   <ord> <int> <int>       <int>   <int> <int>
      # 1 D       163   662        1513    1603  2834
      # 2 E       224   933        2400    2337  3903
      # 3 F       312   909        2164    2331  3826
      # 4 G       314   871        2299    2924  4884
      # 5 H       303   702        1824    2360  3115
      # 6 I       175   522        1204    1428  2093
      # 7 J       119   307         678     808   896
  • 32. Visualizing bivariate distributions (이변량 분포의 시각화)
      Categorical (범주형) & Categorical (범주형): geom_count
      ggplot(data = diamonds) +
        geom_count(mapping = aes(x = cut, y = color))
  • 33. Visualizing bivariate distributions (이변량 분포의 시각화)
      Categorical (범주형) & Categorical (범주형): geom_tile
      diamonds %>%
        count(color, cut) %>%
        ggplot(mapping = aes(x = color, y = cut)) +
          geom_tile(mapping = aes(fill = n))
  • 34. Bivariate distributions (이변량 분포)
      Categorical (범주형) & Categorical (범주형)
      "long format": unique identifier (고유 식별자) = composite key (복합키)
      #@ gather(): "wide format" -> "long format": multiple columns -> one variable (key)
      #@ "wide format": one variable -> multiple columns (spread)
      # Three equivalent ways to name the columns to gather:
      diamonds %>%
        count(color, cut) %>%
        spread(key = cut, value = n) %>%
        gather(key = cut, value = n, Fair, Good, `Very Good`, Premium, Ideal)
      diamonds %>%
        count(color, cut) %>%
        spread(key = cut, value = n) %>%
        gather(key = cut, value = n, Fair:Ideal)
      diamonds %>%
        count(color, cut) %>%
        spread(key = cut, value = n) %>%
        gather(key = cut, value = n, -color)
      # > diamonds %>%
      # +   count(color, cut) %>%
      # +   spread(key = cut, value = n) %>%
      # +   gather(key = cut, value = n, -color)
      # # A tibble: 35 x 3
      #    color cut       n
      #    <ord> <chr> <int>
      #  1 D     Fair    163
      #  2 E     Fair    224
      #  3 F     Fair    312
      #  4 G     Fair    314
      #  5 H     Fair    303
      #  6 I     Fair    175
      #  7 J     Fair    119
      #  8 D     Good    662
      #  9 E     Good    933
      # 10 F     Good    909
      # # ... with 25 more rows
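In tidyr 1.0.0 and later, `spread()` and `gather()` are superseded by `pivot_wider()` and `pivot_longer()`. The sketch below repeats the long -> wide -> long round trip with those functions, assuming ggplot2, dplyr, and tidyr (>= 1.0.0) are installed. `wide` and `long` are illustrative names that do not appear in the slides.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)
library(tidyr)

# "long format" -> "wide format": one column per level of cut
wide <- diamonds %>%
  count(color, cut) %>%
  pivot_wider(names_from = cut, values_from = n)

# "wide format" -> "long format": every column except color is stacked back
long <- wide %>%
  pivot_longer(cols = -color, names_to = "cut", values_to = "n")

dim(wide)
dim(long)
```

As with `gather()`, the `cut` column comes back as character rather than as an ordered factor, so the level order has to be restored explicitly if it matters for plotting.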
  • 35. Exploratory Data Analysis (EDA, 탐색적 데이터 분석) Covariation (공분산): Bivariate distributions (이변량 분포) Continuous (연속형) & Continuous (연속형)
  • 36. Visualizing bivariate distributions (이변량 분포의 시각화)
      Continuous (연속형) & Continuous (연속형): geom_point
      ggplot(data = diamonds) +
        geom_point(mapping = aes(x = carat, y = price))
  • 37. Visualizing bivariate distributions (이변량 분포의 시각화)
      Continuous (연속형) & Continuous (연속형): geom_point
      ggplot(data = diamonds) +
        geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
  • 38. Visualizing bivariate distributions (이변량 분포의 시각화)
      Continuous (연속형) & Continuous (연속형): geom_bin2d
      ggplot(data = diamonds %>% filter(carat < 3)) +
        geom_bin2d(mapping = aes(x = carat, y = price))
  • 39. Visualizing bivariate distributions (이변량 분포의 시각화)
      Continuous (연속형) & Continuous (연속형): geom_hex
      # install.packages("hexbin")
      ggplot(data = diamonds %>% filter(carat < 3)) +
        geom_hex(mapping = aes(x = carat, y = price))
  • 40. Visualizing bivariate distributions (이변량 분포의 시각화)
      Continuous (연속형) & Continuous (연속형) -> Continuous (연속형) & Categorical (범주형)
      ggplot(diamonds %>% filter(carat < 3), mapping = aes(x = carat, y = price)) +
        geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
  • 41. More on visualizing bivariate distribution
  • 42. More on visualizing trivariate distribution
  • 43. REFERENCES
      #1. RStudio Official Documentation (Help & Cheat Sheets). Free webpage: https://www.rstudio.com/resources/cheatsheets/
      #2. Wickham, H. and Grolemund, G., 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Free webpage: https://r4ds.had.co.nz/
      Cf) Tidyverse syntax (www.tidyverse.org), rather than base R syntax
      Cf) Hadley Wickham: Chief Scientist at RStudio; Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University