How to analyse new dataset in R? What libraries to use, and what commands? How to understand your dataset in few minutes? Read my presentation for Data Science Club by Exponea and find out!
3. Martin Bago
Data Scientist | Instarea
Ing. @ Process Automation and Informatization in Industry (2016, MTF STU BA)
Bc. @ Applied Informatics (2014, FEI STU BA)
2017- now Data Scientist, Instarea s.r.o., Market Locator
2015-2016 Head of Analyst, News and Media Holding a.s.
2014-2015 SEO Analyst, Centrum Holdings a.s.
2011-2014 Automix.sk, Centrum Holdings a.s.
2010-2013 Editor-in-chief OKO Casopis (FEI STU BA)
Passionate driver, beer&coffee&football lover
6. Dataset
>> install.packages("datasets") #installing datasets package in R
>> library(datasets)
For studying there is an unique library consisting of many real-life dataset examples (from Monthly
Airline Passenger Numbers, thru Weight versus age of chicks on different diets to Monthly Deaths from
Lung Diseases in the UK) .
For this presentation we will use mtcars dataset.
How to find&use
7. Baby steps
head(), tail(), nrow() and ncol()
To understand, what are you working with is very important to see dimensions of dataset a number/count
of values.
>> head(mtcars)
>> tail(mtcars)
>> head(mtcars, 25)
>> nrow(mtcars)
>> ncol(mtcars)
Input: Output:
8. Deeper insight
str(), summary()
To deeper understanding of dataset use detailed views of metrics and
dimensions.
>> str(mtcars)
>> summary(mtcars)
Input: Output:
Always check data types!!!
Source
9. Unique and missing values
unique(), is.na()
Is crucial to find, how many values are missing from the dataset. If there is 2/3 missing,
you got wrong dataset.
>> unique(mtcars$cyl)
>> is.na(mtcars)
Input: Output:
If there is something missing, you can
use old&good method to treat that –
filling with mean.
>> mtcars$smt[is.na(mtcars$smt)] <-
mean(mtcars$smt, na.rm = TRUE)
10. Histograms
hist()
The best way to learn and understand, is visual
>> hist(mtcars$mpg)
>> hist(mtcars$hp)
Input: Output:
Output:
11. Transforming and recalculating
Often you need to calculate your own metrics. In R, it’s really
easy.
>> mtcars2 <- mtcars
>> mtcars2$disp_l <- mtcars$mpg/61.024
>> mtcars2$kml <- 235/mtcars$mpg
>> hist(mtcars2$disp_l)
Input: Output:
12. Understand the scope of
variablesboxplot()
>> boxplot(mtcars)
>> boxplot(mtcars2$disp_l, mtcars2$kml)
>> boxplot(mtcars2$kml, main = "mtcars dataset",
xlab = "Comsumption per 100km", ylab = "Liters")
Input:
Output:
Output:
14. Does it correlate?
Library(corplot), cor()
>> install.packages("corrplot")
>> library(corrplot)
>> #cor(x, method = "pearson", use = "complete.obs")
>> cor(mtcars)
Input:
Output: Not very intuitive…
15. Does it correlate?
Library(corplot), cor()
>> res <- cor(mtcars)
>> round(res, 2)
>> corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 25)
Input: Output:
! Becareful !
Correlation is not causality
16. Heatmap via corrplot library
>> library(corrplot)
>> col<- colorRampPalette(c("blue", "white", "red"))(20)
>> heatmap(x = res, col = col, symm = TRUE)
Input: Output:
Does it correlate?
17. Or even deeper insight…
>>require(graphics)
pairs(mtcars2, main = "mtcars2 data", gap = 1/4)
coplot(kml ~ disp_l | as.factor(cyl), data = mtcars2,
panel = panel.smooth, rows = 1)
## possibly more meaningful, e.g., for summary() or
bivariate plots:
mtcars2 <- within(mtcars2, {
vs <- factor(vs, labels = c("V", "S"))
am <- factor(am, labels = c("automatic", "manual"))
cyl <- ordered(cyl)
gear <- ordered(gear)
carb <- ordered(carb)
})
summary(mtcars2)
Input: Output:
Library(corplot), cor()
22. Stay in touch
Instarea s.r.o.
29. Augusta 36/A
811 09 Bratislava
www.instarea.com
Martin Bago
Data Scientist
Instarea
martin.bago@instarea.com
+421 905 255 852
https://www.linkedin.com/in/martinbago/
Thank you!