The Data Analyst’s
Toolkit
Introduction to R
Jen Stirrup | Data Relish Ltd | June 2014
Jen.Stirrup@datarelish.com
Note
• This presentation was part of a full day workshop on Power BI and R,
held at SQLBits in 2014
• This is a sample, provided to help you see if my one day Business
Intelligence Masterclass is the right course for you.
• http://bit.ly/BusinessIntelligence2016Masterclass
• In that course, you’ll be given updated notes along with a hands-on
session, so why not join me?
Course Outline
• Module 1: Setting up your data for R with Power Query
• Module 2: Introducing R
• Module 3: The Big Picture: Putting Power BI and R together
• Module 4: Visualising your data with Power View and Excel 2013
• Module 5: Power Map
• Module 6: Wrap up and Q and A
What is R?
• R is a powerful environment for statistical computing
• It is an overgrown calculator
• … which lets you save results in variables
x <- 3
y <- 5
z = 4
x + y + z
Vectors in R
• To create a vector of elements, use the c() function:
v = c("hello","world","welcome","to","the class.")
v = seq(1,100)
v[1]
v[1:10]
• Subscripting: the square-bracket operator lets you extract values from a vector.
• Put a logical expression inside the square brackets to retrieve a subset of a vector or list. For
example:
Vectors in R
v = seq(1,100)
logi = v>95
logi
v[logi]
v[v<6]
v[105]=105
v[is.na(v)]
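The lines above can be read as the following sketch, which stores each result so it can be inspected. The names top, low and gap are introduced here for illustration only.

```r
v <- seq(1, 100)

# A comparison returns a logical vector the same length as v.
logi <- v > 95
top <- v[logi]           # keeps only the elements where logi is TRUE
low <- v[v < 6]          # the same idea with the test written inline

# Assigning past the end grows the vector; the skipped positions become NA.
v[105] <- 105
gap <- which(is.na(v))   # positions of the NA padding

print(top)
print(low)
print(gap)
```

This is why v[is.na(v)] on the slide returns only NA values: positions 101 to 104 were never assigned.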
Save and Load RData
Data is saved in R as .Rdata files
Imported back again with load
a <- 1:10
save(a, file = "E:/MyData.Rdata")
rm(a)
load("E:/MyData.Rdata")
print(a)
Import From CSV Files
• A simple way to load in data is to read in a CSV.
• read.csv()
• MyDataFrame <- read.csv("filepath.csv")
• print(MyDataFrame)
Import From CSV Files
• Go to Tools in RStudio, and select Import
Dataset.
• Select the file CountryCodes.csv and select the
Import button.
• In RStudio, you will now see the data in the data
pane.
Import From CSV Files
The console window will show the following:
> #import dataset
> CountryCodes <- read.csv("C:/Program Files/R/R-3.1.0/Working Directory/CountryCodes.csv", header=F)
> View(CountryCodes)
Once the data is imported, we can check the
data.
dim(CountryCodes)
head(CountryCodes)
tail(CountryCodes)
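A minimal sketch of the same import-and-check steps that runs without the course files. The country codes below are invented stand-ins for CountryCodes.csv, and the file is written to a temporary path rather than the working directory shown above.

```r
# Hypothetical stand-in for CountryCodes.csv (contents invented for illustration).
codes <- data.frame(code = c("GB", "FR", "DE"),
                    name = c("United Kingdom", "France", "Germany"))

# Write a header-less CSV to a temporary file, then read it back.
path <- tempfile(fileext = ".csv")
write.table(codes, path, sep = ",", row.names = FALSE, col.names = FALSE)

CountryCodes <- read.csv(path, header = FALSE)

dim(CountryCodes)    # number of rows and columns
head(CountryCodes)   # first rows; columns are named V1, V2 because header = FALSE
tail(CountryCodes)   # last rows
```

With header = FALSE, R does not treat the first row as column names, so it generates V1, V2 and so on.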
Import / Export via ODBC
• The RODBC package provides R with a connection to ODBC databases
• library(RODBC)
• myodbcConnect <- odbcConnect(dsn="servername", uid="userid", pwd="******")
Import / Export via ODBC
• myQuery <- "SELECT * FROM lib.table WHERE ..."
• # or read query from file
myQuery <- readChar("E:/MyQueries/myQuery.sql", nchars=99999)
myData <- sqlQuery(myodbcConnect, myQuery, errors=TRUE)
odbcCloseAll()
Import/Export from Excel Files
• RODBC also works for importing data from Excel
files
• library(RODBC)
• filename <- "E:/Rtmp/dummmyData.xls"
• myxlsFile <- odbcConnectExcel(filename, readOnly = FALSE)
• sqlSave(myxlsFile, a, rownames = FALSE)
• b <- sqlFetch(myxlsFile, "a")
• odbcCloseAll()
Anscombe’s Quartet
Property | Value
Mean of X | 9
Variance of X | 11
Mean of Y | 7.50
Variance of Y | 4.1
Correlation | 0.816
Linear regression | Y = 3.00 + 0.500X
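These summary values can be checked in R, since the quartet ships with base R as the anscombe data frame. The sketch below computes the same properties for each of the four data sets; the stats data frame and its column names are introduced here for illustration.

```r
# anscombe is a built-in data frame with columns x1..x4 and y1..y4.
stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)                      # least-squares line for this data set
  data.frame(set       = i,
             mean_x    = mean(x),
             var_x     = var(x),
             mean_y    = mean(y),
             var_y     = var(y),
             cor_xy    = cor(x, y),
             intercept = coef(fit)[[1]],
             slope     = coef(fit)[[2]])
}))

print(round(stats, 3))
```

The four rows come out nearly identical, which is the point of the quartet: the summary statistics alone cannot tell the data sets apart.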
What does Anscombe’s Quartet look like?
Looks good, doesn’t it?
So, is it correct?
Correlation r = 0.96
Year | Number of people who died by becoming tangled in their bedsheets, deaths (US) (CDC) | Total revenue generated by skiing facilities (US), dollars in millions (US Census)
2000 | 327 | 1,551
2001 | 456 | 1,635
2002 | 509 | 1,801
2003 | 497 | 1,827
2004 | 596 | 1,956
2005 | 573 | 1,989
2006 | 661 | 2,178
2007 | 741 | 2,257
2008 | 809 | 2,476
2009 | 717 | 2,438
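The correlation can be recomputed from the two series above. This is a small sketch; the vector names deaths and revenue are chosen here for illustration.

```r
# Values copied from the table, 2000 to 2009.
deaths  <- c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717)
revenue <- c(1551, 1635, 1801, 1827, 1956, 1989, 2178, 2257, 2476, 2438)

r <- cor(deaths, revenue)   # Pearson correlation coefficient
print(round(r, 3))
```

A coefficient this close to 1 says the two series move together. It says nothing about whether one causes the other.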
R and Power BI together
• Pivot Tables are not always enough
• Scaling Data (ScaleR)
• R is very good at static data visualisation
• Upworthy
Why R?
• Most widely used data analysis software: used by over 2 million data
scientists, statisticians and analysts
• Most powerful statistical programming language
• flexible, extensible and comprehensive for productivity
• Create beautiful and unique data visualisations - as seen in New
York Times, Twitter and Flowing Data
• Thriving open-source community - leading edge of analytics
research
• Fills the talent gap - new graduates prefer R.
Growth in Demand
• Rexer Data Mining survey, 2007 - 2013
• R is the highest paid IT skill - Dice.com, Jan 2014
• R is the most-used data science language after SQL -
O'Reilly, Jan 2014
• R is used by 70% of data miners - Rexer, Sept 2013
Growth in Demand
• R is #15 of all programming languages.
• RedMonk, Jan 2014
• R growing faster than any other data science
language.
• KDNuggets.
• R is in-memory and limited in the size of data that
you can process.
What are we testing?
• We have one or two samples and a hypothesis,
which may be true or false.
• The NULL hypothesis – nothing happened.
• The Alternative hypothesis – something did happen.
Strategy
• We set out to prove that something did happen.
• We look at the distribution of the data.
• We choose a test statistic
• We look at the p value
How small is too small?
• How do we know when the p-value is small enough?
• p >= 0.05: we fail to reject the null hypothesis
• p < 0.05: we reject the null hypothesis in favour of the alternative
• It depends
• For high-risk decisions, we may want a stricter threshold such as 0.01 or even
0.001.
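A short sketch of this decision rule, using the built-in sleep data set rather than any course data. The 0.05 threshold and the variable names alpha, p and decision are assumptions made for the example.

```r
# sleep is a built-in data set: extra hours of sleep under two drugs.
result <- t.test(extra ~ group, data = sleep)   # Welch two-sample t-test

alpha <- 0.05                                   # conventional threshold, chosen for this sketch
p <- result$p.value
decision <- if (p < alpha) {
  "reject the null hypothesis"
} else {
  "fail to reject the null hypothesis"
}

cat("p-value:", round(p, 4), "\n")
cat("At alpha =", alpha, "we", decision, "\n")
```

Here the p-value lands a little above 0.05, so the test does not reject the null hypothesis. That is not the same as showing the null hypothesis is true.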
Confidence Intervals
• Basically, how confident are you that you can
extrapolate from your little data set to the larger
population?
• We can look at the mean
• To do this, we run a t.test
• t.test(vector)
Confidence Intervals
• Basically, how confident are you that you can
extrapolate from your little data set to the larger
population?
• We can look at the median
• To do this, we run a Wilcoxon test.
• wilcox.test(vector, conf.int = TRUE)
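Both intervals can be pulled out of the test results directly. The sketch below uses a made-up sample of ten measurements; x, mean_ci and median_ci are names introduced for illustration.

```r
# Hypothetical sample: ten distinct measurements (no ties).
x <- c(4.1, 5.3, 4.8, 6.2, 5.9, 5.1, 4.6, 6.8, 5.5, 5.0)

# 95% confidence interval for the mean, from a one-sample t-test.
mean_ci <- t.test(x)$conf.int

# 95% confidence interval for the (pseudo)median, from the Wilcoxon signed-rank test.
# conf.int = TRUE is needed; the interval is not computed by default.
median_ci <- wilcox.test(x, conf.int = TRUE)$conf.int

print(mean_ci)
print(median_ci)
```

The Wilcoxon interval makes fewer assumptions about the shape of the distribution, which is why it is the usual choice when the data are not close to normal.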
Calculate the relative frequency
• How much is above, or below, the mean?
• mean(after > before)
• mean(abs(x - mean(x)) > 2 * sd(x))
• This gives you the fraction of the data that lies more
than two standard deviations from the mean.
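The trick works because R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the fraction of TRUE values. A small sketch with made-up numbers:

```r
# Hypothetical paired measurements.
before <- c(10, 12, 9, 11, 13)
after  <- c(12, 11, 10, 14, 15)

improved <- mean(after > before)   # fraction of cases where after exceeds before

# Hypothetical sample with one extreme value.
x <- c(1:9, 100)
far_out <- mean(abs(x - mean(x)) > 2 * sd(x))   # fraction beyond two standard deviations

print(improved)
print(far_out)
```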
Testing Categorical Variables for
Independence
• Chi-squared test: are two variables independent, or are
they connected in some way?
• Summarise the data first: summary(table(initial,
outcome))
• chisq.test
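A minimal sketch of the test on invented data. The vectors initial and outcome reuse the names from the slide, but their contents are made up for the example.

```r
# Made-up example: 80 cases, two groups, two outcomes.
initial <- rep(c("A", "B"), each = 40)
outcome <- c(rep("yes", 30), rep("no", 10),    # group A
             rep("yes", 10), rep("no", 30))    # group B

counts <- table(initial, outcome)   # contingency table of the two factors
print(counts)

result <- chisq.test(counts)        # applies Yates' continuity correction for 2x2 tables
cat("p-value:", signif(result$p.value, 3), "\n")
cat("evidence against independence at 0.05:", result$p.value < 0.05, "\n")
```

A small p-value suggests the two variables are not independent. The test does not say how they are connected.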
How Statistics answers your question
• Is our model significant or insignificant? – The F statistic
• What is the quality of the model? – The R² statistic
• How well do the data points fit the model? – The R² statistic
What do the values mean together?
Type of analysis | Test statistic | How can you tell if it is significant? | What is the assumption you can make?
Regression analysis | F | Big F, small p < 0.05 | A general relationship between the predictors and the response
Regression analysis | t | Big t (> +2.0 or < -2.0), small p < 0.05 | X is an important predictor
Difference of means | t (two tailed) | Big t (> +2.0 or < -2.0), small p < 0.05 | Significant difference of means
Difference of means | t (one tailed) | Big t (> +2.0 or < -2.0), small p < 0.05 | Significant difference of means
What is Regression?
Using predictors to predict a response
Using independent variables to predict a dependent variable
Example: Credit score is a response, predicted by spend,
income, location and so on.
Linear Regression using World Bank data
We can look at predicting using World Bank data
Year <-
GDP <- (wdiData, )
plot(wdiData,
cor(year, wdiData)
fit <- lm(cpi ~ year + quarter)
fit
Examples of Data Mining in R
cpi2011 <- fit$coefficients[[1]] + fit$coefficients[[2]]*2011 +
fit$coefficients[[3]]*(1:4)
attributes(fit)
fit$coefficients
residuals(fit) # difference between observed and fitted values
summary(fit)
plot(fit)
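The slides do not show the data behind fit, so the sketch below fills it in under stated assumptions: three years of quarterly observations (2008 to 2010, inferred from the 12 points and the 2011 forecast used later) and illustrative CPI values that are placeholders, not the workshop's data. It shows that the manual coefficient arithmetic and predict() give the same forecast.

```r
# Three years of quarterly observations, matching the year + quarter model.
year    <- rep(2008:2010, each = 4)
quarter <- rep(1:4, times = 3)

# Illustrative CPI values only; placeholders, not the workshop's data.
cpi <- c(162.2, 164.6, 166.5, 166.0, 166.2, 167.0,
         168.6, 169.5, 171.0, 172.1, 173.3, 174.0)

fit <- lm(cpi ~ year + quarter)

# Manual forecast: intercept + b_year * 2011 + b_quarter * (1:4)
b <- fit$coefficients
manual <- b[[1]] + b[[2]] * 2011 + b[[3]] * (1:4)

# predict() does the same arithmetic from a data frame of new values.
cpi2011 <- predict(fit, newdata = data.frame(year = 2011, quarter = 1:4))

print(manual)
print(unname(cpi2011))
```

Using predict() is usually the safer habit, because it keeps working when the model formula changes.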
What is Data Mining
Machine Learning
Statistics
Software Engineering and Programming with Data
Intuition
Fun!
The Why of Data Mining
to discover new knowledge
to improve business outcomes
to deliver better customised services
Examples of Data Mining in R
Logistic Regression (glm)
Decision Trees (rpart, wsrpart)
Random Forests (randomForest, wsrf)
Boosted Stumps (ada)
Neural Networks (nnet)
Support Vector Machines (kernlab)
Examples of Data Mining in R
• Packages: – fpc – cluster – pvclust – mclust
• Partitioning-based clustering: kmeans, pam, pamk,
clara
• Hierarchical clustering: hclust, pvclust, agnes, diana
• Model-based clustering: mclust
• Density-based clustering: dbscan
• Plotting cluster solutions: plotcluster, plot.hclust
• Validating cluster solutions: cluster.stats
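As a rough illustration of two of these families, the sketch below uses only functions from base R (kmeans, hclust, cutree) on the built-in iris data. The choice of three clusters and of complete linkage are assumptions made for the example.

```r
# Four numeric measurements from the built-in iris data, standardised.
x <- scale(iris[, 1:4])

# Partitioning-based: k-means with k = 3 (k is an assumption for this sketch).
set.seed(42)                        # k-means starts from random centres
km <- kmeans(x, centers = 3, nstart = 10)

# Hierarchical: agglomerative clustering on Euclidean distances, cut into 3 groups.
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 3)

# Cross-tabulate each solution against the known species.
print(table(kmeans = km$cluster, species = iris$Species))
print(table(hclust = hc_groups, species = iris$Species))
```

Comparing the two tables shows how much the answer can depend on the method, which is why validating a cluster solution matters.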
How can we make it easier?
• AzureML
The Data Mining Process
• Load data
• Choose your variables
• Sample the data into test and training sets (usually 70/30 split)
• Explore the distributions of the data
• Test some distributions
• Transform the data if required
• Build clusters with the data
• Build a model
• Evaluate the model
• Log the data process for auditing externally
Loading the Data
• dsname is the name of our dataset
• ds <- get(dsname)
• dim(ds)
• names(ds)
Explore the data
• head(dataset)
• tail(dataset)
Explore the data’s structure
• str(dataset)
• summary(dataset)
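The loading and exploring steps can be tried on any data frame. This sketch assumes the built-in airquality data set as a stand-in for the course data.

```r
dsname <- "airquality"   # a built-in data set, standing in for the course data
ds <- get(dsname)        # look the object up by its name

dim(ds)                  # number of rows and columns
names(ds)                # column names
head(ds)                 # first six rows
tail(ds)                 # last six rows
str(ds)                  # type of each column
summary(ds)              # per-column summary, including NA counts
```

Holding the name in dsname means the rest of a script can be reused on a different data set by changing one line.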
Pick out the Variables
• id <- c("Date", "Location")
• target <- "RainTomorrow"
• risk <- "RISK_MM"
• (ignore <- union(id, risk))
• (vars <- setdiff(names(ds), ignore))
Remove Missing Data
dim(ds)
## [1] 366 24
sum(is.na(ds[vars]))
## [1] 47
ds <- ds[-attr(na.omit(ds[vars]), "na.action"),]
dim(ds)
## [1] 328 24
sum(is.na(ds[vars]))
## [1] 0
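A runnable sketch of the same two steps, choosing variables and then dropping incomplete rows, on the built-in airquality data. The choice of id and target columns is an assumption for the example, and the guard against a NULL attribute is an addition to the slide's one-liner.

```r
ds <- airquality                  # built-in data set with missing values

id     <- c("Month", "Day")       # identifier columns (assumed for this sketch)
target <- "Ozone"                 # assumed target variable
ignore <- id
vars   <- setdiff(names(ds), ignore)

missing_before <- sum(is.na(ds[vars]))

# na.omit() records the dropped row numbers in the "na.action" attribute.
# The attribute is NULL when nothing is missing, and ds[-NULL, ] would
# return zero rows, so guard the subsetting.
omitted <- attr(na.omit(ds[vars]), "na.action")
if (!is.null(omitted)) ds <- ds[-omitted, ]

missing_after <- sum(is.na(ds[vars]))

cat("missing before:", missing_before, " after:", missing_after, "\n")
cat("rows kept:", nrow(ds), "\n")
```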
Clean Data Target as Categorical Data
summary(ds[target])
## RainTomorrow
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.183
## 3rd Qu.:0.000
## Max. :1.000
....
ds[target] <- as.factor(ds[[target]])
levels(ds[[target]]) <- c("No", "Yes")
summary(ds[target])
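A small sketch of the conversion on a made-up 0/1 column, to show what changes. Note the double brackets when setting levels: they address the column itself rather than a one-column data frame.

```r
ds <- data.frame(RainTomorrow = c(0, 0, 1, 0, 1, 0))   # made-up 0/1 target
target <- "RainTomorrow"

ds[[target]] <- as.factor(ds[[target]])   # levels are now "0" and "1"
levels(ds[[target]]) <- c("No", "Yes")    # relabel in level order: 0 -> No, 1 -> Yes

summary(ds[target])                       # now reports a count per level
```

After the conversion, summary() reports counts of No and Yes instead of a mean and quartiles, and modelling functions treat the target as a class label.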
Model Preparation
(form <- formula(paste(target, "~ .")))
## RainTomorrow ~ .
(nobs <- nrow(ds))
## [1] 328
train <- sample(nobs, 0.70*nobs)
length(train)
## [1] 229
test <- setdiff(1:nobs, train)
length(test)
## [1] 99
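The split itself needs only the row count, so it can be sketched without the data. The seed below is an assumption added so the example is repeatable, and floor() is used to make the rounding of 70% explicit.

```r
set.seed(123)                              # only so the sketch is repeatable
nobs  <- 328                               # row count from the slide
train <- sample(nobs, floor(0.70 * nobs))  # 70% of row indices, without replacement
test  <- setdiff(1:nobs, train)            # the remaining rows

length(train)
length(test)
```

Because test is built with setdiff(), no row can appear in both sets, and every row ends up in exactly one of them.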
Random Forest
library(randomForest)
model <- randomForest(form, ds[train, vars], na.action=na.omit)
model
##
## Call:
## randomForest(formula=form, data=ds[train, vars], ...
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
....
Evaluate the Model – Risk Chart
pr <- predict(model, ds[test,], type="prob")[,2]
riskchart(pr, ds[test, target], ds[test, risk],
title="Random Forest - Risk Chart", risk=risk,
recall=target, thresholds=c(0.35, 0.15))
Linear Regression
• X: predictor variable
• Y: response variable
• lm(y ~ x, data = dataframe)
Multiple Linear Regression
• lm is used again
• lm(y ~ x + u + v, data = dataframe)
• It is better to keep the data in one data
frame because it is easier to manage.
Getting Regression Statistics
• Save the model to a variable:
• m <- lm(y ~ x + u + v)
• Then use regression statistics to get the values that you need
from m.
Getting Regression Statistics
• anova(m)
• coefficients(m) / coef(m)
• confint(m)
• effects(m)
• fitted(m)
• residuals(m)
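A sketch of these extractor functions on a real model. It assumes the built-in mtcars data, with mpg as the response and wt, hp and disp playing the roles of x, u and v.

```r
# mtcars is built in; all variables live in one data frame.
m <- lm(mpg ~ wt + hp + disp, data = mtcars)

anova(m)             # analysis-of-variance table
coef(m)              # estimated coefficients (same as coefficients(m))
confint(m)           # 95% confidence intervals for the coefficients
head(fitted(m))      # fitted values
head(residuals(m))   # observed minus fitted
```

Fitted values plus residuals give back the observed response, which is a quick sanity check on any model object.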
Getting regression statistics
• The most important one is summary(m). It shows:
• Estimated coefficients
• Critical statistics such as R² and the F statistic
• The output is hard to read so we will write it out to Excel.
Understanding the Regression Summary
• The model summary gives you the most important
regression statistics, such as the residuals,
coefficients and significance codes.
• The most important one is the F statistic.
• You can check whether the residuals are normally
distributed. How can you tell?
Understanding the Regression Summary
• The sign of the residuals' median is important, e.g. a
negative median suggests that there is a skew to
the left.
• The quartiles will also help. Ideally, Q1 and Q3 should
have about the same magnitude. If not, the residuals
are skewed. This could be inconsistent with the
median result.
• The residual summary also helps us to identify outliers.
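The residual figures discussed above can be computed directly. This sketch assumes a model fitted to the built-in mtcars data; the model is an example, not the one from the workshop.

```r
m <- lm(mpg ~ wt + hp + disp, data = mtcars)   # example model on built-in data

res <- residuals(m)
q <- quantile(res)     # Min, Q1 (25%), Median (50%), Q3 (75%), Max
print(round(q, 3))

# Symmetric residuals have a median near 0 and |Q1| close to |Q3|.
cat("median:", round(median(res), 3), "\n")
cat("|Q1| / |Q3|:", round(abs(q[["25%"]]) / abs(q[["75%"]]), 3), "\n")
```

These are the same five numbers that summary(m) prints under the Residuals heading.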
Coefficients and R
• The Estimate column contains the estimated regression
coefficients, calculated using the least squares
method. This is the most common method.
• How likely is it that the true coefficients are zero? The
estimates alone do not tell you. That is the purpose of the
t value and Pr(>|t|) columns.
Coefficients and R
• The p-value is the probability of seeing an estimate at
least this extreme if the true coefficient were zero. The
lower, the better. The Signif. codes column shows which
conventional threshold each p-value falls below.
Coefficients and R
• R² is the coefficient of determination. How
successful is the model? We look at this value.
Bigger is better. It is the proportion of the variance of y
that is explained by the regression model. The remaining
variance is not explained by the model. The adjusted
value takes into account the number of variables in
the model.
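The coefficient table and both R² values can be extracted from the summary object rather than read off the printout. The sketch assumes a model on the built-in mtcars data, and the 0.05 cut-off is a convention chosen for the example.

```r
m <- lm(mpg ~ wt + hp + disp, data = mtcars)   # example model on built-in data
s <- summary(m)

coefs <- coef(s)     # matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
print(round(coefs, 4))

# Coefficients whose p-value falls below 0.05.
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
print(significant)

cat("R-squared:", round(s$r.squared, 3), "\n")
cat("Adjusted R-squared:", round(s$adj.r.squared, 3), "\n")
```

The adjusted value is never larger than the plain R², because it is penalised for each extra predictor.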
First Impressions
• Plotting the model can help you to investigate it
further.
• m <- lm(y ~ x)
• plot(m, which=1)
• library(car)
• outlierTest(m) (called outlier.test in older versions of car)
First Impressions?
• How do you go about it?
• Check the plot first; how does it look?
The F Statistic
• Is the model significant or insignificant? This is the purpose of
the F statistic.
• Check the F statistic first because if it is not significant, then the
model doesn’t matter.
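The F statistic and its p-value are stored in the summary object. This sketch assumes a model on the built-in mtcars data and recomputes the p-value from the F distribution, since summary() prints it but does not store it as a separate element.

```r
m <- lm(mpg ~ wt + hp + disp, data = mtcars)   # example model on built-in data
f <- summary(m)$fstatistic                     # named vector: value, numdf, dendf

# p-value of the overall F test: upper tail of the F distribution.
p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

cat("F =", round(f[["value"]], 2), "on", f[["numdf"]], "and", f[["dendf"]], "DF\n")
cat("model is significant at 0.05:", p < 0.05, "\n")
```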
Significance Stars
The stars are shorthand for significance levels,
with the number of asterisks displayed according to
the p-value computed: *** for high significance and
* for low significance. In this case, *** indicates that
it is unlikely that no relationship exists between the
heights of parents and the heights of their children.
Plot the Predicted Values
data2011 <- data.frame(year=2011, quarter=1:4)
cpi2011 <- predict(fit, newdata=data2011)
style <- c(rep(1,12), rep(2,4))
plot(c(cpi, cpi2011), xaxt="n", ylab="CPI", xlab="", pch=style, col=style)
axis(1, at=1:16, las=3,
labels=c(paste(year,quarter,sep="Q"), "2011Q1", "2011Q2", "2011Q3", "2011Q4"))
How to get Help
example(rnorm)
Rseek.org
Resources
• Introductory Statistics with R by Peter Dalgaard. Good for beginners.
• The Art of R Programming
• http://www.r-project.org
• CRAN sites – Comprehensive R Archive Network

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceVignesh Prajapati
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesSpringPeople
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
 
Data Culture Series - Keynote - 16th September 2014
Data Culture Series - Keynote - 16th September 2014Data Culture Series - Keynote - 16th September 2014
Data Culture Series - Keynote - 16th September 2014Jonathan Woodward
 
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Simplilearn
 
From Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data EngineeringFrom Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data EngineeringRy Walker
 
Big Data with Not Only SQL
Big Data with Not Only SQLBig Data with Not Only SQL
Big Data with Not Only SQLPhilippe Julio
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceCaserta
 
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez BlanchfieldBig Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez BlanchfieldDez Blanchfield
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsChandan Rajah
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data ScienceAjay Ohri
 
Data Discoverability at SpotHero
Data Discoverability at SpotHeroData Discoverability at SpotHero
Data Discoverability at SpotHeroMaggie Hays
 
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013Jen Stirrup
 
Data Visualization and Analysis
Data Visualization and AnalysisData Visualization and Analysis
Data Visualization and AnalysisDaniel Rangel
 
Documenting Data Transformations
Documenting Data TransformationsDocumenting Data Transformations
Documenting Data TransformationsARDC
 
Visual Analytics in Big Data
Visual Analytics in Big DataVisual Analytics in Big Data
Visual Analytics in Big DataSaurabh Shanbhag
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data ScienceDhiana Deva
 
Towards Visualization Recommendation Systems
Towards Visualization Recommendation SystemsTowards Visualization Recommendation Systems
Towards Visualization Recommendation SystemsAditya Parameswaran
 

Was ist angesagt? (20)

Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practices
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
 
Data Culture Series - Keynote - 16th September 2014
Data Culture Series - Keynote - 16th September 2014Data Culture Series - Keynote - 16th September 2014
Data Culture Series - Keynote - 16th September 2014
 
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
 
From Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data EngineeringFrom Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data Engineering
 
Big Data with Not Only SQL
Big Data with Not Only SQLBig Data with Not Only SQL
Big Data with Not Only SQL
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez BlanchfieldBig Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data Science
 
Data Discoverability at SpotHero
Data Discoverability at SpotHeroData Discoverability at SpotHero
Data Discoverability at SpotHero
 
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
 
Data Visualization and Analysis
Data Visualization and AnalysisData Visualization and Analysis
Data Visualization and Analysis
 
Documenting Data Transformations
Documenting Data TransformationsDocumenting Data Transformations
Documenting Data Transformations
 
Visual Analytics in Big Data
Visual Analytics in Big DataVisual Analytics in Big Data
Visual Analytics in Big Data
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Towards Visualization Recommendation Systems
Towards Visualization Recommendation SystemsTowards Visualization Recommendation Systems
Towards Visualization Recommendation Systems
 

Andere mochten auch

Coursera predmachlearn 2016
Coursera predmachlearn 2016Coursera predmachlearn 2016
Coursera predmachlearn 2016Annette Taylor
 
Pons lumbierres eizaguirre albajes 2006 bsvp
Pons lumbierres eizaguirre albajes 2006 bsvpPons lumbierres eizaguirre albajes 2006 bsvp
Pons lumbierres eizaguirre albajes 2006 bsvpRete21. Huesca
 
Schneider Portfolio 2005 2007
Schneider Portfolio 2005   2007Schneider Portfolio 2005   2007
Schneider Portfolio 2005 2007David Schneider
 
Pons &amp; lumbierres 2010 bsvp
Pons &amp; lumbierres 2010 bsvpPons &amp; lumbierres 2010 bsvp
Pons &amp; lumbierres 2010 bsvpRete21. Huesca
 
Informacion del programa de mercadeo universidad de los llanos
Informacion del  programa de mercadeo universidad de los llanosInformacion del  programa de mercadeo universidad de los llanos
Informacion del programa de mercadeo universidad de los llanosmercadeounillanos
 
Ozon kasimov ekbpromo_kazan
Ozon kasimov ekbpromo_kazanOzon kasimov ekbpromo_kazan
Ozon kasimov ekbpromo_kazanekbpromo
 
Разработка моделей компетенций
Разработка моделей компетенцийРазработка моделей компетенций
Разработка моделей компетенцийАнастасия Смелова
 
cervical cancer
 cervical cancer cervical cancer
cervical cancermt53y8
 
Catalogo AZENKA - sandraluz
Catalogo AZENKA - sandraluzCatalogo AZENKA - sandraluz
Catalogo AZENKA - sandraluzSandra Luz
 
гігієна харчування
гігієна харчуваннягігієна харчування
гігієна харчуванняvaleria karnatovska
 
Alcohol y odontologia
Alcohol y odontologiaAlcohol y odontologia
Alcohol y odontologiaFedeVillani
 

Andere mochten auch (17)

ICD-10, Documentation-Education, Preparation and You-Demo
ICD-10, Documentation-Education, Preparation and You-DemoICD-10, Documentation-Education, Preparation and You-Demo
ICD-10, Documentation-Education, Preparation and You-Demo
 
Coursera predmachlearn 2016
Coursera predmachlearn 2016Coursera predmachlearn 2016
Coursera predmachlearn 2016
 
Fraser island
Fraser islandFraser island
Fraser island
 
Pons lumbierres eizaguirre albajes 2006 bsvp
Pons lumbierres eizaguirre albajes 2006 bsvpPons lumbierres eizaguirre albajes 2006 bsvp
Pons lumbierres eizaguirre albajes 2006 bsvp
 
Sarau literário-2016
Sarau literário-2016Sarau literário-2016
Sarau literário-2016
 
Schneider Portfolio 2005 2007
Schneider Portfolio 2005   2007Schneider Portfolio 2005   2007
Schneider Portfolio 2005 2007
 
Pons &amp; lumbierres 2010 bsvp
Pons &amp; lumbierres 2010 bsvpPons &amp; lumbierres 2010 bsvp
Pons &amp; lumbierres 2010 bsvp
 
Impression
ImpressionImpression
Impression
 
Informacion del programa de mercadeo universidad de los llanos
Informacion del  programa de mercadeo universidad de los llanosInformacion del  programa de mercadeo universidad de los llanos
Informacion del programa de mercadeo universidad de los llanos
 
Enfermería
EnfermeríaEnfermería
Enfermería
 
Ozon kasimov ekbpromo_kazan
Ozon kasimov ekbpromo_kazanOzon kasimov ekbpromo_kazan
Ozon kasimov ekbpromo_kazan
 
Разработка моделей компетенций
Разработка моделей компетенцийРазработка моделей компетенций
Разработка моделей компетенций
 
cervical cancer
 cervical cancer cervical cancer
cervical cancer
 
Catalogo AZENKA - sandraluz
Catalogo AZENKA - sandraluzCatalogo AZENKA - sandraluz
Catalogo AZENKA - sandraluz
 
гігієна харчування
гігієна харчуваннягігієна харчування
гігієна харчування
 
Alcohol y odontologia
Alcohol y odontologiaAlcohol y odontologia
Alcohol y odontologia
 
La UX en el Diseño Periodístico
La UX en el Diseño PeriodísticoLa UX en el Diseño Periodístico
La UX en el Diseño Periodístico
 

Ähnlich wie SQLBits Module 2 RStats Introduction to R and Statistics

Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...Data Con LA
 
Introduction to basic statistics
Introduction to basic statisticsIntroduction to basic statistics
Introduction to basic statisticsIBM
 
Data Science for Fundraising: Build Data-Driven Solutions Using R - Rodger De...
Data Science for Fundraising: Build Data-Driven Solutions Using R - Rodger De...Data Science for Fundraising: Build Data-Driven Solutions Using R - Rodger De...
Data Science for Fundraising: Build Data-Driven Solutions Using R - Rodger De...Rodger Devine
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionRevolution Analytics
 
microsoft r server for distributed computing
microsoft r server for distributed computingmicrosoft r server for distributed computing
microsoft r server for distributed computingBAINIDA
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxsumitkumar600840
 
An R primer for SQL folks
An R primer for SQL folksAn R primer for SQL folks
An R primer for SQL folksThomas Hütter
 
Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16Andy Lathrop
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners Jen Stirrup
 
Training in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsTraining in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsAjay Ohri
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dmsumit621
 
20150814 Wrangling Data From Raw to Tidy vs
20150814 Wrangling Data From Raw to Tidy vs20150814 Wrangling Data From Raw to Tidy vs
20150814 Wrangling Data From Raw to Tidy vsIan Feller
 
MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra QUONTRASOLUTIONS
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL ServerStéphane Fréchette
 
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...DataStax
 

Ähnlich wie SQLBits Module 2 RStats Introduction to R and Statistics (20)

Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
 
Introduction to basic statistics
Introduction to basic statisticsIntroduction to basic statistics
Introduction to basic statistics
 
Data Science for Fundraising: Build Data-Driven Solutions Using R - Rodger De...
Data Science for Fundraising: Build Data-Driven Solutions Using R - Rodger De...Data Science for Fundraising: Build Data-Driven Solutions Using R - Rodger De...
Data Science for Fundraising: Build Data-Driven Solutions Using R - Rodger De...
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and Revolution
 
microsoft r server for distributed computing
microsoft r server for distributed computingmicrosoft r server for distributed computing
microsoft r server for distributed computing
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptx
 
Using the LEADing Data Reference Content
Using the LEADing Data Reference ContentUsing the LEADing Data Reference Content
Using the LEADing Data Reference Content
 
An R primer for SQL folks
An R primer for SQL folksAn R primer for SQL folks
An R primer for SQL folks
 
Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners
 
Training in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsTraining in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media Analytics
 
Data Mining
Data MiningData Mining
Data Mining
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dm
 
20150814 Wrangling Data From Raw to Tidy vs
20150814 Wrangling Data From Raw to Tidy vs20150814 Wrangling Data From Raw to Tidy vs
20150814 Wrangling Data From Raw to Tidy vs
 
MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL Server
 
EDA
EDAEDA
EDA
 
R- Introduction
R- IntroductionR- Introduction
R- Introduction
 
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
 
EDA.pptx
EDA.pptxEDA.pptx
EDA.pptx
 

SQLBits Module 2 RStats Introduction to R and Statistics

  • 1. The Data Analyst’s Toolkit Introduction to R Jen Stirrup | Data Relish Ltd| June, 2014 Jen.Stirrup@datarelish.com
  • 2. Note • This presentation was part of a full-day workshop on Power BI and R, held at SQLBits in 2014 • This is a sample, provided to help you see if my one-day Business Intelligence Masterclass is the right course for you. • http://bit.ly/BusinessIntelligence2016Masterclass • In that course, you’ll be given updated notes along with a hands-on session, so why not join me?
  • 3. Course Outline • Module 1: Setting up your data for R with Power Query • Module 2: Introducing R • Module 3: The Big Picture: Putting Power BI and R together • Module 4: Visualising your data with Power View and Excel 2013 • Module 5: Power Map • Module 6: Wrap up and Q and A
  • 4. What is R? • R is a powerful environment for statistical computing • It is an overgrown calculator • … which lets you save results in variables: x <- 3; y <- 5; z = 4; x + y + z
  • 5. Vectors in R • To create a vector of elements, use the c() function: v = c("hello","world","welcome","to","the class.") • To create a sequence and index into it: v = seq(1,100); v[1]; v[1:10] • Subscripting: in R, the square-bracket operator lets you extract values • Insert logical expressions in the square brackets to retrieve subsets of data from a vector or list. For example:
  • 6. Vectors in R • v = seq(1,100); logi = v>95; logi; v[logi]; v[v<6]; v[105]=105; v[is.na(v)]
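
The subsetting steps on the slide above can be run as a small script. This is a minimal sketch; the names `top`, `low` and `gaps` are introduced here for illustration and are not in the original slides.

```r
# Logical subsetting on a vector, following the slide's steps.
v <- seq(1, 100)

logi <- v > 95          # logical vector, TRUE where the condition holds
top  <- v[logi]         # elements greater than 95
low  <- v[v < 6]        # the condition can go straight inside the brackets

v[105] <- 105           # assigning past the end pads positions 101..104 with NA
gaps <- which(is.na(v)) # positions of the missing values

print(top)
print(low)
print(gaps)
```

Assigning to `v[105]` is what creates the missing values that `v[is.na(v)]` then picks out.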
  • 7. Save and Load RData Data is saved in R as .Rdata files Imported back again with load a <- 1:10 save(a, file = "E:/MyData.Rdata") rm(a) load("E:/MyData.Rdata") print(a)
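
The same save and load round trip can be tried without an `E:` drive. This sketch assumes a temporary file instead of the slide's path; `path` and `restored` are illustrative names.

```r
# Save an object to an .RData file, remove it, then load it back.
path <- tempfile(fileext = ".RData")  # assumption: temp file instead of E:/MyData.Rdata

a <- 1:10
save(a, file = path)
rm(a)

restored <- load(path)  # load() returns the names of the objects it restored
print(restored)
print(a)

unlink(path)            # tidy up the temporary file
```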
  • 8. Import From CSV Files • A simple way to load in data is to read in a CSV. • read.csv() • MyDataFrame <- read.csv("filepath.csv") • print(MyDataFrame)
  • 9. Import From CSV Files • Go to Tools in RStudio, and select Import Dataset. • Select the file CountryCodes.csv and select the Import button. • In RStudio, you will now see the data in the data pane.
  • 10. Import From CSV Files • The console window will show the following: • > #import dataset • > CountryCodes <- read.csv("C:/Program Files/R/R-3.1.0/Working Directory/CountryCodes.csv", header=F) • > View(CountryCodes) • Once the data is imported, we can check the data: dim(CountryCodes); head(CountryCodes); tail(CountryCodes)
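
To try the import and the checks without the workshop file, the sketch below writes a tiny CSV first. The three rows and the names `csv_path` and `codes` are invented for illustration and are not the contents of CountryCodes.csv.

```r
# Write a small header-less CSV, then read it back and inspect it.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("GB,United Kingdom", "FR,France", "DE,Germany"), csv_path)

# header = FALSE means the first row is data; columns are named V1, V2, ...
codes <- read.csv(csv_path, header = FALSE, stringsAsFactors = FALSE)

print(dim(codes))      # number of rows and columns
print(head(codes, 2))  # first rows
print(tail(codes, 1))  # last row

unlink(csv_path)
```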
  • 11. Import / Export via ODBC • The package RODBC provides R with a connection to ODBC databases • library(RODBC) • myodbcConnect <- odbcConnect(dsn="servername",uid="userid",pwd="******")
  • 12. Import / Export via ODBC • myQuery <- "SELECT * FROM lib.table WHERE ..." • # or read query from file: myQuery <- readChar("E:/MyQueries/myQuery.sql", nchars=99999) • myData <- sqlQuery(myodbcConnect, myQuery, errors=TRUE) • odbcCloseAll()
  • 13. Import/Export from Excel Files • RODBC also works for importing data from Excel files • library(RODBC) • filename <- "E:/Rtmp/dummmyData.xls" • myxlsFile <- odbcConnectExcel(filename, readOnly = FALSE) • sqlSave(myxlsFile, a, rownames = FALSE) • b <- sqlFetch(myxlsFile, "a") • odbcCloseAll()
  • 14. Anscombe’s Quartet • Property: Value • Mean of X: 9 • Variance of X: 11 • Mean of Y: 7.50 • Variance of Y: 4.1 • Correlation: 0.816 • Linear Regression: Y = 3.00 + 0.5X
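
The summary values above can be checked against the `anscombe` data set that ships with R, which holds the four x/y pairs as columns `x1`..`x4` and `y1`..`y4`. This is a sketch; the `stats` data frame and its column names are introduced here for illustration.

```r
# Summary statistics for each of the four Anscombe data sets.
xs <- anscombe[, paste0("x", 1:4)]
ys <- anscombe[, paste0("y", 1:4)]

stats <- data.frame(
  mean_x    = sapply(xs, mean),
  var_x     = sapply(xs, var),
  mean_y    = sapply(ys, mean),
  var_y     = sapply(ys, var),
  cor_xy    = mapply(cor, xs, ys),
  intercept = mapply(function(x, y) coef(lm(y ~ x))[[1]], xs, ys),
  slope     = mapply(function(x, y) coef(lm(y ~ x))[[2]], xs, ys),
  row.names = paste0("set", 1:4)
)

print(round(stats, 3))
```

The four rows come out almost identical, which is why the data needs to be plotted before the summary is trusted.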
  • 15. What does Anscombe’s Quartet look like?
  • 17. So, is it correct?
  • 18. Correlation r = 0.96 • Year: 2000; 2001; 2002; 2003; 2004; 2005; 2006; 2007; 2008; 2009 • Number of people who died by becoming tangled in their bedsheets, deaths (US) (CDC): 327; 456; 509; 497; 596; 573; 661; 741; 809; 717 • Total revenue generated by skiing facilities (US), dollars in millions (US Census): 1,551; 1,635; 1,801; 1,827; 1,956; 1,989; 2,178; 2,257; 2,476; 2,438
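
The correlation can be recomputed from the two series on the slide. The vector names `deaths` and `revenue` are chosen here for illustration.

```r
# Pearson correlation between two unrelated yearly series (2000 to 2009).
deaths  <- c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717)            # bedsheet deaths (US)
revenue <- c(1551, 1635, 1801, 1827, 1956, 1989, 2178, 2257, 2476, 2438)  # skiing revenue, $M

r <- cor(deaths, revenue)
print(r)
```

A value this close to 1 says nothing about cause. Both series simply trend upwards over the same decade.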
  • 19. R and Power BI together • Pivot Tables are not always enough • Scaling Data (ScaleR) • R is very good at static data visualisation • Upworthy
  • 20. Why R? • Most widely used data analysis software - used by 2M+ data scientists, statisticians and analysts • Most powerful statistical programming language • Flexible, extensible and comprehensive for productivity • Create beautiful and unique data visualisations - as seen in New York Times, Twitter and Flowing Data • Thriving open-source community - leading edge of analytics research • Fills the talent gap - new graduates prefer R.
  • 21. Growth in Demand • Rexer Data Mining survey, 2007 - 2013 • R is the highest paid IT skill - Dice.com, Jan 2014 • R is the most-used data science language after SQL - O'Reilly, Jan 2014 • R is used by 70% of data miners - Rexer, Sept 2013
  • 22. Growth in Demand • R is #15 of all programming languages - RedMonk, Jan 2014 • R is growing faster than any other data science language - KDNuggets • R is in-memory and limited in the size of data that you can process.
  • 23. What are we testing? • We have one or two samples and a hypothesis, which may be true or false. • The null hypothesis – nothing happened. • The alternative hypothesis – something did happen.
  • 24. Strategy • We set out to prove that something did happen. • We look at the distribution of the data. • We choose a test statistic • We look at the p value
  • 25. How small is too small? • How do we know when the p-value is small? • p >= 0.05 – we fail to reject the null hypothesis • p < 0.05 – we reject the null hypothesis in favour of the alternative • It depends • For high-risk decisions, perhaps we want 0.01 or even 0.001.
  • 26. Confidence Intervals • Basically, how confident are you that you can extrapolate from your little data set to the larger population? • We can look at the mean • To do this, we run a t-test • t.test(vector)
  • 27. Confidence Intervals • Basically, how confident are you that you can extrapolate from your little data set to the larger population? • We can look at the median • To do this, we run a Wilcoxon test • wilcox.test(vector, conf.int=TRUE)
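
Both confidence intervals can be pulled out of the test objects directly. The sketch below uses simulated data, so the sample `x` and its parameters are assumptions made purely for illustration.

```r
# Confidence intervals for the mean (t-test) and the median (Wilcoxon test).
set.seed(2014)
x <- rnorm(50, mean = 10, sd = 2)  # assumption: a simulated sample

tt <- t.test(x)                    # one-sample t-test, 95% level by default
ci_mean <- tt$conf.int

wt <- wilcox.test(x, conf.int = TRUE)  # conf.int = TRUE is needed to get an interval
ci_median <- wt$conf.int

print(ci_mean)
print(ci_median)
print(wt$estimate)                 # the (pseudo)median estimate
```

`wilcox.test` only reports an interval when `conf.int = TRUE` is passed, which is easy to miss.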
  • 28. Calculate the relative frequency • How much is above, or below the mean? • mean(after > before) • mean(abs(x - mean(x)) < 2 * sd(x)) • This gives you the fraction of data that lies within two standard deviations of the mean.
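
Taking the mean of a logical vector gives the fraction of TRUE values, which is what makes both expressions work. The small vectors below are invented for illustration.

```r
# Relative frequency: the mean of a logical vector is the share of TRUE values.
before <- c(10, 12, 9, 11, 13)   # made-up measurements
after  <- c(12, 11, 10, 14, 15)
improved <- mean(after > before) # fraction of cases where 'after' is larger

x <- c(1:9, 100)                 # made-up data with one extreme value
within_2sd <- mean(abs(x - mean(x)) < 2 * sd(x))

print(improved)
print(within_2sd)
```

Here only the value 100 lies more than two standard deviations from the mean, so nine of the ten points count as within range.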
  • 29. Testing Categorical Variables for Independence • Chi-squared test – are two variables independent? Are they connected in some way? • Summarise the data first: summary(table(initial, outcome)) • chisq.test
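
A minimal sketch of the independence test, assuming two made-up categorical vectors. The values of `initial` and `outcome` are invented here; only the variable names come from the slide.

```r
# Chi-squared test of independence on a 2 x 2 contingency table.
initial <- rep(c("treated", "control"), each = 40)    # hypothetical groups
outcome <- c(rep("better", 30), rep("same", 10),      # treated
             rep("better", 15), rep("same", 25))      # control

tab <- table(initial, outcome)
print(tab)
print(summary(tab))      # summary() of a table also reports a chi-squared test

res <- chisq.test(tab)   # applies Yates' continuity correction for 2 x 2 tables
print(res$p.value)
```

A small p-value suggests that the outcome is not independent of the group.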
  • 30. How Statistics answers your question • Is our model significant or insignificant? – The F Statistic • What is the quality of the model? – R2 statistic • How well do the data points fit the model? – R2 statistic
  • 31. What do the values mean together? • Columns: the type of analysis; test statistic; how can you tell if it is significant?; what is the assumption you can make? • Regression analysis; F; big F, small p < 0.05; a general relationship between the predictors and the response • Regression analysis; t; big t (> +2.0 or < -2.0), small p < 0.05; X is an important predictor • Difference of means; t (two tailed); big t (> +2.0 or < -2.0), small p < 0.05; significant difference of means • Difference of means; t (one tailed); big t (> +2.0 or < -2.0), small p < 0.05; significant difference of means
  • 32. What is Regression? • Using predictors to predict a response • Using independent variables to predict a dependent variable • Example: Credit score is a response, predicted by spend, income, location and so on.
  • 33. Linear Regression using World Bank data • We can look at predicting using World Bank data • Year <- GDP <- (wdiData, ) • plot(wdiData, • cor(year, wdiData) • fit <- lm(cpi ~ year+quarter) • fit
  • 34. Examples of Data Mining in R • cpi2011 <- fit$coefficients[[1]] + fit$coefficients[[2]]*2011 + fit$coefficients[[3]]*(1:4) • attributes(fit) • fit$coefficients • residuals(fit) – difference between observed and fitted values • summary(fit) • plot(fit)
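
The fitting and hand-rolled prediction steps can be sketched end to end. The slides do not include the data, so the twelve quarterly `cpi` values below are illustrative numbers, and `data2011` and `cpi2011_pred` are names assumed for the comparison.

```r
# Fit cpi ~ year + quarter, then predict the four quarters of 2011 two ways.
year    <- rep(2008:2010, each = 4)
quarter <- rep(1:4, times = 3)
cpi <- c(162.2, 164.6, 166.5, 166.0,   # illustrative values only
         166.2, 167.0, 168.6, 169.5,
         171.0, 172.1, 173.3, 174.0)

fit <- lm(cpi ~ year + quarter)

# Manual prediction from the coefficients: intercept + b_year*2011 + b_quarter*(1:4)
cpi2011 <- fit$coefficients[[1]] +
  fit$coefficients[[2]] * 2011 +
  fit$coefficients[[3]] * (1:4)

# The same prediction through predict()
data2011 <- data.frame(year = 2011, quarter = 1:4)
cpi2011_pred <- predict(fit, newdata = data2011)

print(fit$coefficients)
print(cpi2011)
print(residuals(fit))   # observed minus fitted values
```

The manual formula and `predict()` agree. `predict()` is less error-prone once the model has more terms.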
  • 35. What is Data Mining • Machine Learning • Statistics • Software Engineering and Programming with Data • Intuition • Fun!
  • 36. The Why of Data Mining • to discover new knowledge • to improve business outcomes • to deliver better customised services
  • 37. Examples of Data Mining in R • Logistic Regression (glm) • Decision Trees (rpart, wsrpart) • Random Forests (randomForest, wsrf) • Boosted Stumps (ada) • Neural Networks (nnet) • Support Vector Machines (kernlab)
  • 38. Examples of Data Mining in R • Packages: – fpc – cluster – pvclust – mclust • Partitioning-based clustering: kmeans, pam, pamk, clara • Hierarchical clustering: hclust, pvclust, agnes, diana • Model-based clustering: mclust • Density-based clustering: dbscan • Plotting cluster solutions: plotcluster, plot.hclust • Validating cluster solutions: cluster.stats
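
Two of the functions listed above, `kmeans` and `hclust`, ship with base R and can be tried without installing anything. This sketch assumes the built-in `iris` data and three clusters, neither of which comes from the slides.

```r
# Partitioning-based and hierarchical clustering with base R only.
set.seed(42)
features <- scale(iris[, 1:4])       # assumption: iris measurements, standardised

km <- kmeans(features, centers = 3, nstart = 10)   # partitioning-based
hc <- hclust(dist(features))                       # hierarchical (complete linkage)
hc_groups <- cutree(hc, k = 3)                     # cut the tree into 3 groups

print(table(km$cluster, iris$Species))  # compare clusters with the known species
print(table(hc_groups))
```

Standardising first keeps any one measurement from dominating the distance calculation.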
  • 39. How can we make it easier? • AzureML
  • 40. The Data Mining Process • Load data • Choose your variables • Sample the data into test and training sets (usually 70/30 split) • Explore the distributions of the data • Test some distributions • Transform the data if required • Build clusters with the data • Build a model • Evaluate the model • Log the data process for auditing externally
  • 41. Loading the Data • dsname is the name of our dataset • ds <- get(dsname) • dim(ds) • names(ds)
  • 42. Explore the data • head(dataset) • tail(dataset)
  • 43. Explore the data’s structure • str(dataset) • summary(dataset)
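
The loading and exploring steps can be run against any data frame. The slides do not name the data set at this point, so the sketch below assumes the built-in `airquality` data as a stand-in.

```r
# Load a data set by name, then look at its size, columns and contents.
dsname <- "airquality"   # assumption: a built-in stand-in data set
ds <- get(dsname)        # look up the object whose name is stored in dsname

print(dim(ds))           # rows and columns
print(names(ds))         # column names
print(head(ds))          # first six rows
print(tail(ds))          # last six rows
str(ds)                  # column types at a glance
print(summary(ds))       # per-column summaries, including NA counts
```

Keeping the name in `dsname` means the rest of a script can be reused on another data set by changing one line.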
  • 44. Pick out the Variables • id <- c("Date", "Location") • target <- "RainTomorrow" • risk <- "RISK_MM" • (ignore <- union(id, risk)) • (vars <- setdiff(names(ds), ignore))
  • 45. Remove Missing Data • dim(ds) ## [1] 366 24 • sum(is.na(ds[vars])) ## [1] 47 • ds <- ds[-attr(na.omit(ds[vars]), "na.action"),] • dim(ds) ## [1] 328 24 • sum(is.na(ds[vars])) ## [1] 0
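
Variable selection and missing-data removal can be sketched on a built-in data set. The column roles below are hypothetical, and `complete.cases()` is swapped in for the slide's `na.action` idiom as a simpler way to keep only the complete rows.

```r
# Choose modelling variables, then drop rows with missing values in them.
ds <- airquality            # assumption: stand-in for the workshop data set

id     <- c("Month", "Day") # hypothetical identifier columns to ignore
target <- "Ozone"           # hypothetical target variable
ignore <- id
vars   <- setdiff(names(ds), ignore)

n_missing <- sum(is.na(ds[vars]))          # missing cells before cleaning
clean <- ds[complete.cases(ds[vars]), ]    # keep rows with no NA in vars

print(vars)
print(n_missing)
print(dim(clean))
print(sum(is.na(clean[vars])))             # 0 after cleaning
```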
  • 46. Clean Data Target as Categorical Data • summary(ds[target]) • ## RainTomorrow ## Min. :0.000 ## 1st Qu.:0.000 ## Median :0.000 ## Mean :0.183 ## 3rd Qu.:0.000 ## Max. :1.000 • .... • ds[target] <- as.factor(ds[[target]]) • levels(ds[[target]]) <- c("No", "Yes") • summary(ds[target])
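
Converting a 0/1 target to a labelled factor can be checked on a toy data frame. The five values below are invented for illustration. Note the double brackets in `levels(ds[[target]])`, which address the column itself rather than a one-column data frame.

```r
# Turn a numeric 0/1 target into a factor with readable levels.
ds <- data.frame(RainTomorrow = c(0, 0, 1, 0, 1))  # made-up values
target <- "RainTomorrow"

ds[target] <- as.factor(ds[[target]])    # levels are now "0" and "1"
levels(ds[[target]]) <- c("No", "Yes")   # relabel in level order

print(summary(ds[target]))               # now shows counts per level
```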
  • 47. Model Preparation • (form <- formula(paste(target, "~ ."))) ## RainTomorrow ~ . • (nobs <- nrow(ds)) ## [1] 328 • train <- sample(nobs, 0.70*nobs) • length(train) ## [1] 229 • test <- setdiff(1:nobs, train) • length(test) ## [1] 99
  • 48. Random Forest • library(randomForest) • model <- randomForest(form, ds[train, vars], na.action=na.omit) • model • ## Call: • ## randomForest(formula=form, data=ds[train, vars], ... • ## Type of random forest: classification • ## Number of trees: 500 • ## No. of variables tried at each split: 4 • ....
  • 49. Evaluate the Model – Risk Chart • pr <- predict(model, ds[test,], type="prob")[,2] • riskchart(pr, ds[test, target], ds[test, risk], title="Random Forest - Risk Chart", risk=risk, recall=target, thresholds=c(0.35, 0.15))
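
The 70/30 split can be reproduced on any data frame. This sketch assumes the built-in `iris` data and `Species` as the target, and uses `floor()` to make the training size an explicit whole number.

```r
# Build a model formula and split row indices 70/30 into train and test.
set.seed(123)               # make the random split repeatable
ds <- iris                  # assumption: stand-in data set
target <- "Species"         # assumption: stand-in target

form <- formula(paste(target, "~ ."))   # target against all other columns
nobs <- nrow(ds)

train <- sample(nobs, floor(0.70 * nobs))  # row numbers for training
test  <- setdiff(seq_len(nobs), train)     # everything else is for testing

print(form)
print(length(train))
print(length(test))
```

Setting a seed before `sample()` is what lets the same split be recreated later for auditing.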
  • 50. Linear Regression • X: predictor variable • Y: response variable • lm(y ~ x, data = dataframe)
  • 51. Multiple Linear Regression • lm is used again • lm(y ~ x + u + v, data = dataframe) • It is better to keep the data in one data frame because it is easier to manage.
  • 52. Getting Regression Statistics • Save the model to a variable: • m <- lm(y ~ x + u + v) • Then use regression statistics to get the values that you need from m.
  • 53. Getting Regression Statistics • anova(m) • coefficients(m) / coef(m) • confint(m) • effects(m) • fitted(m) • residuals(m)
  • 54. Getting regression statistics • The most important one is summary(m). It shows: • Estimated coefficients • Critical statistics such as R2 and the F statistic • The output is hard to read so we will write it out to Excel.
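
The regression statistics named above can be extracted from a fitted model as follows. The data is simulated, so the true coefficients (1, 2, -0.5, 0) and the sample size are assumptions made for illustration.

```r
# Fit a multiple regression on simulated data and pull out the key statistics.
set.seed(1)
n <- 100
dataframe <- data.frame(x = rnorm(n), u = rnorm(n), v = rnorm(n))
# assumption: y depends on x and u but not on v
dataframe$y <- 1 + 2 * dataframe$x - 0.5 * dataframe$u + rnorm(n, sd = 0.5)

m <- lm(y ~ x + u + v, data = dataframe)
s <- summary(m)

print(coef(m))         # estimated coefficients
print(confint(m))      # 95% confidence intervals for the coefficients
print(anova(m))        # analysis of variance table
print(s$r.squared)     # R squared
print(s$fstatistic)    # F statistic with its degrees of freedom

# p-value for the overall F test, computed from the stored statistic
f_p <- pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)
print(unname(f_p))
```

Because `v` plays no part in generating `y`, its confidence interval should usually straddle zero while those for `x` and `u` should not.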
  • 55. Understanding the Regression Summary • The model summary gives you the most important regression statistics, such as the residuals, coefficients and the significance codes. • The most important one is the F statistic. • You can check whether the residuals follow a normal distribution. How can you tell this?
  • 56. Understanding the Regression Summary • The sign of the median residual is important, e.g. a negative median points to a skew to the left. • The quartiles will also help. Ideally Q1 and Q3 should have the same magnitude. If not, a skew has developed. This could be inconsistent with the median result. • It helps us to identify outliers.
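
The residual checks described above can be done numerically. This sketch uses simulated data with normal noise, and adds a Shapiro-Wilk test, which is not mentioned on the slides, as one possible normality check.

```r
# Inspect the residual quartiles and test the residuals for normality.
set.seed(7)
x <- runif(200)
y <- 3 + 2 * x + rnorm(200, sd = 0.5)   # assumption: simulated data, normal noise

m <- lm(y ~ x)
res <- residuals(m)

q <- quantile(res)        # Min, 1Q, Median, 3Q, Max, as printed by summary(m)
sw <- shapiro.test(res)   # Shapiro-Wilk normality test (swapped-in check)

print(q)
print(sw$p.value)
# qqnorm(res); qqline(res) would draw a normal Q-Q plot for a visual check
```

With well-behaved residuals the median sits near zero and the first and third quartiles have similar magnitudes.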
  • 57. Coefficients and R • The Estimate column contains estimated regression coefficients, calculated using the least squares method. This is the most common method. • These are only estimates. How likely is it that the true coefficients are zero? This is the purpose of the t value and Pr(>|t|) columns.
  • 58. Coefficients and R • The p-value is the probability of seeing an estimate at least this extreme if the true coefficient were zero. The lower, the better. We can look at the Signif. codes legend to help us to identify the most appropriate level of p value.
  • 59. Coefficients and R • R² is the coefficient of determination. How successful is the model? We look at this value. Bigger is better. It is the fraction of the variance of y that is explained by the regression model. The remaining variance is not explained by the model. The adjusted value takes into account the number of variables in the model.
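
The coefficient table and R² can be read off the summary object, and R² can be recomputed by hand as one minus the ratio of residual to total sum of squares. The sketch assumes the built-in `cars` data; the names `ss_res`, `ss_tot` and `r2_manual` are illustrative.

```r
# Read the coefficient table and verify R squared by hand.
m <- lm(dist ~ speed, data = cars)   # assumption: built-in cars data
s <- summary(m)

coef_table <- coef(s)   # columns: Estimate, Std. Error, t value, Pr(>|t|)
print(coef_table)

ss_res <- sum(residuals(m)^2)                   # variation left unexplained
ss_tot <- sum((cars$dist - mean(cars$dist))^2)  # total variation in y
r2_manual <- 1 - ss_res / ss_tot

print(r2_manual)
print(s$r.squared)       # matches the manual value
print(s$adj.r.squared)   # penalised for the number of predictors
```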
  • 60. First Impressions • Plotting the model can help you to investigate it further. • library(car) • m <- lm(y ~ x) • outlierTest(m) • plot(m, which=1)
  • 61. First Impressions? • How do you go about it? • Check the plot first; how does it look?
  • 62. The F Statistic • Is the model significant or insignificant? This is the purpose of the F statistic. • Check the F statistic first because if it is not significant, then the model doesn’t matter.
  • 63. Significance Stars • The stars are shorthand for significance levels, with the number of asterisks displayed according to the p-value computed: *** for high significance and * for low significance. In this case, *** indicates that it's unlikely that no relationship exists between heights of parents and heights of their children.
  • 64. Plot the Predicted Values • data2011 <- data.frame(year=2011, quarter=1:4) • cpi2011 <- predict(fit, newdata=data2011) • style <- c(rep(1,12), rep(2,4)) • plot(c(cpi, cpi2011), xaxt="n", ylab="CPI", xlab="", pch=style, col=style) • axis(1, at=1:16, las=3, labels=c(paste(year,quarter,sep="Q"), "2011Q1", "2011Q2", "2011Q3", "2011Q4"))
  • 65. How to get Help • example(rnorm) • Rseek.org
  • 66. Resources • Introductory Statistics with R by Peter Dalgaard. Good for beginners. • The Art of R Programming • http://www.r-project.org • CRAN sites – Comprehensive R Archive Network