SlideShare ist ein Scribd-Unternehmen logo
1 von 92
Downloaden Sie, um offline zu lesen
An introduction to R 
Version 1.4 
November 2014 
h=p://datasciencebusiness.wordpress.com 
Prof. 
Richard 
Vidgen 
Management 
Systems 
Hull 
University 
Business 
School 
E: 
r.vidgen@hull.ac.uk
Aims 
and 
use 
of 
this 
presentaDon 
This 
presentaDon 
provides 
an 
introducDon 
to 
R 
for 
anyone 
who 
wants 
to 
get 
an 
idea 
of 
what 
R 
is 
and 
what 
it 
can 
do. 
Although 
no 
prior 
experience 
of 
R 
is 
needed, 
a 
basic 
understanding 
of 
data 
analysis 
(e.g., 
mulDple 
regression) 
is 
assumed, 
as 
is 
a 
basic 
technical 
competence 
(e.g., 
installing 
soKware 
and 
managing 
directory 
structures). 
If 
you 
are 
already 
using 
SPSS 
you 
will 
get 
a 
feel 
for 
how 
it 
compares 
with 
R. 
It’s 
a 
work 
in 
progress 
and 
will 
be 
updated 
based 
on 
experience 
and 
feedback. 
Docendo 
discimus, 
(LaDn 
"by 
teaching, 
we 
learn").
Contents 
• PredicDve 
analyDcs 
and 
analyDcs 
tools 
• An 
overview 
of 
R 
• Installing 
R 
and 
an 
R 
IDE 
(integrated 
development 
environment) 
• R 
syntax 
and 
data 
types 
• MulDple 
regression 
in 
R 
and 
SPSS 
• The 
Twi=eR 
package 
• The 
Rfacebook 
package
Resources 
• The 
R 
source 
and 
data 
files 
for 
this 
tutorial 
can 
be 
accessed 
at: 
– h=p://datasciencebusiness.wordpress.com 
• R_intro.R 
• R_regression.R 
• R_twi=er.R 
• R_twi=er_senDment.R 
• senDment.R 
• R_facebook.R 
• insurance.csv 
• twi=erHull.csv 
• posiDve-­‐words.txt 
• negaDve-­‐words.txt 
WARNING: 
when 
R 
packages 
are 
updated 
by 
developers 
things 
can 
break 
and 
your 
R 
programs 
may 
stop 
working. 
The 
code 
in 
the 
files 
above 
is 
tested 
regularly 
with 
the 
latest 
packages 
and 
updated 
as 
necessary
Predictive analytics
Be=er 
decisions 
-­‐ 
predicDve 
analyDcs 
• A 
predicDve 
model 
that 
calculates 
strawberry 
purchases 
based 
on: 
– Weather 
forecast 
– Store 
temperature 
– Freezer 
sensor 
data 
– Remaining 
stock 
per 
shelf 
life 
– Sales 
transacDon 
point 
of 
sale 
feeds 
– Web 
searches, 
social 
menDons 
h=p://www.slideshare.net/datasciencelondon/big-­‐data-­‐sorry-­‐data-­‐science-­‐what-­‐does-­‐a-­‐data-­‐scienDst-­‐do
PredicDve 
analyDcs 
• For 
example, 
what 
data 
might 
help 
us 
predict 
which 
students 
will 
drop 
out? 
– Assessment 
grades 
at 
University 
– Prior 
educaDon 
a=ainment 
– Social 
background 
– Distance 
of 
home 
from 
University 
– Friendship 
circles 
and 
networks 
(e.g., 
sports 
club 
memberships) 
– A=endance 
at 
lectures 
and 
tutorials 
– InteracDon 
in 
lectures 
and 
tutorials 
– Time 
spent 
on 
campus 
– Time 
spent 
in 
library 
– Number 
of 
accesses 
to 
electronic 
learning 
resources 
– Text 
books 
purchased 
– Engagement 
in 
subject-­‐related 
forums 
– SenDment 
of 
social 
media 
posts 
– Etc.
h=p://www.slideshare.net/datasciencelondon/big-­‐data-­‐sorry-­‐data-­‐science-­‐what-­‐does-­‐a-­‐data-­‐scienDst-­‐do
Some 
of 
the 
techniques 
data 
scienDsts 
use 
• ClassificaDon 
• Clustering 
• AssociaDon 
rules 
• Decision 
trees 
• Regression 
• GeneDc 
algorithms 
• Neural 
networks 
and 
support 
vector 
machines 
• Machine 
learning 
• Natural 
language 
processing 
• SenDment 
analysis 
• ArDficial 
intelligence 
• Time 
series 
analysis 
• SimulaDons 
• Social 
network 
analysis
Technologies 
for 
data 
analysis: 
usage 
rates 
King, 
J., 
& 
R. 
Magoulas 
(2013). 
Data 
Science 
Salary 
Survey. 
O’Reilly 
Media. 
R 
and 
Python 
programming 
languages 
come 
above 
Excel 
Enterprise 
products 
bo=om 
of 
the 
heap
Data 
scienDst 
as 
“bricoleur” 
“In 
the 
pracDcal 
arts 
and 
the 
fine 
arts, 
bricolage 
(French 
for 
"Dnkering") 
is 
the 
construcDon 
or 
creaDon 
of 
a 
work 
from 
a 
diverse 
range 
of 
things 
that 
happen 
to 
be 
available, 
or 
a 
work 
created 
by 
such 
a 
process.” 
Wikipedia
The R environment
What 
is 
R? 
R 
is 
an 
open 
source 
computer 
language 
used 
for 
data 
manipulaDon, 
staDsDcs, 
and 
graphics.
History 
of 
R 
• 1976 
– 
Bell 
Labs 
develops 
S, 
a 
language 
for 
data 
analysis; 
released 
commercially 
as 
S-­‐plus 
• 1990s 
– 
R 
wri=en 
and 
released 
as 
open 
source 
by 
(R)oss 
Ihaka 
and 
(R)obert 
Gentleman 
• 1997 
– 
The 
Comprehensive 
R 
Archive 
Network 
(CRAN) 
launched 
• August 
2014 
– 
CRAN 
repository 
contains 
5789 
user-­‐contributed 
packages
Benefits 
of 
R 
• It’s 
free! 
• Runs 
on 
mulDple 
plaqorms 
(Windows, 
Unix, 
MacOS) 
• ValidaDon/replicaDon 
of 
analyses 
(assumes 
commented 
code 
and 
documentaDon) 
• Long 
term 
efficiency 
(using 
the 
same 
code 
for 
mulDple 
projects)
SPSS* 
vs 
R 
SPSS 
• Limited 
ability 
for 
data 
scienDst 
to 
change 
the 
environment 
• Data 
scienDst 
relies 
on 
algorithms 
developed 
by 
SPSS 
• Problem-­‐solving 
constrained 
by 
SPSS 
developers 
• Must 
pay 
for 
using 
the 
constrained 
algorithms 
R 
• Can 
use 
funcDons 
made 
by 
a 
global 
community 
of 
staDsDcs 
researchers 
or 
create 
their 
own 
• Almost 
unlimited 
in 
their 
ability 
to 
change 
their 
environment 
• Can 
do 
things 
SPSS 
users 
cannot 
even 
dream 
of 
• Get 
all 
this 
for 
free 
*or 
any 
other 
proprietary 
closed 
soKware 
system
h=p://www.r-­‐project.org 
Install 
R 
from 
here
The 
R 
console
2. 
Output 
appears 
here 
1. 
Type 
in 
commands, 
select 
the 
text 
and 
run 
with 
(cmd 
+ 
return) 
or 
menu 
opDon: 
edit 
| 
execute
R 
integrated 
development 
environments 
(IDEs) 
• Some 
free 
IDEs 
– RevoluDon 
R 
Enterprise 
– Architect 
– R 
Studio 
• Most 
widely 
used 
R 
IDE 
• It’s 
simple 
and 
intuiDve 
• Used 
to 
build 
this 
tutorial
RevoluDon 
analyDcs 
h=p://www.revoluDonanalyDcs.com
Architect 
h=p://www.openanalyDcs.eu
R 
Studio 
h=p://www.rstudio.com 
Install 
R 
Studio 
from 
here
R 
Studio 
Type 
code 
here 
Results 
appear 
here 
Environment 
and 
history 
Packages, 
plots, 
files
The R language
Basic 
grammar 
of 
R 
object 
= 
funcDon(arguments)
Guess 
what 
this 
does 
Z 
<-­‐ 
read.table(“MyFile.txt”)
Two 
ways 
of 
doing 
it 
= 
is 
the 
same 
as 
<-­‐
Gezng 
help 
help(getwd)
Reading 
the 
slides 
# 
Comments 
are 
in 
blue 
<-­‐ 
Code 
is 
in 
green 
Output 
is 
in 
black
Set 
the 
working 
directory 
# 
For 
the 
tutorial, 
load 
all 
the 
R 
program 
and 
data 
files 
provided 
(see 
Resources 
slide) 
# 
into 
a 
directory 
of 
your 
choice 
# 
Set 
your 
working 
directory 
to 
this 
directory, 
e.g., 
for 
a 
Mac 
setwd("/Users/ 
… 
somewhere 
on 
your 
computer 
… 
/R_tutorial") 
# 
and 
for 
Windows 
Setwd("C:/ 
… 
somewhere 
on 
your 
computer 
… 
/R_tutorial") 
# 
List 
the 
files 
in 
the 
directory 
list.files() 
> 
list.files() 
[1] 
"insurance.csv" 
"negaDve-­‐words.txt" 
"posiDve-­‐words.txt" 
[4] 
"R_facebook.R" 
"R_intro.R" 
"R_regression.R" 
[7] 
"R_twi=er_senDment.R" 
"R_twi=er.R" 
"senDment.R" 
[10] 
"twi=erHull.csv"
Data 
types 
and 
data 
structures 
Data 
types 
Numeric 
Character 
Logical 
Data 
structures 
Vectors 
Lists 
MulD-­‐dimensional 
Matrices 
Dataframes
Vectors 
and 
data 
classes 
Values 
can 
be 
combined 
into 
vectors 
using 
the 
c() 
funcDon 
# this is a comment! 
num.var <- c(1, 2, 3, 4) # numeric vector! 
char.var <- c("1", "2", "3", "4") # character vector! 
log.var <- c(TRUE, TRUE, FALSE, TRUE) # logical vector! 
Vectors 
have 
a 
class 
which 
determines 
how 
funcDons 
treat 
them 
> class(num.var)! 
[1] "numeric"! 
> class(char.var)! 
[1] "character"! 
> class(log.var)! 
[1] "logical"!
Vectors 
and 
data 
classes 
Can 
calculate 
mean 
of 
a 
numeric 
vector, 
but 
not 
of 
a 
character 
vector 
> mean(num.var)! 
[1] 2.5! 
> mean(char.var)! 
[1] NA! 
Warning message:! 
In mean.default(char.var) :! 
argument is not numeric or 
logical: returning NA!
Lists 
# 
create 
a 
list 
-­‐ 
a 
collecDon 
of 
vectors 
employees 
<-­‐ 
c("John", 
"Sunil", 
"Anna") 
yearsService 
<-­‐ 
c(3, 
2, 
6) 
empDetails 
<-­‐ 
list(employees, 
yearsService) 
class(empDetails) 
empDetails 
> 
class(empDetails) 
[1] 
"list" 
> 
empDetails 
[[1]] 
[1] 
"John" 
"Sunil" 
"Anna" 
[[2]] 
[1] 
3 
2 
6
Dataframes 
A 
data.frame 
is 
a 
list 
of 
vectors, 
each 
of 
the 
same 
length 
DF <- data.frame(x=1:5, 
y=letters[1:5], z=letters[6:10])! 
> DF # data.frame with 3 columns and 
5 rows! 
x y Z! 
1 1 a f! 
2 2 b g! 
3 3 c h! 
4 4 d i! 
5 5 e j!
Multiple regression in R
insurance.csv 
• insurance.csv 
contains 
medical 
expenses 
for 
paDents 
enrolled 
in 
a 
healthcare 
plan 
• The 
data 
file 
contains 
1,338 
cases 
with 
features 
of 
the 
paDent 
as 
well 
as 
the 
total 
medical 
expenses 
charged 
to 
the 
paDent’s 
healthcare 
plan 
for 
the 
calendar 
year 
• There 
are 
no 
missing 
values 
(these 
would 
be 
shown 
as 
NA 
in 
R 
indicaDng 
empty 
or 
null)
insurance.csv 
Variable 
Descrip=on 
age 
an 
integer 
indicaDng 
the 
age 
of 
the 
beneficiary 
sex 
Either 
“male” 
or 
“female” 
bmi 
body 
mass 
index 
(BMI), 
which 
gives 
an 
indicaDon 
of 
how 
over 
or 
under-­‐ 
weight 
a 
person 
is. 
BMI 
is 
calculated 
as 
weight 
in 
kilograms 
divideD 
by 
height 
in 
metres 
squared. 
An 
ideal 
BMI 
is 
in 
the 
range 
18.5 
to 
24.9 
children 
an 
integer 
showing 
the 
number 
of 
children/dependents 
covered 
by 
the 
plan 
smoker 
“yes” 
or 
“no” 
region 
the 
beneficiary’s 
place 
of 
residence, 
divided 
into 
four 
regions: 
“northeast”, 
“southeast”, 
“southwest”, 
or 
“northwest”. 
This 
example 
is 
taken 
from 
Lantz 
(2013)
Read 
the 
data 
insurance 
<-­‐ 
read.csv("insurance.csv", 
stringsAsFactors 
= 
TRUE) 
head(insurance) 
age sex bmi children smoker region charges! 
1 19 female 27.900 0 yes southwest 16884.924! 
2 18 male 33.770 1 no southeast 1725.552! 
3 28 male 33.000 3 no southeast 4449.462! 
4 33 male 22.705 0 no northwest 21984.471! 
5 32 male 28.880 0 no northwest 3866.855! 
6 31 female 25.740 0 no southeast 3756.622!
Working 
with 
data 
• There 
are 
specific 
funcDons 
for 
reading 
from 
(and 
wriDng 
to) 
excel, 
SPSS, 
SAS, 
etc. 
• However, 
the 
simplest 
way 
is 
to 
export 
and 
import 
files 
in 
csv 
(comma 
separated 
values) 
format 
– 
the 
lingua 
franca 
of 
data
Explore 
the 
data 
summary(insurance$charges)! 
Min. 1st Qu. Median Mean 3rd Qu. Max. ! 
1122 4740 9382 13270 16640 63770 ! 
table(insurance$region)! 
northeast northwest southeast southwest ! 
324 325 364 325 ! 
cor(insurance[c("age", "bmi", "children", 
"charges")])! 
age bmi children charges! 
age 1.0000000 0.1092719 0.04246900 0.29900819! 
bmi 0.1092719 1.0000000 0.01275890 0.19834097! 
children 0.0424690 0.0127589 1.00000000 0.06799823! 
charges 0.2990082 0.1983410 0.06799823 1.00000000!
Visualise 
the 
data 
hist(insurance$charges)
Visualise 
the 
data 
pairs(insurance[c("age", 
"bmi", 
"children", 
"charges")])
Visualise 
the 
data 
-­‐ 
be=er 
library(psych) 
pairs.panels(insurance[c("age", 
"bmi", 
"children", 
"charges")], 
hist.col="yellow”)
Installing 
packages 
• pairs.panels 
is 
a 
funcDon 
in 
the 
psych 
package, 
which 
needs 
to 
be 
installed:
MulDple 
regression 
1 
ins_model1 <- lm(charges ~ age + 
children + bmi, data = insurance)! 
Residuals:! 
Min 1Q Median 3Q Max ! 
-13884 -6994 -5092 7125 48627 ! 
! 
Coefficients:! 
11.8% 
of 
variaDon 
in 
insurance 
charges 
is 
explained 
by 
the 
model 
Estimate Std. Error t value Pr(>|t|) ! 
(Intercept) -6916.24 1757.48 -3.935 8.74e-05 ***! 
age 239.99 22.29 10.767 < 2e-16 ***! 
children 542.86 258.24 2.102 0.0357 * ! 
bmi 332.08 51.31 6.472 1.35e-10 ***! 
---! 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1! 
! 
Residual standard error: 11370 on 1334 degrees of freedom! 
Multiple R-squared: 0.1201, !Adjusted R-squared: 0.1181 ! 
F-statistic: 60.69 on 3 and 1334 DF, p-value: < 2.2e-16!
SPSS
MulDple 
regression 
2 
ins_model2 <- lm(charges ~ age + children + bmi + 
sex + smoker + region, data = insurance)! 
Residuals:! 
Min 1Q Median 3Q Max ! 
-11304.9 -2848.1 -982.1 1393.9 29992.8 ! 
! 
Coefficients:! 
74.9% 
of 
variaDon 
in 
insurance 
charges 
is 
explained 
by 
the 
model 
Estimate Std. Error t value Pr(>|t|) ! 
(Intercept) -11938.5 987.8 -12.086 < 2e-16 ***! 
age 256.9 11.9 21.587 < 2e-16 ***! 
children 475.5 137.8 3.451 0.000577 ***! 
bmi 339.2 28.6 11.860 < 2e-16 ***! 
sexmale -131.3 332.9 -0.394 0.693348 ! 
smokeryes 23848.5 413.1 57.723 < 2e-16 ***! 
regionnorthwest -353.0 476.3 -0.741 0.458769 ! 
regionsoutheast -1035.0 478.7 -2.162 0.030782 * ! 
regionsouthwest -960.0 477.9 -2.009 0.044765 * ! 
---! 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1! 
! 
Residual standard error: 6062 on 1329 degrees of freedom! 
Multiple R-squared: 0.7509, !Adjusted R-squared: 0.7494 ! 
F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16!
Dummy 
variables 
• R 
coded 
character 
vectors 
as 
factors 
and 
automaDcally 
analysed 
them 
as 
dummy 
variables 
– stringsAsFactors 
= 
TRUE 
• In 
SPSS 
these 
need 
to 
be 
coded 
by 
hand
SPSS 
-­‐ 
create 
dummy 
variables 
• Recode 
sex 
• 0 
= 
female 
• 1 
= 
male 
• Recode 
smoker 
– 1 
= 
yes 
– 0 
= 
no 
• Recode 
region 
– Number 
of 
dummies 
= 
number 
of 
groups 
– 
1 
– = 
4 
– 
1 
= 
3 
• northeast 
= 
0, 
0, 
0 
• northwest 
= 
1, 
0, 
0 
• southeast 
= 
0, 
1, 
0 
• southwest 
= 
0, 
0, 
1
Recode
Recode
Paste 
the 
syntax
Data 
set 
with 
all 
dummy 
coded 
variables 
created* 
*aKer 
quite 
a 
bit 
of 
work!
Mining Twitter with R
Twi=eR 
• Install 
the 
Twi=eR 
package 
to 
access 
the 
Twi=er 
API 
• Before 
you 
can 
access 
Twi=er 
from 
R 
you 
have 
to: 
– Sign 
up 
for 
a 
Twi=er 
developer 
account 
– Create 
a 
Twi=er 
app 
and 
copy 
the 
authenDcaDon 
details
Twi=er 
authenDcaDon 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 
For 
details 
of 
how 
to 
authenDcate 
and 
set 
up 
Twi=eR: 
h=p://thinktostart.wordpress.com/2013/05/22/twi=er-­‐authenDficaDon-­‐with-­‐r/
Twi=er 
authenDcaDon 
############################################# 
# 
1 
-­‐ 
AuthenDcate 
with 
twi=er 
API 
############################################# 
library(twi=eR) 
library(ROAuth) 
api_key 
<-­‐ 
”xxxxxxxxxxxxxxxxxxYsnu5NM" 
api_secret 
<-­‐ 
"wm5kU4xxxxxxxxxxxxxxxxxxxxxxxxxxxxQmMyzuBRbATklN05" 
access_token 
<-­‐ 
"581xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxFJwaiccnAAzScISQlp4o" 
access_token_secret 
<-­‐ 
"tqHnnDDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxnhscT4sTp" 
From 
your 
Twi=er 
applicaDon 
sezngs 
setup_twi=er_oauth(api_key,api_secret,access_token,access_token_secret)
Retrieve 
Twi=er 
account 
details 
############################################################ 
# 
2 
-­‐ 
Get 
details 
of 
Hull 
Uni 
Twi=er 
account 
############################################################ 
twitacc 
<-­‐ 
getUser('UniOfHull') 
twitacc$getDescripDon() 
twitacc$getFollowersCount() 
friends 
= 
twitacc$getFriends(10) 
# 
limit 
the 
number 
of 
friends 
returned 
to 
10 
friends 
> 
twitacc$getDescripDon() 
[1] 
"Our 
official 
Twi=er 
feed 
featuring 
the 
latest 
news 
and 
events 
from 
the 
University 
of 
Hull" 
> 
twitacc$getFollowersCount() 
[1] 
20025 
> 
friends 
= 
twitacc$getFriends(10) 
> 
friends 
$`2644622095` 
[1] 
"HullNursing" 
$`1536529304` 
[1] 
"HullSimulaDon”
Trace 
the 
network 
net 
<-­‐ 
getUser(friends[8]) 
# 
Sheridansmith1 
is 
friend 
no. 
8 
net$getDescripDon() 
net$getFollowersCount() 
net$getFriends(n=10) 
> 
net$getDescripDon() 
[1] 
"Sister 
of 
@damiandsmith 
of 
that 
there 
band 
@_TheTorn 
:) 
x" 
> 
net$getFollowersCount() 
[1] 
502167 
> 
net$getFriends(n=10) 
$`88598283` 
[1] 
"overnightstv" 
$`423373477` 
[1] 
"IsleLoseIt" 
$`19650489` 
[1] 
"donneriron"
Get 
tweets 
and 
write 
to 
file 
############################################################ 
# 
3 
-­‐ 
Search 
Twi=er 
############################################################ 
Twi=er.list 
<-­‐ 
searchTwi=er('@UniOfHull', 
n=500) 
# 
limited 
to 
500 
tweets 
for 
demo 
Twi=er.df 
= 
twListToDF(Twi=er.list) 
write.csv(Twi=er.df, 
file='twi=er.csv', 
row.names=F)
Tweet 
data 
read 
into 
Excel
Tweet 
text 
@UniOfHull
Analyse 
the 
tweets: 
the 
tm 
package 
# 
read 
the 
twi=er 
data 
into 
a 
data 
frame 
(use 
previously 
stored 
.csv 
to 
# 
replicate 
the 
twi=er 
analysis 
presented 
here) 
tweet_raw 
<-­‐ 
read.csv(“twi=erHull.csv", 
stringsAsFactors 
= 
FALSE) 
# 
remove 
the 
non-­‐text 
characters 
tweet_raw$text 
<-­‐ 
gsub("[^[:alnum:]///' 
]", 
"", 
tweet_raw$text) 
# 
build 
a 
corpus, 
which 
is 
a 
collecDon 
of 
text 
documents 
# 
VectorSource 
specifies 
that 
the 
source 
is 
a 
character 
vector 
library(tm) 
myCorpus 
<-­‐ 
Corpus(VectorSource(tweet_raw$text))
Inspect 
the 
corpus 
# 
examine 
the 
sms 
corpus 
inspect(myCorpus[1:3]) 
[[1]] 
Have 
you 
been 
to 
the 
UniOfHull 
popup 
campus 
in 
Leeds 
yet 
It's 
at 
the 
White 
Cloth 
Gallery 
Aire 
Street 
for 
all 
your 
Clearing2014 
queries 
[[2]] 
RT 
UniOfHull 
In 
Newcastle 
and 
looking 
for 
Clearing2014 
advice 
Head 
to 
the 
UniOfHull 
popup 
campus 
at 
Newcastle 
Arts 
Centre 
on 
Westgate 
[[3]] 
RT 
Bethanyn96 
Got 
into 
unio‡ull 
so 
happy
Clean 
corpus 
and 
create 
a 
wordcloud 
# 
clean 
up 
the 
corpus 
using 
tm_map() 
corpus_clean 
<-­‐ 
tm_map(corpus_clean, 
PlainTextDocument) 
corpus_clean 
<-­‐ 
tm_map(myCorpus, 
tolower) 
corpus_clean 
<-­‐ 
tm_map(corpus_clean, 
removeNumbers) 
corpus_clean 
<-­‐ 
tm_map(corpus_clean, 
removeWords, 
stopwords()) 
corpus_clean 
<-­‐ 
tm_map(corpus_clean, 
removePunctuaDon) 
corpus_clean 
<-­‐ 
tm_map(corpus_clean, 
stripWhitespace) 
# 
remove 
any 
words 
not 
wanted 
in 
the 
wordcloud 
corpus_clean 
<-­‐ 
tm_map(corpus_clean, 
removeWords, 
"unio‡ull") 
# 
create 
a 
wordcloud 
library(wordcloud) 
wordcloud(corpus_clean, 
min.freq 
= 
10, 
random.order 
= 
FALSE, 
colors=brewer.pal(8, 
"Dark2"))
August 
2014
SenDment 
analysis 
• What 
is 
the 
senDment 
of 
the 
tweets? 
• How 
does 
the 
number 
of 
posiDve 
words 
compare 
with 
the 
number 
of 
negaDve 
words? 
• The 
number 
of 
posiDve 
words 
minus 
the 
number 
of 
negaDve 
words 
gives 
a 
rough 
indicaDon 
of 
the 
“senDment” 
of 
the 
tweet 
• The 
posiDve 
and 
negaDve 
words 
are 
taken 
form 
a 
word 
list 
developed 
by 
Hu 
and 
Liu: 
– h=p://www.cs.uic.edu/~liub/FBS/opinion-­‐lexicon-­‐ 
English.rar
SenDment 
analysis 
# 
load 
libraries 
library(plyr) 
library(ggplot2) 
# 
load 
the 
score.senDment() 
funcDon 
– 
see 
appendix 
A 
for 
code 
source( 
'senDment.R' 
) 
# 
read 
the 
tweets 
saved 
previously 
hull.tweets 
<-­‐ 
read.csv(“twi=erHull.csv", 
stringsAsFactors 
= 
FALSE) 
# 
read 
the 
lists 
of 
pos 
and 
neg 
words 
from 
Hu 
& 
Liu 
hu.liu.pos 
= 
scan('posiDve-­‐words.txt', 
what='character', 
comment.char=';') 
hu.liu.neg 
= 
scan('negaDve-­‐words.txt', 
what='character', 
comment.char=';')
SenDment 
analysis 
# 
extract 
the 
text 
of 
the 
tweets 
and 
pass 
to 
the 
senDment 
funcDon 
for 
# 
scoring 
(see 
Appendix 
A 
for 
R 
code) 
hull.text 
= 
hull.tweets$text 
hull.scores 
= 
score.senDment(hull.text, 
pos.words, 
neg.words, 
.progress='text') 
# 
make 
a 
histogram 
of 
the 
scores 
ggplot(hull.scores, 
aes(x=score)) 
+ 
geom_histogram(binwidth=1, 
colour="black", 
fill="lightblue") 
+ 
xlab("SenDment 
score") 
+ 
ylab("Frequency") 
+ 
ggDtle("TWITTER: 
Hull 
University 
SenDment 
Analysis")
Number 
of 
posiDve 
word 
matches 
minus 
the 
number 
of 
negaDve 
word 
matches
WriDng 
the 
data 
out 
for 
text 
analysis 
• Many 
text 
analysis 
packages 
require 
each 
comment 
to 
be 
in 
a 
separate 
file 
• If 
the 
data 
is 
in 
Excel 
or 
SPSS 
it 
will 
be 
cumbersome 
to 
generate 
the 
files 
manually 
• Write 
R 
code 
instead
Create 
an 
output 
directory 
# 
create 
an 
output 
directory 
for 
the 
txt 
files 
if 
it 
does 
not 
exist 
mainDir 
<-­‐ 
getwd() 
subDir 
<-­‐ 
"outputText" 
if 
(file.exists(subDir)){ 
setwd(file.path(mainDir, 
subDir)) 
} 
else 
{ 
dir.create(file.path(mainDir, 
subDir)) 
setwd(file.path(mainDir, 
subDir)) 
}
Loop 
to 
write 
the 
files 
# 
find 
out 
how 
many 
rows 
in 
the 
data.frame 
tweets 
= 
nrow(Twi=er.df) 
# 
loop 
to 
write 
the 
txt 
files 
for 
(tweet 
in 
1:tweets) 
{ 
tweetText 
= 
Twi=er.df[tweet, 
1] 
filename 
= 
paste("output", 
tweet, 
".txt", 
sep 
= 
"") 
writeLines(tweetText, 
con 
= 
filename) 
} 
Note 
that 
R 
has 
iterators 
that 
oKen 
remove 
the 
need 
to 
write 
loops, 
see 
the 
“apply” 
family 
of 
funcDons
Accessing Facebook with R
Rfacebook 
## 
loading 
libraries 
library(Rfacebook) 
library(igraph) 
# 
get 
your 
token 
from 
'h=ps://developers.facebook.com/tools/explorer/' 
# 
make 
sure 
you 
give 
permissions 
to 
access 
your 
list 
of 
friends 
# 
set 
your 
FB 
token 
token 
<-­‐ 
“xxxxxxxxxxxxxxCAACEdEose0cBALkyqIxxxxxxxxxxxxxxx” 
For 
details 
of 
how 
to 
authenDcate 
and 
use 
Rfacebook: 
h=p://pablobarbera.com/blog/archives/3.html 
Get 
the 
access 
token 
here: 
h=ps://developers.facebook.com/tools/explorer
Get 
the 
data 
and 
graph 
the 
network 
# 
download 
adjacency 
matrix 
for 
network 
of 
Facebook 
friends 
my_network 
<-­‐ 
getNetwork(token, 
format="adj.matrix") 
# 
friends 
who 
are 
friends 
with 
me 
alone 
singletons 
<-­‐ 
rowSums(my_network)==0 
# 
graph 
the 
network 
my_graph 
<-­‐ 
graph.adjacency(my_network[!singletons,! 
singletons]) 
layout 
<-­‐ 
layout.drl(my_graph,opDons=list(simmer.a=racDon=0)) 
plot(my_graph, 
vertex.size=2, 
#vertex.label=NA, 
vertex.label.cex=0.5, 
edge.arrow.size=0, 
edge.curved=TRUE,layout=layout)
Facebook 
network 
graph 
Ok, 
it’s 
ugly 
– 
there 
are 
plenty 
more 
social 
network 
analysis 
and 
graphing 
packages 
in 
R 
to 
try, 
or 
you 
can 
write 
a 
bit 
of 
code 
to 
export 
the 
adjacency 
matrix 
to 
another 
package, 
e.g., 
UCINET/Netdraw, 
Pajek, 
Gephi
write.graph(graph 
= 
my_graph, 
file 
= 
'‹.gml', 
format 
= 
'gml') 
VisualizaDon 
in 
Gephi
To 
find 
out 
what 
funcDons 
a 
package 
supports, 
what 
they 
do 
and 
how 
to 
call 
them 
see 
the 
documentaDon
Beg, 
borrow, 
steal 
code! 
• Don’t 
bother 
wriDng 
code 
from 
scratch 
• If 
you 
want 
to 
know 
how 
to 
do 
something 
then 
Google 
it 
• There 
will 
likely 
be 
a 
soluDon 
that 
you 
can 
scrape 
off 
the 
screen 
and 
modify 
for 
your 
own 
purposes 
• For 
example 
– How 
would 
you 
remove 
rows 
from 
a 
dataframe 
with 
missing 
values 
(NA)? 
– Try 
“r 
how 
to 
remove 
missing 
values” 
in 
Google
What’s 
next? 
• Access 
data 
held 
in 
SQL 
databases 
and 
store 
your 
own 
data, 
e.g., 
MySQL, 
and 
the 
packages 
• Read, 
write, 
manipulate 
Excel 
spreadsheets 
using 
xls 
and 
XLConnect 
• Access 
maps, 
e.g., 
Google 
Maps, 
and 
overlay 
locaDon 
data 
(e.g., 
Tweets) 
on 
a 
map 
• Screen 
scrape 
Web 
sites 
that 
don’t 
have 
APIs 
(e.g., 
Google 
Scholar)
Suggested 
further 
reading 
and 
resources 
• Lantz, 
B., 
(2013). 
Machine 
Learning 
with 
R. 
Packt 
Publishing. 
(highly 
recommended) 
• Miller, 
T., 
(2014). 
Modeling 
Techniques 
in 
Predic;ve 
Analy;cs: 
Business 
Problems 
and 
Solu;ons. 
Pearson 
EducaDon. 
• R 
Reference 
Card 
2.0 
– 
h=p://cran.r-­‐project.org/doc/contrib/Baggo=-­‐refcard-­‐v2.pdf 
• R 
Reference 
Card 
for 
Data 
Mining 
– h=p://cran.r-­‐project.org/doc/contrib/YanchangZhao-­‐refcard-­‐data-­‐mining.pdf 
• 
R-­‐bloggers 
for 
news 
and 
tutorials 
– h=p://www.r-­‐bloggers.com
Appendices
Appendix 
A 
The 
score.senDment() 
funcDon 
#' 
#' score.sentiment() implements a very simple algorithm 
to estimate 
#' sentiment, assigning a integer score by subtracting 
the number 
#' of occurrences of negative words from that of 
positive words. 
#' 
#' @param sentences vector of text to score 
#' @param pos.words vector of words of postive sentiment 
#' @param neg.words vector of words of negative 
sentiment 
#' @param .progress passed to <code>laply()</code> to 
control of progress bar. 
#' @returnType data.frame 
#' @return data.frame of text and corresponding 
sentiment scores 
#' @author Jefrey Breen jbreen@cambridge.aero 
h=ps://github.com/jeffreybreen/twi=er-­‐senDment-­‐analysis-­‐tutorial-­‐201107
score.sentiment = function(sentences, pos.words, 
neg.words, .progress='none') 
{ 
require(plyr) 
require(stringr) 
# we got a vector of sentences. plyr will handle a list or a 
vector as an "l" for us 
# we want a simple array of scores back, so we use "l" + "a" 
+ "ply" = laply: 
scores = laply(sentences, function(sentence, pos.words, 
neg.words) { 
# remove the non-text characters 
sentence <- gsub("[^[:alnum:]///' ]", "", sentence) 
# clean up sentences with R's regex-driven global 
substitute, gsub(): 
sentence = gsub('[[:punct:]]', '', sentence) 
sentence = gsub('[[:cntrl:]]', '', sentence) 
sentence = gsub('d+', '', sentence) 
# and convert to lower case: 
sentence = tolower(sentence)
# split into words. str_split is in the stringr package 
word.list = str_split(sentence, 's+') 
# sometimes a list() is one level of hierarchy too much 
words = unlist(word.list) 
# compare our words to the dictionaries of positive & 
negative terms 
pos.matches = match(words, pos.words) 
neg.matches = match(words, neg.words) 
# match() returns the position of the matched term or NA 
# we just want a TRUE/FALSE: 
pos.matches = !is.na(pos.matches) 
neg.matches = !is.na(neg.matches) 
# and conveniently enough, TRUE/FALSE will be treated as 
1/0 by sum(): 
score = sum(pos.matches) - sum(neg.matches) 
return(score) 
}, pos.words, neg.words, .progress=.progress ) 
scores.df = data.frame(score=scores, text=sentences) 
return(scores.df) 
}

Weitere ähnliche Inhalte

Was ist angesagt?

The History and Use of R
The History and Use of RThe History and Use of R
The History and Use of RAnalyticsWeek
 
R programming Language , Rahul Singh
R programming Language , Rahul SinghR programming Language , Rahul Singh
R programming Language , Rahul SinghRavi Basil
 
R programming groundup-basic-section-i
R programming groundup-basic-section-iR programming groundup-basic-section-i
R programming groundup-basic-section-iDr. Awase Khirni Syed
 
Introduction to Rstudio
Introduction to RstudioIntroduction to Rstudio
Introduction to RstudioOlga Scrivner
 
R programming language
R programming languageR programming language
R programming languageKeerti Verma
 
Workshop presentation hands on r programming
Workshop presentation hands on r programmingWorkshop presentation hands on r programming
Workshop presentation hands on r programmingNimrita Koul
 
R programming Fundamentals
R programming  FundamentalsR programming  Fundamentals
R programming FundamentalsRagia Ibrahim
 
R programming presentation
R programming presentationR programming presentation
R programming presentationAkshat Sharma
 
Introduction to R
Introduction to RIntroduction to R
Introduction to RAjay Ohri
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using RVictoria López
 
Weka tutorial
Weka tutorialWeka tutorial
Weka tutorialGRajendra
 
Intro to R statistic programming
Intro to R statistic programming Intro to R statistic programming
Intro to R statistic programming Bryan Downing
 

Was ist angesagt? (20)

The History and Use of R
The History and Use of RThe History and Use of R
The History and Use of R
 
R programming Language , Rahul Singh
R programming Language , Rahul SinghR programming Language , Rahul Singh
R programming Language , Rahul Singh
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
R programming groundup-basic-section-i
R programming groundup-basic-section-iR programming groundup-basic-section-i
R programming groundup-basic-section-i
 
Introduction to Rstudio
Introduction to RstudioIntroduction to Rstudio
Introduction to Rstudio
 
Class ppt intro to r
Class ppt intro to rClass ppt intro to r
Class ppt intro to r
 
R programming language
R programming languageR programming language
R programming language
 
R Programming
R ProgrammingR Programming
R Programming
 
Workshop presentation hands on r programming
Workshop presentation hands on r programmingWorkshop presentation hands on r programming
Workshop presentation hands on r programming
 
R programming Fundamentals
R programming  FundamentalsR programming  Fundamentals
R programming Fundamentals
 
R programming
R programmingR programming
R programming
 
R programming
R programmingR programming
R programming
 
R programming presentation
R programming presentationR programming presentation
R programming presentation
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
R programming
R programmingR programming
R programming
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using R
 
Weka tutorial
Weka tutorialWeka tutorial
Weka tutorial
 
Rtutorial
RtutorialRtutorial
Rtutorial
 
Intro to R statistic programming
Intro to R statistic programming Intro to R statistic programming
Intro to R statistic programming
 
Introduction to statistical software R
Introduction to statistical software RIntroduction to statistical software R
Introduction to statistical software R
 

Andere mochten auch

YHORG Presentation 23 February 2016
YHORG Presentation 23 February 2016YHORG Presentation 23 February 2016
YHORG Presentation 23 February 2016Richard Vidgen
 
Big data and value creation
Big data and value creationBig data and value creation
Big data and value creationRichard Vidgen
 
ProSIS - pro social information systems - Vidgen March 2013
ProSIS - pro social information systems - Vidgen March 2013ProSIS - pro social information systems - Vidgen March 2013
ProSIS - pro social information systems - Vidgen March 2013Richard Vidgen
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big dataRichard Vidgen
 
2 R Tutorial Programming
2 R Tutorial Programming2 R Tutorial Programming
2 R Tutorial ProgrammingSakthi Dasans
 
R Introduction
R IntroductionR Introduction
R Introductionschamber
 
Presentation R basic teaching module
Presentation R basic teaching modulePresentation R basic teaching module
Presentation R basic teaching moduleSander Timmer
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycVivian S. Zhang
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environmentizahn
 
Introduction to R for Data Science :: Session 7 [Multiple Linear Regression i...
Introduction to R for Data Science :: Session 7 [Multiple Linear Regression i...Introduction to R for Data Science :: Session 7 [Multiple Linear Regression i...
Introduction to R for Data Science :: Session 7 [Multiple Linear Regression i...Goran S. Milovanovic
 
R programming Basic & Advanced
R programming Basic & AdvancedR programming Basic & Advanced
R programming Basic & AdvancedSohom Ghosh
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programmingizahn
 

Andere mochten auch (20)

YHORG Presentation 23 February 2016
YHORG Presentation 23 February 2016YHORG Presentation 23 February 2016
YHORG Presentation 23 February 2016
 
Big data and value creation
Big data and value creationBig data and value creation
Big data and value creation
 
ProSIS - pro social information systems - Vidgen March 2013
ProSIS - pro social information systems - Vidgen March 2013ProSIS - pro social information systems - Vidgen March 2013
ProSIS - pro social information systems - Vidgen March 2013
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
 
Seefeld stats r_bio
Seefeld stats r_bioSeefeld stats r_bio
Seefeld stats r_bio
 
Using r
Using rUsing r
Using r
 
R presentation
R presentationR presentation
R presentation
 
2 R Tutorial Programming
2 R Tutorial Programming2 R Tutorial Programming
2 R Tutorial Programming
 
Using R For Statistics
Using R For StatisticsUsing R For Statistics
Using R For Statistics
 
R Introduction
R IntroductionR Introduction
R Introduction
 
Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
 
Presentation R basic teaching module
Presentation R basic teaching modulePresentation R basic teaching module
Presentation R basic teaching module
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environment
 
Introduction to R for Data Science :: Session 7 [Multiple Linear Regression i...
Introduction to R for Data Science :: Session 7 [Multiple Linear Regression i...Introduction to R for Data Science :: Session 7 [Multiple Linear Regression i...
Introduction to R for Data Science :: Session 7 [Multiple Linear Regression i...
 
R- Introduction
R- IntroductionR- Introduction
R- Introduction
 
Minitab 17
Minitab 17Minitab 17
Minitab 17
 
R programming Basic & Advanced
R programming Basic & AdvancedR programming Basic & Advanced
R programming Basic & Advanced
 
Non Parametric Tests
Non Parametric TestsNon Parametric Tests
Non Parametric Tests
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programming
 

Ähnlich wie R tutorial

Introduction to basic statistics
Introduction to basic statisticsIntroduction to basic statistics
Introduction to basic statisticsIBM
 
DATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxDATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxmyworld93
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document usefulssuser3c3f88
 
Big data analytics with R tool.pptx
Big data analytics with R tool.pptxBig data analytics with R tool.pptx
Big data analytics with R tool.pptxsalutiontechnology
 
Introduction to r
Introduction to rIntroduction to r
Introduction to rgslicraf
 
BUSINESS ANALYTICS WITH R SOFTWARE DIAST
BUSINESS ANALYTICS WITH R SOFTWARE DIASTBUSINESS ANALYTICS WITH R SOFTWARE DIAST
BUSINESS ANALYTICS WITH R SOFTWARE DIASTHaritikaChhatwal1
 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studioDerek Kane
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL ServerStéphane Fréchette
 
R programming for psychometrics
R programming for psychometricsR programming for psychometrics
R programming for psychometricsDiane Talley
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxMalla Reddy University
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionRevolution Analytics
 
DataMind: An e-learning platform for Data Analysis based on R. RBelgium meetu...
DataMind: An e-learning platform for Data Analysis based on R. RBelgium meetu...DataMind: An e-learning platform for Data Analysis based on R. RBelgium meetu...
DataMind: An e-learning platform for Data Analysis based on R. RBelgium meetu...DataMind-slides
 
aRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con RaRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con RGraphRM
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data scienceShilpaKrishna6
 
Research paper presentation
Research paper presentation Research paper presentation
Research paper presentation Akshat Sharma
 
Unit1_Introduction to R.pdf
Unit1_Introduction to R.pdfUnit1_Introduction to R.pdf
Unit1_Introduction to R.pdfMDDidarulAlam15
 
data structures and its importance
 data structures and its importance  data structures and its importance
data structures and its importance Anaya Zafar
 
An R primer for SQL folks
An R primer for SQL folksAn R primer for SQL folks
An R primer for SQL folksThomas Hütter
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAlex Palamides
 

Ähnlich wie R tutorial (20)

Introduction to basic statistics
Introduction to basic statisticsIntroduction to basic statistics
Introduction to basic statistics
 
DATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxDATA MINING USING R (1).pptx
DATA MINING USING R (1).pptx
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document useful
 
Big data analytics with R tool.pptx
Big data analytics with R tool.pptxBig data analytics with R tool.pptx
Big data analytics with R tool.pptx
 
Introduction to r
Introduction to rIntroduction to r
Introduction to r
 
BUSINESS ANALYTICS WITH R SOFTWARE DIAST
BUSINESS ANALYTICS WITH R SOFTWARE DIASTBUSINESS ANALYTICS WITH R SOFTWARE DIAST
BUSINESS ANALYTICS WITH R SOFTWARE DIAST
 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studio
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL Server
 
R programming for psychometrics
R programming for psychometricsR programming for psychometrics
R programming for psychometrics
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and Revolution
 
DataMind: An e-learning platform for Data Analysis based on R. RBelgium meetu...
DataMind: An e-learning platform for Data Analysis based on R. RBelgium meetu...DataMind: An e-learning platform for Data Analysis based on R. RBelgium meetu...
DataMind: An e-learning platform for Data Analysis based on R. RBelgium meetu...
 
aRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con RaRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con R
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data science
 
Research paper presentation
Research paper presentation Research paper presentation
Research paper presentation
 
Unit1_Introduction to R.pdf
Unit1_Introduction to R.pdfUnit1_Introduction to R.pdf
Unit1_Introduction to R.pdf
 
data structures and its importance
 data structures and its importance  data structures and its importance
data structures and its importance
 
An R primer for SQL folks
An R primer for SQL folksAn R primer for SQL folks
An R primer for SQL folks
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 

Kürzlich hochgeladen

VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 

Kürzlich hochgeladen (20)

VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 

R tutorial

  • 1. An introduction to R Version 1.4 November 2014 h=p://datasciencebusiness.wordpress.com Prof. Richard Vidgen Management Systems Hull University Business School E: r.vidgen@hull.ac.uk
  • 2. Aims and use of this presentaDon This presentaDon provides an introducDon to R for anyone who wants to get an idea of what R is and what it can do. Although no prior experience of R is needed, a basic understanding of data analysis (e.g., mulDple regression) is assumed, as is a basic technical competence (e.g., installing soKware and managing directory structures). If you are already using SPSS you will get a feel for how it compares with R. It’s a work in progress and will be updated based on experience and feedback. Docendo discimus, (LaDn "by teaching, we learn").
  • 3. Contents • PredicDve analyDcs and analyDcs tools • An overview of R • Installing R and an R IDE (integrated development environment) • R syntax and data types • MulDple regression in R and SPSS • The Twi=eR package • The Rfacebook package
  • 4. Resources • The R source and data files for this tutorial can be accessed at: – h=p://datasciencebusiness.wordpress.com • R_intro.R • R_regression.R • R_twi=er.R • R_twi=er_senDment.R • senDment.R • R_facebook.R • insurance.csv • twi=erHull.csv • posiDve-­‐words.txt • negaDve-­‐words.txt WARNING: when R packages are updated by developers things can break and your R programs may stop working. The code in the files above is tested regularly with the latest packages and updated as necessary
  • 6. Be=er decisions -­‐ predicDve analyDcs • A predicDve model that calculates strawberry purchases based on: – Weather forecast – Store temperature – Freezer sensor data – Remaining stock per shelf life – Sales transacDon point of sale feeds – Web searches, social menDons h=p://www.slideshare.net/datasciencelondon/big-­‐data-­‐sorry-­‐data-­‐science-­‐what-­‐does-­‐a-­‐data-­‐scienDst-­‐do
  • 7. PredicDve analyDcs • For example, what data might help us predict which students will drop out? – Assessment grades at University – Prior educaDon a=ainment – Social background – Distance of home from University – Friendship circles and networks (e.g., sports club memberships) – A=endance at lectures and tutorials – InteracDon in lectures and tutorials – Time spent on campus – Time spent in library – Number of accesses to electronic learning resources – Text books purchased – Engagement in subject-­‐related forums – SenDment of social media posts – Etc.
  • 9. Some of the techniques data scienDsts use • ClassificaDon • Clustering • AssociaDon rules • Decision trees • Regression • GeneDc algorithms • Neural networks and support vector machines • Machine learning • Natural language processing • SenDment analysis • ArDficial intelligence • Time series analysis • SimulaDons • Social network analysis
  • 10. Technologies for data analysis: usage rates King, J., & R. Magoulas (2013). Data Science Salary Survey. O’Reilly Media. R and Python programming languages come above Excel Enterprise products bo=om of the heap
  • 11. Data scienDst as “bricoleur” “In the pracDcal arts and the fine arts, bricolage (French for "Dnkering") is the construcDon or creaDon of a work from a diverse range of things that happen to be available, or a work created by such a process.” Wikipedia
  • 13. What is R? R is an open source computer language used for data manipulaDon, staDsDcs, and graphics.
  • 14. History of R • 1976 – Bell Labs develops S, a language for data analysis; released commercially as S-­‐plus • 1990s – R wri=en and released as open source by (R)oss Ihaka and (R)obert Gentleman • 1997 – The Comprehensive R Archive Network (CRAN) launched • August 2014 – CRAN repository contains 5789 user-­‐contributed packages
  • 15. Benefits of R • It’s free! • Runs on mulDple plaqorms (Windows, Unix, MacOS) • ValidaDon/replicaDon of analyses (assumes commented code and documentaDon) • Long term efficiency (using the same code for mulDple projects)
  • 16. SPSS* vs R SPSS • Limited ability for data scienDst to change the environment • Data scienDst relies on algorithms developed by SPSS • Problem-­‐solving constrained by SPSS developers • Must pay for using the constrained algorithms R • Can use funcDons made by a global community of staDsDcs researchers or create their own • Almost unlimited in their ability to change their environment • Can do things SPSS users cannot even dream of • Get all this for free *or any other proprietary closed soKware system
  • 19. 2. Output appears here 1. Type in commands, select the text and run with (cmd + return) or menu opDon: edit | execute
  • 20. R integrated development environments (IDEs) • Some free IDEs – RevoluDon R Enterprise – Architect – R Studio • Most widely used R IDE • It’s simple and intuiDve • Used to build this tutorial
  • 23. R Studio h=p://www.rstudio.com Install R Studio from here
  • 24. R Studio Type code here Results appear here Environment and history Packages, plots, files
  • 26. Basic grammar of R object = funcDon(arguments)
  • 27. Guess what this does Z <-­‐ read.table(“MyFile.txt”)
  • 28. Two ways of doing it = is the same as <-­‐
  • 30. Reading the slides # Comments are in blue <-­‐ Code is in green Output is in black
  • 31. Set the working directory # For the tutorial, load all the R program and data files provided (see Resources slide) # into a directory of your choice # Set your working directory to this directory, e.g., for a Mac setwd("/Users/ … somewhere on your computer … /R_tutorial") # and for Windows Setwd("C:/ … somewhere on your computer … /R_tutorial") # List the files in the directory list.files() > list.files() [1] "insurance.csv" "negaDve-­‐words.txt" "posiDve-­‐words.txt" [4] "R_facebook.R" "R_intro.R" "R_regression.R" [7] "R_twi=er_senDment.R" "R_twi=er.R" "senDment.R" [10] "twi=erHull.csv"
  • 32. Data types and data structures Data types Numeric Character Logical Data structures Vectors Lists MulD-­‐dimensional Matrices Dataframes
  • 33. Vectors and data classes Values can be combined into vectors using the c() funcDon # this is a comment! num.var <- c(1, 2, 3, 4) # numeric vector! char.var <- c("1", "2", "3", "4") # character vector! log.var <- c(TRUE, TRUE, FALSE, TRUE) # logical vector! Vectors have a class which determines how funcDons treat them > class(num.var)! [1] "numeric"! > class(char.var)! [1] "character"! > class(log.var)! [1] "logical"!
  • 34. Vectors and data classes Can calculate mean of a numeric vector, but not of a character vector > mean(num.var)! [1] 2.5! > mean(char.var)! [1] NA! Warning message:! In mean.default(char.var) :! argument is not numeric or logical: returning NA!
  • 35. Lists # create a list -­‐ a collecDon of vectors employees <-­‐ c("John", "Sunil", "Anna") yearsService <-­‐ c(3, 2, 6) empDetails <-­‐ list(employees, yearsService) class(empDetails) empDetails > class(empDetails) [1] "list" > empDetails [[1]] [1] "John" "Sunil" "Anna" [[2]] [1] 3 2 6
  • 36. Dataframes A data.frame is a list of vectors, each of the same length DF <- data.frame(x=1:5, y=letters[1:5], z=letters[6:10])! > DF # data.frame with 3 columns and 5 rows! x y Z! 1 1 a f! 2 2 b g! 3 3 c h! 4 4 d i! 5 5 e j!
  • 38. insurance.csv • insurance.csv contains medical expenses for paDents enrolled in a healthcare plan • The data file contains 1,338 cases with features of the paDent as well as the total medical expenses charged to the paDent’s healthcare plan for the calendar year • There are no missing values (these would be shown as NA in R indicaDng empty or null)
  • 39. insurance.csv Variable Descrip=on age an integer indicaDng the age of the beneficiary sex Either “male” or “female” bmi body mass index (BMI), which gives an indicaDon of how over or under-­‐ weight a person is. BMI is calculated as weight in kilograms divideD by height in metres squared. An ideal BMI is in the range 18.5 to 24.9 children an integer showing the number of children/dependents covered by the plan smoker “yes” or “no” region the beneficiary’s place of residence, divided into four regions: “northeast”, “southeast”, “southwest”, or “northwest”. This example is taken from Lantz (2013)
  • 40. Read the data insurance <-­‐ read.csv("insurance.csv", stringsAsFactors = TRUE) head(insurance) age sex bmi children smoker region charges! 1 19 female 27.900 0 yes southwest 16884.924! 2 18 male 33.770 1 no southeast 1725.552! 3 28 male 33.000 3 no southeast 4449.462! 4 33 male 22.705 0 no northwest 21984.471! 5 32 male 28.880 0 no northwest 3866.855! 6 31 female 25.740 0 no southeast 3756.622!
  • 41. Working with data • There are specific funcDons for reading from (and wriDng to) excel, SPSS, SAS, etc. • However, the simplest way is to export and import files in csv (comma separated values) format – the lingua franca of data
  • 42. Explore the data summary(insurance$charges)! Min. 1st Qu. Median Mean 3rd Qu. Max. ! 1122 4740 9382 13270 16640 63770 ! table(insurance$region)! northeast northwest southeast southwest ! 324 325 364 325 ! cor(insurance[c("age", "bmi", "children", "charges")])! age bmi children charges! age 1.0000000 0.1092719 0.04246900 0.29900819! bmi 0.1092719 1.0000000 0.01275890 0.19834097! children 0.0424690 0.0127589 1.00000000 0.06799823! charges 0.2990082 0.1983410 0.06799823 1.00000000!
  • 43. Visualise the data hist(insurance$charges)
  • 44. Visualise the data pairs(insurance[c("age", "bmi", "children", "charges")])
  • 45. Visualise the data -­‐ be=er library(psych) pairs.panels(insurance[c("age", "bmi", "children", "charges")], hist.col="yellow”)
  • 46. Installing packages • pairs.panels is a funcDon in the psych package, which needs to be installed:
  • 47. MulDple regression 1 ins_model1 <- lm(charges ~ age + children + bmi, data = insurance)! Residuals:! Min 1Q Median 3Q Max ! -13884 -6994 -5092 7125 48627 ! ! Coefficients:! 11.8% of variaDon in insurance charges is explained by the model Estimate Std. Error t value Pr(>|t|) ! (Intercept) -6916.24 1757.48 -3.935 8.74e-05 ***! age 239.99 22.29 10.767 < 2e-16 ***! children 542.86 258.24 2.102 0.0357 * ! bmi 332.08 51.31 6.472 1.35e-10 ***! ---! Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1! ! Residual standard error: 11370 on 1334 degrees of freedom! Multiple R-squared: 0.1201, !Adjusted R-squared: 0.1181 ! F-statistic: 60.69 on 3 and 1334 DF, p-value: < 2.2e-16!
  • 48. SPSS
  • 49.
  • 50. MulDple regression 2 ins_model2 <- lm(charges ~ age + children + bmi + sex + smoker + region, data = insurance)! Residuals:! Min 1Q Median 3Q Max ! -11304.9 -2848.1 -982.1 1393.9 29992.8 ! ! Coefficients:! 74.9% of variaDon in insurance charges is explained by the model Estimate Std. Error t value Pr(>|t|) ! (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***! age 256.9 11.9 21.587 < 2e-16 ***! children 475.5 137.8 3.451 0.000577 ***! bmi 339.2 28.6 11.860 < 2e-16 ***! sexmale -131.3 332.9 -0.394 0.693348 ! smokeryes 23848.5 413.1 57.723 < 2e-16 ***! regionnorthwest -353.0 476.3 -0.741 0.458769 ! regionsoutheast -1035.0 478.7 -2.162 0.030782 * ! regionsouthwest -960.0 477.9 -2.009 0.044765 * ! ---! Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1! ! Residual standard error: 6062 on 1329 degrees of freedom! Multiple R-squared: 0.7509, !Adjusted R-squared: 0.7494 ! F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16!
  • 51. Dummy variables • R coded character vectors as factors and automaDcally analysed them as dummy variables – stringsAsFactors = TRUE • In SPSS these need to be coded by hand
  • 52. SPSS -­‐ create dummy variables • Recode sex • 0 = female • 1 = male • Recode smoker – 1 = yes – 0 = no • Recode region – Number of dummies = number of groups – 1 – = 4 – 1 = 3 • northeast = 0, 0, 0 • northwest = 1, 0, 0 • southeast = 0, 1, 0 • southwest = 0, 0, 1
  • 56. Data set with all dummy coded variables created* *aKer quite a bit of work!
  • 57.
  • 59. Twi=eR • Install the Twi=eR package to access the Twi=er API • Before you can access Twi=er from R you have to: – Sign up for a Twi=er developer account – Create a Twi=er app and copy the authenDcaDon details
  • 60. Twi=er authenDcaDon xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx For details of how to authenDcate and set up Twi=eR: h=p://thinktostart.wordpress.com/2013/05/22/twi=er-­‐authenDficaDon-­‐with-­‐r/
  • 61. Twi=er authenDcaDon ############################################# # 1 -­‐ AuthenDcate with twi=er API ############################################# library(twi=eR) library(ROAuth) api_key <-­‐ ”xxxxxxxxxxxxxxxxxxYsnu5NM" api_secret <-­‐ "wm5kU4xxxxxxxxxxxxxxxxxxxxxxxxxxxxQmMyzuBRbATklN05" access_token <-­‐ "581xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxFJwaiccnAAzScISQlp4o" access_token_secret <-­‐ "tqHnnDDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxnhscT4sTp" From your Twi=er applicaDon sezngs setup_twi=er_oauth(api_key,api_secret,access_token,access_token_secret)
  • 62. Retrieve Twi=er account details ############################################################ # 2 -­‐ Get details of Hull Uni Twi=er account ############################################################ twitacc <-­‐ getUser('UniOfHull') twitacc$getDescripDon() twitacc$getFollowersCount() friends = twitacc$getFriends(10) # limit the number of friends returned to 10 friends > twitacc$getDescripDon() [1] "Our official Twi=er feed featuring the latest news and events from the University of Hull" > twitacc$getFollowersCount() [1] 20025 > friends = twitacc$getFriends(10) > friends $`2644622095` [1] "HullNursing" $`1536529304` [1] "HullSimulaDon”
  • 63. Trace the network net <-­‐ getUser(friends[8]) # Sheridansmith1 is friend no. 8 net$getDescripDon() net$getFollowersCount() net$getFriends(n=10) > net$getDescripDon() [1] "Sister of @damiandsmith of that there band @_TheTorn :) x" > net$getFollowersCount() [1] 502167 > net$getFriends(n=10) $`88598283` [1] "overnightstv" $`423373477` [1] "IsleLoseIt" $`19650489` [1] "donneriron"
  • 64.
  • 65. Get tweets and write to file ############################################################ # 3 -­‐ Search Twi=er ############################################################ Twi=er.list <-­‐ searchTwi=er('@UniOfHull', n=500) # limited to 500 tweets for demo Twi=er.df = twListToDF(Twi=er.list) write.csv(Twi=er.df, file='twi=er.csv', row.names=F)
  • 66. Tweet data read into Excel
  • 68. Analyse the tweets: the tm package # read the twi=er data into a data frame (use previously stored .csv to # replicate the twi=er analysis presented here) tweet_raw <-­‐ read.csv(“twi=erHull.csv", stringsAsFactors = FALSE) # remove the non-­‐text characters tweet_raw$text <-­‐ gsub("[^[:alnum:]///' ]", "", tweet_raw$text) # build a corpus, which is a collecDon of text documents # VectorSource specifies that the source is a character vector library(tm) myCorpus <-­‐ Corpus(VectorSource(tweet_raw$text))
  • 69. Inspect the corpus # examine the sms corpus inspect(myCorpus[1:3]) [[1]] Have you been to the UniOfHull popup campus in Leeds yet It's at the White Cloth Gallery Aire Street for all your Clearing2014 queries [[2]] RT UniOfHull In Newcastle and looking for Clearing2014 advice Head to the UniOfHull popup campus at Newcastle Arts Centre on Westgate [[3]] RT Bethanyn96 Got into unio‡ull so happy
  • 70. Clean corpus and create a wordcloud # clean up the corpus using tm_map() corpus_clean <-­‐ tm_map(corpus_clean, PlainTextDocument) corpus_clean <-­‐ tm_map(myCorpus, tolower) corpus_clean <-­‐ tm_map(corpus_clean, removeNumbers) corpus_clean <-­‐ tm_map(corpus_clean, removeWords, stopwords()) corpus_clean <-­‐ tm_map(corpus_clean, removePunctuaDon) corpus_clean <-­‐ tm_map(corpus_clean, stripWhitespace) # remove any words not wanted in the wordcloud corpus_clean <-­‐ tm_map(corpus_clean, removeWords, "unio‡ull") # create a wordcloud library(wordcloud) wordcloud(corpus_clean, min.freq = 10, random.order = FALSE, colors=brewer.pal(8, "Dark2"))
  • 72. SenDment analysis • What is the senDment of the tweets? • How does the number of posiDve words compare with the number of negaDve words? • The number of posiDve words minus the number of negaDve words gives a rough indicaDon of the “senDment” of the tweet • The posiDve and negaDve words are taken form a word list developed by Hu and Liu: – h=p://www.cs.uic.edu/~liub/FBS/opinion-­‐lexicon-­‐ English.rar
  • 73. SenDment analysis # load libraries library(plyr) library(ggplot2) # load the score.senDment() funcDon – see appendix A for code source( 'senDment.R' ) # read the tweets saved previously hull.tweets <-­‐ read.csv(“twi=erHull.csv", stringsAsFactors = FALSE) # read the lists of pos and neg words from Hu & Liu hu.liu.pos = scan('posiDve-­‐words.txt', what='character', comment.char=';') hu.liu.neg = scan('negaDve-­‐words.txt', what='character', comment.char=';')
  • 74. SenDment analysis # extract the text of the tweets and pass to the senDment funcDon for # scoring (see Appendix A for R code) hull.text = hull.tweets$text hull.scores = score.senDment(hull.text, pos.words, neg.words, .progress='text') # make a histogram of the scores ggplot(hull.scores, aes(x=score)) + geom_histogram(binwidth=1, colour="black", fill="lightblue") + xlab("SenDment score") + ylab("Frequency") + ggDtle("TWITTER: Hull University SenDment Analysis")
  • 75. Number of posiDve word matches minus the number of negaDve word matches
  • 76. WriDng the data out for text analysis • Many text analysis packages require each comment to be in a separate file • If the data is in Excel or SPSS it will be cumbersome to generate the files manually • Write R code instead
  • 77. Create an output directory # create an output directory for the txt files if it does not exist mainDir <-­‐ getwd() subDir <-­‐ "outputText" if (file.exists(subDir)){ setwd(file.path(mainDir, subDir)) } else { dir.create(file.path(mainDir, subDir)) setwd(file.path(mainDir, subDir)) }
  • 78. Loop to write the files # find out how many rows in the data.frame tweets = nrow(Twi=er.df) # loop to write the txt files for (tweet in 1:tweets) { tweetText = Twi=er.df[tweet, 1] filename = paste("output", tweet, ".txt", sep = "") writeLines(tweetText, con = filename) } Note that R has iterators that oKen remove the need to write loops, see the “apply” family of funcDons
  • 79.
  • 81. Rfacebook ## loading libraries library(Rfacebook) library(igraph) # get your token from 'h=ps://developers.facebook.com/tools/explorer/' # make sure you give permissions to access your list of friends # set your FB token token <-­‐ “xxxxxxxxxxxxxxCAACEdEose0cBALkyqIxxxxxxxxxxxxxxx” For details of how to authenDcate and use Rfacebook: h=p://pablobarbera.com/blog/archives/3.html Get the access token here: h=ps://developers.facebook.com/tools/explorer
  • 82. Get the data and graph the network # download adjacency matrix for network of Facebook friends my_network <-­‐ getNetwork(token, format="adj.matrix") # friends who are friends with me alone singletons <-­‐ rowSums(my_network)==0 # graph the network my_graph <-­‐ graph.adjacency(my_network[!singletons,! singletons]) layout <-­‐ layout.drl(my_graph,opDons=list(simmer.a=racDon=0)) plot(my_graph, vertex.size=2, #vertex.label=NA, vertex.label.cex=0.5, edge.arrow.size=0, edge.curved=TRUE,layout=layout)
  • 83. Facebook network graph Ok, it’s ugly – there are plenty more social network analysis and graphing packages in R to try, or you can write a bit of code to export the adjacency matrix to another package, e.g., UCINET/Netdraw, Pajek, Gephi
  • 84. write.graph(graph = my_graph, file = '‹.gml', format = 'gml') VisualizaDon in Gephi
  • 85. To find out what funcDons a package supports, what they do and how to call them see the documentaDon
  • 86. Beg, borrow, steal code! • Don’t bother wriDng code from scratch • If you want to know how to do something then Google it • There will likely be a soluDon that you can scrape off the screen and modify for your own purposes • For example – How would you remove rows from a dataframe with missing values (NA)? – Try “r how to remove missing values” in Google
  • 87. What’s next? • Access data held in SQL databases and store your own data, e.g., MySQL, and the packages • Read, write, manipulate Excel spreadsheets using xls and XLConnect • Access maps, e.g., Google Maps, and overlay locaDon data (e.g., Tweets) on a map • Screen scrape Web sites that don’t have APIs (e.g., Google Scholar)
  • 88. Suggested further reading and resources • Lantz, B., (2013). Machine Learning with R. Packt Publishing. (highly recommended) • Miller, T., (2014). Modeling Techniques in Predic;ve Analy;cs: Business Problems and Solu;ons. Pearson EducaDon. • R Reference Card 2.0 – h=p://cran.r-­‐project.org/doc/contrib/Baggo=-­‐refcard-­‐v2.pdf • R Reference Card for Data Mining – h=p://cran.r-­‐project.org/doc/contrib/YanchangZhao-­‐refcard-­‐data-­‐mining.pdf • R-­‐bloggers for news and tutorials – h=p://www.r-­‐bloggers.com
  • 90. Appendix A The score.senDment() funcDon #' #' score.sentiment() implements a very simple algorithm to estimate #' sentiment, assigning a integer score by subtracting the number #' of occurrences of negative words from that of positive words. #' #' @param sentences vector of text to score #' @param pos.words vector of words of postive sentiment #' @param neg.words vector of words of negative sentiment #' @param .progress passed to <code>laply()</code> to control of progress bar. #' @returnType data.frame #' @return data.frame of text and corresponding sentiment scores #' @author Jefrey Breen jbreen@cambridge.aero h=ps://github.com/jeffreybreen/twi=er-­‐senDment-­‐analysis-­‐tutorial-­‐201107
  • 91. score.sentiment = function(sentences, pos.words, neg.words, .progress='none') { require(plyr) require(stringr) # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply: scores = laply(sentences, function(sentence, pos.words, neg.words) { # remove the non-text characters sentence <- gsub("[^[:alnum:]///' ]", "", sentence) # clean up sentences with R's regex-driven global substitute, gsub(): sentence = gsub('[[:punct:]]', '', sentence) sentence = gsub('[[:cntrl:]]', '', sentence) sentence = gsub('d+', '', sentence) # and convert to lower case: sentence = tolower(sentence)
  • 92. # split into words. str_split is in the stringr package word.list = str_split(sentence, 's+') # sometimes a list() is one level of hierarchy too much words = unlist(word.list) # compare our words to the dictionaries of positive & negative terms pos.matches = match(words, pos.words) neg.matches = match(words, neg.words) # match() returns the position of the matched term or NA # we just want a TRUE/FALSE: pos.matches = !is.na(pos.matches) neg.matches = !is.na(neg.matches) # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum(): score = sum(pos.matches) - sum(neg.matches) return(score) }, pos.words, neg.words, .progress=.progress ) scores.df = data.frame(score=scores, text=sentences) return(scores.df) }