Quantifying Text Sentiment in R Using Dictionaries and Scoring Functions

Happy, Sad, Indifferent … Quantifying Text Sentiment in R

Rajarshi Guha
CT R Users Group
May 2012

Preamble

•  https://github.com/rajarshi/ctrug-tweet
•  Focus is on using R to perform this task
•  Won't comment on the validity, rigor, utility, … of sentiment analysis methods
•  Some of the example data is freely available; other parts are available on request

Getting Twitter Data

•  Based on a collaboration with Prof. Debs Ghosh (UConn), studying obesity & social media
•  Accessing Twitter is easy using many languages
   –  We obtained tweets via a PHP client running over an extended period of time
   –  Ended up with 108,164 tweets
•  Won't focus on accessing Twitter data from R
   –  Very straightforward with twitteR

Cleaning Text

•  Load in tweet data, get rid of URLs, HTML escape codes, punctuation, etc.

# Read every column as character; disable comment handling so '#' in tweets is kept
d <- read.csv('pizza-unique.csv', colClasses='character',
              comment.char='', header=TRUE)
d$geox <- as.numeric(d$geox)
d$geoy <- as.numeric(d$geoy)

# Drop each URL up to the next whitespace
remove.urls <- function(x) gsub("http[^[:space:]]*", " ", x)
# Drop the HTML escape code for a double quote
remove.html <- function(x) gsub('&quot;', '', x)

d$text <- remove.urls(d$text)
d$text <- remove.html(d$text)
# Protect @ with a placeholder so @mentions survive punctuation removal
d$text <- gsub("@", "FOOBAZ", d$text)
d$text <- gsub("[[:punct:]]+", " ", d$text)
d$text <- gsub("FOOBAZ", "@", d$text)
d$text <- gsub("[[:space:]]+", ' ', d$text)
d$text <- tolower(d$text)
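As a quick check of the cleaning steps, the sketch below bundles them into one helper and applies it to a made-up tweet. The helper name `clean.tweet`, the sample text, and the exact patterns used for URLs and the `&quot;` escape are illustrative assumptions, not part of the original code.

```r
# Hypothetical helper bundling the cleaning steps; the name is illustrative
clean.tweet <- function(x) {
  x <- gsub("http[^[:space:]]*", " ", x)   # drop URLs (assumed pattern)
  x <- gsub("&quot;", "", x)               # drop the HTML escape for a double quote
  x <- gsub("@", "FOOBAZ", x)              # protect @mentions with a placeholder
  x <- gsub("[[:punct:]]+", " ", x)        # turn punctuation into spaces
  x <- gsub("FOOBAZ", "@", x)              # restore @
  x <- gsub("[[:space:]]+", " ", x)        # collapse runs of whitespace
  tolower(x)
}

# Made-up example tweet
sample.tweet <- "Best PIZZA ever!!! &quot;so good&quot; @joe http://t.co/abc123 #yum"
cleaned <- clean.tweet(sample.tweet)
print(cleaned)
```

The placeholder trick matters because `@` is itself in `[[:punct:]]`; without it, mentions would be indistinguishable from ordinary words after cleaning.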

Quantifying Sentiment

•  Based on identifying words with positive or negative connotations
•  Fundamentally based on looking up words from a dictionary
•  If a tweet has more positive words than negative words, the tweet is positive
•  More sophisticated scoring schemes are possible
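The counting rule above can be sketched in a few lines. The word lists, the function name `score.count`, and the example tweets are toy assumptions for illustration; a real analysis would load a full lexicon.

```r
# Toy word lists (assumptions); a real analysis would load a full lexicon
pos.words <- c("love", "great", "good", "happy")
neg.words <- c("hate", "bad", "awful", "sad")

# Score = number of positive words minus number of negative words
score.count <- function(tweet, pos, neg) {
  words <- strsplit(tweet, "\\s+")[[1]]
  sum(words %in% pos) - sum(words %in% neg)
}

# Made-up, already cleaned tweets
tweets <- c("i love this great pizza",
            "awful service and bad pizza",
            "pizza for lunch")
scores <- sapply(tweets, score.count, pos = pos.words, neg = neg.words,
                 USE.NAMES = FALSE)
labels <- ifelse(scores > 0, "Positive",
                 ifelse(scores < 0, "Negative", "Neutral"))
print(data.frame(tweet = tweets, score = scores, label = labels))
```

A tweet with no dictionary words scores zero, which is one reason most tweets end up classed as neutral under this kind of scheme.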

Better Dictionaries?

•  SentiWordNet
   –  Derived from WordNet, each term is assigned a positivity and negativity score
   –  206K terms
   –  Converted to simple CSV for easy import into R
•  Ideally, should perform POS tagging

[Figure: proportion of negative, neutral, and positive sentiment by part of speech (adjective, adverb, noun, verb); y-axis: Proportion, 0.0 to 1.0]

Scoring Tweets

•  Given a scoring function, we can process the tweets
   –  Perfect use case for parallel processing
   –  Easily switch out the scoring function

library(parallel)  # provides mclapply

swn <- read.csv('sentinet_r.csv', header=TRUE, as.is=TRUE)

# Look up one word; return its two score columns (3 and 4), or zeros if absent
swn.match <- function(w) {
  tmp <- subset(swn, Term == w)
  if (nrow(tmp) >= 1) return(tmp[1,c(3,4)])
  else return(c(0,0))
}

# Tweet score = sum of the first score column minus sum of the second
score.swn <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  cs <- colSums(do.call('rbind',
                        lapply(words, function(z) swn.match(z))))
  return(cs[1]-cs[2])
}

scores <- mclapply(d$text, score.swn)

Profiling Makes Me Happy

•  6052 sec with 24 cores (swn.match + score.swn)
•  Rprof() is a good way to identify bottlenecks*
•  461 sec with 24 cores (score.swn.2)

# Original version: one subset() scan of the whole dictionary per word
swn.match <- function(w) {
  tmp <- subset(swn, Term == w)
  if (nrow(tmp) >= 1) return(tmp[1,c(3,4)])
  else return(c(0,0))
}
score.swn <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  cs <- colSums(do.call('rbind',
                        lapply(words, function(z) swn.match(z))))
  return(cs[1]-cs[2])
}

# Faster version: a single vectorized match() per tweet
score.swn.2 <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  rows <- match(words, swn$Term)
  rows <- rows[!is.na(rows)]
  cs <- colSums(swn[rows,c(3,4)])
  return(cs[1]-cs[2])
}

*  overkill for this example
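To check that swapping per-word `subset()` calls for a single `match()` leaves the score unchanged, the two lookups can be compared on a toy dictionary. Everything below is a sketch under assumptions: the column names `ID`, `Pos`, and `Neg`, the score values, and the function names `score.slow` and `score.fast` are made up; only the layout (term in a `Term` column, scores in columns 3 and 4) follows the code above.

```r
# Toy dictionary with the assumed layout: scores in columns 3 and 4
swn <- data.frame(ID = 1:4,
                  Term = c("good", "bad", "tasty", "stale"),
                  Pos = c(0.75, 0.00, 0.50, 0.00),
                  Neg = c(0.00, 0.625, 0.00, 0.50),
                  stringsAsFactors = FALSE)

# Per-word lookup: subset() scans every dictionary row for each word
score.slow <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  total <- c(0, 0)
  for (w in words) {
    tmp <- subset(swn, Term == w)
    if (nrow(tmp) >= 1) total <- total + unlist(tmp[1, c(3, 4)])
  }
  unname(total[1] - total[2])
}

# Vectorized lookup: one match() call handles all words of the tweet
score.fast <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  rows <- match(words, swn$Term)
  rows <- rows[!is.na(rows)]
  cs <- colSums(swn[rows, c(3, 4)])
  unname(cs[1] - cs[2])
}

tweet <- "good pizza but stale crust and bad service"
print(c(slow = score.slow(tweet), fast = score.fast(tweet)))
```

Both functions take the first matching row for a term, so they should agree even when the dictionary lists a term more than once.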

Looking at the Scores

•  Bulk of the tweets are neutral
•  Similar behavior from either scoring function

[Figure: overlaid density curves of sentiment scores for the SWN and Breen methods; x-axis: Sentiment Scores, -6 to 6; y-axis: density, 0.0 to 2.5]

library(ggplot2)

d$swn <- unlist(scores.swn)
d$breen <- unlist(scores.breen)
# Stack both score sets into long form, one row per (method, score)
tmp <- rbind(data.frame(Method='SWN', Scores=d$swn),
             data.frame(Method='Breen', Scores=d$breen))
ggplot(tmp, aes(x=Scores, fill=Method)) +
  geom_density(alpha=0.25) +
  xlab("Sentiment Scores")

Sentiment & Time of Day

•  Group tweets by hour and evaluate how proportions of positive, negative, etc. vary.

library(reshape)   # melt() with the variable_name argument
library(ggplot2)

tmp <- d
tmp$hour <- strptime(d$time, format='%a, %d %b %Y %H:%M')$hour
tmp <- subset(tmp, !is.na(swn))
# Label each tweet by the sign of its SWN score
tmp$status <- sapply(tmp$swn, function(x) {
  if (x > 0) return("Positive")
  else if (x < 0) return("Negative")
  else return("Neutral")
})
# One row per hour, one column of counts per sentiment label
tmp <- data.frame(do.call('rbind',
                          by(tmp, tmp$hour, function(x) table(x$status))))
tmp$Hour <- factor(rownames(tmp), levels=0:23)
tmp <- melt(tmp, id='Hour', variable_name='Sentiment')
# stat='identity' plots the precomputed counts; position='fill' rescales each bar to proportions
ggplot(tmp, aes(x=Hour,y=value,fill=Sentiment)) +
  geom_bar(stat='identity', position='fill') +
  xlab("") + ylab("Proportion")
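The per-hour proportions that the plot displays can also be computed directly in base R. This is a minimal sketch on made-up data; the data frame `toy` and its values are assumptions, and the hour is given directly rather than parsed from a timestamp.

```r
# Made-up scores with the hour already extracted
toy <- data.frame(hour = c(9, 9, 13, 13, 13),
                  swn  = c(0.5, -0.25, 0, 1.0, 0.75))

# Fixed factor levels keep a column for every label, even when absent in an hour
toy$status <- factor(ifelse(toy$swn > 0, "Positive",
                     ifelse(toy$swn < 0, "Negative", "Neutral")),
                     levels = c("Negative", "Neutral", "Positive"))

counts <- table(toy$hour, toy$status)     # hours in rows, labels in columns
props <- prop.table(counts, margin = 1)   # normalize each row to sum to 1
print(round(props, 2))
```

Fixing the factor levels up front means every hour yields the same three columns, which keeps the rows aligned when they are bound together.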

Sentiment & Time of Day

[Figure: stacked bar chart of the proportion of Negative, Neutral, and Positive tweets for each hour of the day (0 to 23); y-axis: Proportion, 0.0 to 1.0]

Contradictions?

•  Tweets that are negative according to one score but positive according to another

subset(d, swn < -2 & breen > 1)

"i m trying to get some legit food right now like pizza or chicken not this shitty ass school lunch"

"24 i like reading 25 i hate hopsin 26 i love chips salsa 27 i love chevys 28 i was a thug in middle school 29 i love pizza"

"@naturesempwm had a raw pizza 4 lunch today but i was not impressed with the dried out not fresh vegetable spring roll i bought threw out "

Sentiment and Geography

•  What's the spatial distribution of tweet sentiment?
•  Extract tweets located in the CONUS (~ 500)
•  Visualize the direction and strength of sentiments
•  Correlate with other socio-economic factors?

[Figure: tweet locations with legends for swn (-1, 0, 1, 2) and abs(swn) (0.0 to 2.0)]
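One simple way to pull out CONUS tweets is a bounding-box filter on the coordinates. The sketch below is an assumption-heavy illustration: the box limits are a rough approximation (they also cover parts of Canada and Mexico), the sample rows are made up, and treating `geox` as longitude and `geoy` as latitude is a guess about the data layout.

```r
# Rough bounding box for the contiguous US (approximation, not an exact border)
conus <- list(lon = c(-125, -66), lat = c(24, 50))

# Made-up geotagged tweets; geox is ASSUMED to be longitude, geoy latitude
toy <- data.frame(geox = c(-72.7, -0.1, -118.2, NA),
                  geoy = c(41.8, 51.5, 34.1, NA),
                  swn  = c(0.5, -1.0, 0.25, 0))

# Keep rows with known coordinates that fall inside the box
in.conus <- !is.na(toy$geox) & !is.na(toy$geoy) &
  toy$geox >= conus$lon[1] & toy$geox <= conus$lon[2] &
  toy$geoy >= conus$lat[1] & toy$geoy <= conus$lat[2]
us <- toy[in.conus, ]

# Direction = sign of the score, strength = its magnitude
us$direction <- sign(us$swn)
us$strength <- abs(us$swn)
print(us)
```

Splitting the score into sign and magnitude gives two separate quantities to map, for example to point color and point size.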

Other Considerations

•  Should take into account negation
   –  Scan for negation terms and adjust score appropriately
•  Oblivious to sarcasm
•  Sentiment scores should probably be modified by context
•  Lots of M/L opportunities
   –  Spatial analysis
   –  Topic modeling / clustering
   –  Predictive models
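The negation adjustment mentioned above could be sketched as follows: flip the sign of a scored word when the word immediately before it is a negation term. This is one simple heuristic among many, not a method from the talk; the word scores, the negation list, and the name `score.negation` are all illustrative assumptions.

```r
# Toy scores and negation terms; both lists are illustrative assumptions
word.scores <- c(good = 1, fresh = 1, impressed = 1, bad = -1, stale = -1)
negators <- c("not", "no", "never")

# Flip the sign of a scored word when the word just before it is a negator
score.negation <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  if (length(words) == 0) return(0)
  s <- word.scores[words]            # NA for words absent from the dictionary
  s[is.na(s)] <- 0
  prev.neg <- c(FALSE, head(words, -1) %in% negators)
  sum(ifelse(prev.neg, -s, s))
}

print(score.negation("i was not impressed"))
print(score.negation("the pizza was good"))
```

A one-word window misses constructions such as "not very good"; widening the window or handling intensifiers would be natural next steps.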
