Mining socio-political and socio-economic
signals from social media content
Vasileios Lampos
Department of Computer Science
University College London
Summer School on
“Big Data & Networks in Social Sciences”
University of Warwick, Sept. 21-23, 2016
@lampos | lampos.net
Structure of the presentation
1. Introductory remarks
2. Collective inference tasks

— Mining emotions

— Modelling voting intention
3. Personalised inference tasks

— Occupational class

— Income

— Socioeconomic status
4. Concluding remarks
Context and motivation
How can we use online user-generated content to
enhance our understanding about our world?
the Internet, the World Wide Web, connectivity
numerous web products feeding from user activity
user-generated content, publicly available, esp. on
social media platforms (e.g. Twitter)
large-scale digitised data, ‘Big Data’, ‘Data Science’
About Twitter
> 140 characters per published status (tweet)
> users can follow and be followed
> embedded usage of topics (using #hashtags)
> user interaction (re-tweets, @mentions, likes)
> real-time nature
> biased demographics (13-15% of UK’s
population, age bias etc.)
> information is noisy and not always accurate
Inferring collective information 

from user-generated content
Lampos (Ph.D. Thesis, 2012)
Lansdall-Welfare, Lampos & Cristianini (WWW 2012)
Lampos, Preotiuc-Pietro & Cohn (ACL 2013)
mood / emotions
voting intention
Emotion taxonomies and quantification
‘Emotional’ keywords, representing
+ anger, e.g. angry, irritate
+ fear, e.g. fearful, afraid
+ joy, e.g. cheerful, enthusiastic
+ sadness, e.g. depressed, gloomy
+ plus other emotions
> WordNet Affect
> Linguistic Inquiry and Word Count (LIWC)
(Strapparava & Valitutti, 2004; Pennebaker et al., 2001, 2007)
Simply — but maybe not good enough! — we compute
the mean keyword frequency score per emotion
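A minimal sketch of the mean keyword frequency score, assuming whitespace tokenisation and a tiny hypothetical lexicon built only from the example keywords above (the actual WordNet Affect and LIWC lists are much larger, and the normalisation used in the talk may differ):

```python
from collections import Counter

# Hypothetical mini-lexicon: only the example keywords from the slide.
LEXICON = {
    "anger": ["angry", "irritate"],
    "fear": ["fearful", "afraid"],
    "joy": ["cheerful", "enthusiastic"],
    "sadness": ["depressed", "gloomy"],
}

def emotion_scores(tweets, lexicon=LEXICON):
    """Mean keyword frequency per emotion over a batch of tweets."""
    # Assumption: lower-cased whitespace tokens stand in for real tokenisation.
    tokens = [tok for tweet in tweets for tok in tweet.lower().split()]
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    scores = {}
    for emotion, keywords in lexicon.items():
        # Frequency of each keyword, then the mean over the emotion's keywords.
        freqs = [counts[k] / total for k in keywords]
        scores[emotion] = sum(freqs) / len(freqs)
    return scores

scores = emotion_scores(["so cheerful today", "angry and afraid"])
```

Scoring one batch per hour or per day would yield a time series per emotion.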
Circadian emotion patterns from Twitter (UK)
[Figure 1: variation over a 24-hour period of the emotional valence for fear, sadness, joy and anger; the panels shown plot Sadness Score and Joy Score against hourly intervals for Winter, Summer and Aggregated Data. The red line represents days in the winter, the green one days in the summer. The average circadian pattern was extracted by aggregating the two seasonal data sets. Faded colourings represent the SE of the sample mean.]
24h emotion patterns for ‘joy’ and ‘sadness’ for summer
and winter with 95% confidence intervals
‘Joy’ time series based on Twitter (UK)
[Figure 2: 933-day time series for joy in Twitter content, July 2009 to January 2012 (a period of 2½ years). x-axis: Date; y-axis: normalised emotional valence. Series: raw joy signal and 14-day smoothed joy. Annotated events: XMAS, Valentine's Day, Easter, Halloween, the Royal Wedding, CUTS and RIOTS. Notice the peaks corresponding to Christmas and New Year, Valentine's Day and the Royal Wedding.]
Clear peaking pattern during XMAS or other annual
celebrations (Valentine’s Day, Easter)
Recession, riots, and Twitter emotions (UK)
[Figure: rate of mood change by day using the difference in 50-day mean, July 2009 to January 2012. x-axis: Date; y-axis: difference in mean. Series: Anger and Fear. Marked dates: Budget Cuts (UK) and Riots (UK).]
Difference in mean mood score 50 days prior and after
each date; peaks indicate increase in mood change
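The difference-in-means statistic described above can be sketched as follows. This is an illustrative version, assuming a daily mood series and assuming that the "after" window starts on the date itself; the function name `mean_shift` is hypothetical.

```python
import numpy as np

def mean_shift(series, window=50):
    """Mean of the `window` days from day t onwards minus the mean of the
    `window` days before day t. Days without two full windows are NaN."""
    x = np.asarray(series, dtype=float)
    out = np.full(x.shape, np.nan)
    for t in range(window, len(x) - window + 1):
        out[t] = x[t:t + window].mean() - x[t - window:t].mean()
    return out

# Toy series: mood score jumps from 0 to 1 on day 100.
toy = np.concatenate([np.zeros(100), np.ones(100)])
shift = mean_shift(toy)
change_day = int(np.nanargmax(shift))  # day with the largest increase
```

A peak in the output marks a date around which the average mood level changed the most.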
Inferring voting intention — Data sets
United Kingdom
+ 3 political parties (Conservatives, Labour, Lib Dem)
+ 42,000 Twitter users distributed proportionally to UK’s
regional population figures
+ 60 million tweets, 80,976 1-grams
+ 240 polls from 30 Apr. 2010 to 13 Feb. 2012
Austria
+ 4 political parties (SPÖ, ÖVP, FPÖ, GRÜ)
+ 1,100 active Twitter users selected by political scientists
+ 800,000 tweets, 22,917 1-grams
+ 98 polls from 25 Jan. to 25 Dec. 2012
Regularised text regression
observations: $x_i \in \mathbb{R}^m$, $i \in \{1, \dots, n\}$ (collected in $X$)
responses: $y_i \in \mathbb{R}$, $i \in \{1, \dots, n\}$ (collected in $y$)
weights, bias: $w_j, \beta \in \mathbb{R}$, $j \in \{1, \dots, m\}$, with $w^* = [w; \beta]$
$f(x_i) = x_i^T w + \beta$
Elastic Net (Zou & Hastie, 2005)
\[
\operatorname*{argmin}_{w,\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta - \sum_{j=1}^{m} x_{ij} w_j \Big)^2 + \lambda_1 \sum_{j=1}^{m} |w_j| + \lambda_2 \sum_{j=1}^{m} w_j^2 \right\}
\]
The $\lambda_1$ term is the L1-norm penalty and the $\lambda_2$ term is the L2-norm penalty.
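As an illustration of the elastic net objective above, here is a small NumPy sketch that minimises it with proximal gradient descent (ISTA). The solver choice is an assumption made for brevity, not the one used in the talk, and the function names are hypothetical.

```python
import numpy as np

def elastic_net_objective(X, y, w, beta, lam1, lam2):
    """Squared error plus L1 and squared-L2 penalties on w (bias unpenalised)."""
    r = y - beta - X @ w
    return r @ r + lam1 * np.abs(w).sum() + lam2 * (w @ w)

def fit_elastic_net(X, y, lam1, lam2, n_iter=500):
    """Proximal gradient (ISTA): gradient step on the smooth part,
    then soft-thresholding for the L1 penalty."""
    n, m = X.shape
    w, beta = np.zeros(m), 0.0
    # Step size 1/L, with L a Lipschitz constant of the smooth part's gradient.
    Xa = np.hstack([X, np.ones((n, 1))])
    step = 1.0 / (2.0 * (np.linalg.norm(Xa, 2) ** 2 + lam2))
    for _ in range(n_iter):
        r = y - beta - X @ w
        grad_w = -2.0 * X.T @ r + 2.0 * lam2 * w
        grad_b = -2.0 * r.sum()
        z = w - step * grad_w
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)
        beta = beta - step * grad_b
    return w, beta

# Toy data: 5 features, only two of them relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.5
w_hat, beta_hat = fit_elastic_net(X, y, lam1=1.0, lam2=0.1)
```

The L1 term drives irrelevant word weights to exactly zero, while the L2 term stabilises the solution when features are correlated.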
Bilinear (users+text) regularised regression
users: $p \in \mathbb{Z}^+$
observations: $Q_i \in \mathbb{R}^{p \times m}$, $i \in \{1, \dots, n\}$ (collected in $X$)
responses: $y_i \in \mathbb{R}$, $i \in \{1, \dots, n\}$ (collected in $y$)
weights, bias: $u_k, w_j, \beta \in \mathbb{R}$, $k \in \{1, \dots, p\}$, $j \in \{1, \dots, m\}$ (collected in $u$, $w$, $\beta$)
$f(Q_i) = u^T Q_i w + \beta$
Bilinear elastic net (BEN)
$f(Q_i) = u^T Q_i w + \beta$
\[
\operatorname*{argmin}_{u,w,\beta} \left\{ \sum_{i=1}^{n} \left( u^T Q_i w + \beta - y_i \right)^2 + \psi(u, \theta_u) + \psi(w, \theta_w) \right\}
\]
where
\[
\psi(x, \lambda_1, \lambda_2) = \lambda_1 \|x\|_{\ell_1} + \lambda_2 \|x\|_{\ell_2}^2
\]
Training bilinear elastic net (BEN)
Biconvex problem
+ fix u, learn w and vice versa
+ iterate through convex optimisation tasks
Large-scale solvers in SPAMS (Mairal et al., 2010)
[Figure: global objective function during training (red) and the corresponding prediction error (RMSE) on held-out data (blue), plotted against the training step (2 to 30).]
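The alternating scheme (fix u, learn w, and vice versa) can be sketched in NumPy as below. To keep each step in closed form, this sketch keeps only the squared L2 penalties, so it is a ridge-only simplification of BEN rather than the full elastic net; all function names are hypothetical.

```python
import numpy as np

def ridge_with_bias(Z, y, lam):
    """Minimise sum_i (z_i^T a + b - y_i)^2 + lam * ||a||^2 (bias b unpenalised)."""
    n, d = Z.shape
    Za = np.hstack([Z, np.ones((n, 1))])
    P = lam * np.eye(d + 1)
    P[-1, -1] = 0.0  # do not penalise the bias
    sol = np.linalg.solve(Za.T @ Za + P, Za.T @ y)
    return sol[:-1], sol[-1]

def objective(Q, y, u, w, beta, lam_u, lam_w):
    pred = np.einsum("k,ikj,j->i", u, Q, w) + beta
    return ((pred - y) ** 2).sum() + lam_u * (u @ u) + lam_w * (w @ w)

def fit_bilinear(Q, y, lam_u=1.0, lam_w=1.0, n_steps=10):
    """Alternate two convex sub-problems; Q has shape (n, p, m)."""
    n, p, m = Q.shape
    u, w, beta = np.ones(p) / p, np.zeros(m), 0.0
    history = []
    for _ in range(n_steps):
        # Fix u: each observation collapses to an m-dimensional vector Q_i^T u.
        w, beta = ridge_with_bias(np.einsum("k,ikj->ij", u, Q), y, lam_w)
        history.append(objective(Q, y, u, w, beta, lam_u, lam_w))
        # Fix w: each observation collapses to a p-dimensional vector Q_i w.
        u, beta = ridge_with_bias(np.einsum("ikj,j->ik", Q, w), y, lam_u)
        history.append(objective(Q, y, u, w, beta, lam_u, lam_w))
    return u, w, beta, history

rng = np.random.default_rng(1)
Q = rng.normal(size=(40, 3, 4))
y = np.einsum("k,ikj,j->i", rng.normal(size=3), Q, rng.normal(size=4)) + 0.1
u_hat, w_hat, beta_hat, history = fit_bilinear(Q, y)
```

Because each half-step solves its sub-problem exactly, the global objective recorded in `history` never increases, which mirrors the monotone training curve described above.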
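Bilinear and multi-task regression
tasks: $\tau \in \mathbb{Z}^+$
users: $p \in \mathbb{Z}^+$
observations: $Q_i \in \mathbb{R}^{p \times m}$, $i \in \{1, \dots, n\}$ (collected in $X$)
responses: $y_i \in \mathbb{R}^{\tau}$, $i \in \{1, \dots, n\}$ (collected in $Y$)
weights, bias: $u_k, w_j, \beta \in \mathbb{R}^{\tau}$, $k \in \{1, \dots, p\}$, $j \in \{1, \dots, m\}$ (collected in $U$, $W$, $\beta$)
$f(Q_i) = \operatorname{tr}\left( U^T Q_i W \right) + \beta$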
Bilinear Group L2,1 (BGL)
+ a feature (user or word) with a nonzero weight is
encouraged to be nonzero for all tasks, but with
potentially different weights
+ intuitive for political preference inference
\[
\operatorname*{argmin}_{U,W,\beta} \left\{ \sum_{t=1}^{\tau} \sum_{i=1}^{n} \left( u_t^T Q_i w_t + \beta_t - y_{ti} \right)^2 + \lambda_u \sum_{k=1}^{p} \|U_k\|_2 + \lambda_w \sum_{j=1}^{m} \|W_j\|_2 \right\}
\]
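To illustrate why the L2,1 penalty selects the same features across all tasks, the sketch below computes the penalty and its proximal operator (row-wise group soft-thresholding). It assumes rows index features (users or words) and columns index tasks (parties); it shows the regulariser only, not the full BGL solver, and the function names are hypothetical.

```python
import numpy as np

def l21_norm(W):
    """Sum over rows (features) of the L2 norm across columns (tasks)."""
    return np.linalg.norm(W, axis=1).sum()

def prox_l21(W, threshold):
    """Shrink each row towards zero as a group; rows whose norm is below
    the threshold become exactly zero for every task at once."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - threshold / np.maximum(norms, 1e-12), 0.0)
    return W * scale

W = np.array([[3.0, 4.0],    # strong feature: kept for both tasks
              [0.1, 0.1],    # weak feature: removed for both tasks
              [0.0, 0.0]])
W_shrunk = prox_l21(W, threshold=1.0)
```

A surviving row keeps a separate weight per task, which matches the intuition that a user or word can matter for every party but with different weights.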
Voting intention inference performance
Root Mean Squared Error
Method | UK | Austria
Mean poll | 1.69 | 1.851
Last poll | 1.723 | 1.47
Elastic Net (words) | 3.067 | 1.442
BEN | 1.573 | 1.699
BGL | 1.478 | 1.439
Voting intention comparative plots
[Figure (UK): voting intention % (0 to 40) over time (5 to 45) for CON, LAB and LIB; three panels: BEN, BGL and YouGov.]
Voting intention comparative plots
[Figure (Austria): voting intention % (0 to 30) over time (5 to 45) for SPÖ, ÖVP, FPÖ and GRÜ; three panels: Polls, BEN and BGL.]
Qualitative insights
Party | Tweet | Score | User type
SPÖ (centre) | Inflation rate in Austria slightly down in July from 2.2 to 2.1%. Accommodation, Water, Energy more expensive. | 0.745 | Journalist
ÖVP (centre right) | Can really recommend the book “Res Publica” by Johannes #Voggenhuber! Food for thought and so on #Europe #Democracy | -2.323 | Normal user
FPÖ (far right) | Campaign of the Viennese SPÖ on “Living together” plays right into the hands of right-wing populists | -3.44 | Human rights
GRÜ (centre left) | Protest songs against the closing-down of the bachelor course of International Development: <link> #ID_remains #UniBurns #UniRage | 1.45 | Student Union
Inferring user-level information 

from user-generated content
Preotiuc-Pietro, Lampos & Aletras (ACL 2015)
Preotiuc-Pietro, Volkova, Lampos, Bachrach & Aletras
(PLOS ONE, 2015)
Lampos, Aletras, Geyti, Zou & Cox (ECIR 2016)
occupational class
income
socio-economic status (SES)
Linguistic expression and demographics
“Socioeconomic variables are influencing language use.”
+ Validate this hypothesis on a broader,
larger data set using social media
+ Applications
> research, as in computational social
science, health, and psychology
> commercial
(Bernstein, 1960; Labov, 1972/2006)
Standard Occupational Classification (SOC)
Major Group 1 (C1): Managers, Directors and Senior Officials
Sub-major Group 11: Corporate Managers and Directors
Minor Group 111: Chief Executives and Senior Officials
Unit Group 1115: Chief Executives and Senior Officials
•Job: chief executive, bank manager
Unit Group 1116: Elected Officers and Representatives
Minor Group 112: Production Managers and Directors
Minor Group 113: Functional Managers and Directors
Minor Group 115: Financial Institution Managers and Directors
Minor Group 116: Managers and Directors in Transport and Logistics
Minor Group 117: Senior Officers in Protective Services
Minor Group 118: Health and Social Services Managers and Directors
Minor Group 119: Managers and Directors in Retail and Wholesale
Sub-major Group 12: Other Managers and Proprietors
Major Group (C2): Professional Occupations
•Job: mechanical engineer, pediatrist
Major Group (C3): Associate Professional and Technical Occupations
•Job: system administrator, dispensing optician
Major Group (C4): Administrative and Secretarial Occupations
•Job: legal clerk, company secretary
Major Group (C5): Skilled Trades Occupations
•Job: electrical fitter, tailor
Major Group (C6): Caring, Leisure and Other Service Occupations
•Job: nursery assistant, hairdresser
Major Group (C7): Sales and Customer Service Occupations
•Job: sales assistant, telephonist
Major Group (C8): Process, Plant and Machine Operatives
•Job: factory worker, van driver
Major Group (C9): Elementary Occupations
•Job: shelf stacker, bartender
9 major groups
25 sub-major groups
90 minor groups
369 unit groups
provided by the
Office for National
Statistics (UK)
Standard Occupational Classification (SOC)
C1 — Managers, Directors & Senior Officials
(chief executive, bank manager)
C2 — Professional Occupations (postdoc, pediatrist)
C3 — Associate Professional & Technical
(system administrator, dispensing optician)
C4 — Administrative & Secretarial (legal clerk, secretary)
C5 — Skilled Trades (electrical fitter, tailor)
C6 — Caring, Leisure, Other Service
(nursery assistant, hairdresser)
C7 — Sales & Customer Service (sales assistant, telephonist)
C8 — Process, Plant and Machine Operatives
(factory worker, van driver)
C9 — Elementary (shelf stacker, bartender)
The 9 major occupational classes (C1-9)
Forming a Twitter user data set
+ 5,191 Twitter users mapped to their occupations,
then mapped to one of the 9 SOC categories
+ 10 million tweets
+ Download the data set
[Bar chart: % of users per SOC category (C1 to C9); y-axis from 0 to 35.]
Twitter user attributes (18 in total)
number of
— followers
— friends
— followers/friends (ratio)
— times listed
— tweets
— favourites (likes)
— unique @-mentions
— tweets/day (avg.)
— retweets/tweet (avg.)
proportion of
— retweets done
— non duplicate tweets
— retweeted tweets
— hashtags
— tweets with hashtags
— tweets with @-mentions
— @-replies
— tweets with links
— tweets in English
Similarly to our paper
for user impact estimation
(Lampos et al., 2014)
Twitter user discussion topics (I)
Topics — Word clusters (#: 30, 50, 100, 200)
+ SVD on the graph laplacian of the word by word
similarity matrix using normalised PMI, i.e. a
form of spectral clustering
+ Word2vec (skip-gram with negative sampling) to
learn word embeddings; pairwise cosine
similarity on the embeddings to derive a word by
word similarity matrix; then spectral clustering on
the similarity matrix
(Bouma, 2009; von Luxburg, 2007)
(Mikolov et al., 2013)
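The first pipeline above (normalised PMI similarity followed by spectral clustering) can be sketched as follows. This is a toy version under stated assumptions: co-occurrence is counted per context (e.g. per tweet), NPMI is shifted to [0, 1] before being used as a graph weight, and only a two-way split via the second eigenvector of the normalised Laplacian is shown. The function names are hypothetical.

```python
import numpy as np

def npmi_matrix(cooc, n_contexts):
    """cooc[i, j]: contexts containing both word i and word j;
    the diagonal cooc[i, i] holds the count of contexts containing word i."""
    cooc = np.asarray(cooc, dtype=float)
    p_xy = cooc / n_contexts
    p_x = np.diag(p_xy)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_xy / np.outer(p_x, p_x))
        npmi = pmi / -np.log(p_xy)
    npmi[cooc == 0] = -1.0        # words that never co-occur
    np.fill_diagonal(npmi, 1.0)   # a word is maximally similar to itself
    return npmi

def spectral_bipartition(similarity):
    """Split nodes by the sign of the second eigenvector of the
    symmetric normalised graph Laplacian."""
    S = np.asarray(similarity, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    laplacian = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    return vecs[:, 1] > 0

# Toy counts over 100 contexts: words 0-1 and words 2-3 form two topics.
cooc = np.array([[20, 15, 1, 1],
                 [15, 20, 1, 1],
                 [1, 1, 20, 15],
                 [1, 1, 15, 20]])
npmi = npmi_matrix(cooc, n_contexts=100)
labels = spectral_bipartition((npmi + 1.0) / 2.0)  # assumption: shift NPMI to [0, 1]
```

For the 30 to 200 clusters used in the talk, one would keep several eigenvectors and cluster the rows of that embedding instead of thresholding a single vector.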
Twitter user discussion topics (II)
Topic | Most central words; most frequent words
Arts | archival, stencil, canvas, minimalist; art, design, print
Health | chemotherapy, diagnosis, disease; risk, cancer, mental, stress
Beauty Care | exfoliating, cleanser, hydrating; beauty, natural, dry, skin
Higher Education | undergraduate, doctoral, academic, students, curriculum; students, research, board, student, college, education, library
Football | bardsley, etherington, gallas; van, foster, cole, winger
Corporate | consortium, institutional, firm’s; patent, industry, reports
Elongated Words | yaaayy, wooooo, woooo, yayyyyy, yaaaaay, yayayaya, yayy; wait, till, til, yay, ahhh, hoo, woo, woot, whoop, woohoo
Politics | religious, colonialism, christianity, judaism, persecution, fascism, marxism; human, culture, justice, religion, democracy
A few words about Gaussian Processes
Why do we use Gaussian Processes?
+ Kernelised, models nonlinearities
+ Interpretability (Automatic Relevance Determination)
+ Performance
Say we have inputs $\mathbf{x} \in \mathbb{R}^d$ and we want to learn $f : \mathbb{R}^d \to \mathbb{R}$
\[
f(\mathbf{x}) \sim \mathcal{GP}\left( m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}') \right)
\]
+ mean function $m(\cdot)$ (here 0), drawn on inputs
+ covariance function (kernel) $k(\cdot, \cdot)$, drawn on pairs of inputs
Formally: sets of random variables, any finite number of
which have a multivariate Gaussian distribution
(Rasmussen & Williams, 2006)
Usually, the Squared Exponential (SE) kernel (a.k.a. RBF or Gaussian) is
used to encourage smooth functions. For the multi-dimensional pair of inputs $(\mathbf{x}, \mathbf{x}')$, this is:
\[
k_{\text{ard}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left[ - \sum_{i}^{d} \frac{(x_i - x'_i)^2}{2 l_i^2} \right]
\]
where $l_i$ are lengthscale parameters learnt only using training data.
More information about Gaussian Processes
+ Book: “Gaussian Processes for Machine Learning”

http://www.gaussianprocess.org/gpml/	
+ Video-lecture: “Gaussian Process Basics”

http://videolectures.net/gpip06_mackay_gpb/	
+ Tutorial tailored to statistical NLP tasks: “Gaussian
Processes for Natural Language Processing”

http://people.eng.unimelb.edu.au/tcohn/tutorial.html
+ Software I — GPML for Octave or MATLAB

http://www.gaussianprocess.org/gpml/code
+ Software II — GPy for Python

http://sheffieldml.github.io/GPy/
Gaussian Process classifier
\[
k_{\text{ard}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left[ - \sum_{i}^{d} \frac{(x_i - x'_i)^2}{2 l_i^2} \right]
\]
where $l_i$ are lengthscale parameters learnt only using training data by performing gradient ascent on the type-II marginal likelihood. Intuitively, the lengthscale parameter $l_i$ controls the variation along the $i$-th input dimension, i.e. a low value makes the output very sensitive to input data, thus making that input more useful for the prediction. If the lengthscales are learnt separately for each input dimension, the kernel is named SE with Automatic Relevance Determination (ARD) (Neal, 1996).
+ Squared-exponential ARD covariance function:
quantifies the relevance of each user
feature, i.e. the relevance of feature i is
inversely proportional to the length-scale
hyper-parameter $l_i$
+ 9-class classification using one vs. all
+ GP hyper-parameter learning with Expectation Propagation
+ Inference using FITC (500 inducing points)
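A small NumPy sketch of the SE-ARD covariance function, written only to illustrate how a short lengthscale makes a feature matter. The lengthscales here are set by hand (in the talk they are learnt from data), and the function name is hypothetical.

```python
import numpy as np

def se_ard_kernel(X1, X2, lengthscales, variance=1.0):
    """k(x, x') = variance * exp(-sum_i (x_i - x'_i)^2 / (2 * l_i^2))."""
    l = np.asarray(lengthscales, dtype=float)
    A = np.asarray(X1, dtype=float) / l  # rescale each dimension by its lengthscale
    B = np.asarray(X2, dtype=float) / l
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dist)

X = np.array([[0.0, 0.0],
              [1.0, 0.0],   # differs from the first point in feature 0 only
              [0.0, 1.0]])  # differs from the first point in feature 1 only
# Short lengthscale for feature 0 (relevant), long for feature 1 (nearly ignored).
K = se_ard_kernel(X, X, lengthscales=[0.5, 100.0])
```

Moving one unit along feature 0 sharply reduces the covariance, while the same move along feature 1 barely changes it, so ranking features by 1 / l_i gives a relevance ordering.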
Occupation classification performance
Accuracy (%)
Features | Logistic Regression | SVM (RBF) | Gaussian Process (SE-ARD)
User Attributes | 34 | 31.5 | 34.2
Topics (SVD) | 44.2 | 47.9 | 48.2
Topics (word2vec) | 46.9 | 51.7 | 52.7
Most frequent class baseline: 34.4%
Occupation classification insights (I)
[Figure: Higher Education (#21). CDF (user probability, 0 to 1) of the topic proportion (0.001 to 0.05) for classes C1 to C9. Topic more prevalent → CDF line closer to the bottom-right corner.]
CDF of the topic “Higher Education”: Topic more prevalent
in the upper classes (C2, which includes education
professionals, and C1), and less so in the lower classes
Occupation classification insights (II)
CDF of the topic “Arts”: Topic more prevalent in C5 (which
includes artists) and the upper classes
[Figure: Arts (#116). CDF (user probability, 0 to 1) of the topic proportion (0.001 to 0.05) for classes C1 to C9. Topic more prevalent → CDF line closer to the bottom-right corner.]
Occupation classification insights (III)
CDF of the topic “Elongated Words”: Topic more prevalent
in the lower classes, and less so in the upper classes
[Figure: Elongated Words (#164). CDF (user probability, 0 to 1) of the topic proportion (0.001 to 0.05) for classes C1 to C9. Topic more prevalent → CDF line closer to the bottom-right corner.]
Occupation classification insights (IV)
Topic distribution distance (Jensen-Shannon divergence)
for the different occupational classes (1-9)
[Figure 3: Jensen-Shannon divergence in the topic distributions between the different occupational classes (C1–9), with Occupational Class on both axes.]
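For reference, the Jensen-Shannon divergence between two topic distributions can be computed as in this sketch. It assumes base-2 logarithms (so values lie in [0, 1]) and renormalises the inputs; the toy vectors are made up for illustration.

```python
import numpy as np

def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence: average KL divergence of p and q
    from their mixture m = (p + q) / 2. Symmetric and bounded."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a = 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) / np.log(base)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic distributions for two occupational classes.
class_a = [0.5, 0.3, 0.2]
class_b = [0.2, 0.3, 0.5]
distance = js_divergence(class_a, class_b)
```

Evaluating this for every pair of the nine classes gives a 9 by 9 symmetric matrix of the kind the figure summarises.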
Occupation classification insights (V)
Topic scores for occupational class supersets
Topic | Classes 1-2 | Classes 6-9
Health | 4.45 | 2.13
Beauty Care | 1.4 | 2.24
Education | 6.04 | 2.56
Football* | 1.08 | 1.04
Corporate | 5.15 | 1.41
Elongated Words | 1.9 | 3.78
Politics | 2.14 | 1.06
* times 2 for visualisation purposes
Additional ‘perceived’ user features
+ Previously used features: Profile features, Shallow
profile features, and Topics
+ Based on the work of Volkova et al. (2015), we also
incorporated:
> Inferred Psycho-Demographic features (15)

e.g. gender, age, education level, religion, life
satisfaction, excitement, anxiety etc.
> Emotions (9)

e.g. positive / negative sentiment, joy, anger, fear,
disgust, sadness, surprise etc.
Defining the user income regression task
Table 1. Subset of the SOC classification hierarchy.
Group 112: Production Managers and Directors (50,952 GBP/year)
•Job titles: engineering manager, managing director, production manager, construction manager, quarry
manager, operations manager
Group 241: Conservation and Environment Professionals (53,679 GBP/year)
•Job titles: conservation officer, ecologist, energy conservation officer, heritage manager, marine
conservationist, energy manager, environmental consultant, environmental engineer, environmental
protection officer, environmental scientist, landfill engineer
Group 312: Draughtspersons and Related Architectural Technicians (29,167 GBP/year)
•Job titles: architectural assistant, architectural technician, construction planner, planning enforcement
officer, cartographer, draughtsman, CAD operator
Group 411: Administrative Occupations: Government and Related Organisations (20,373 GBP/year)
•Job titles: administrative assistant, civil servant, government clerk, revenue officer, benefits assistant,
trade union official, research association secretary
Group 541: Textiles and Garments Trades (18,986 GBP/year)
•Job titles: knitter, weaver, carpet weaver, curtain maker, upholsterer, curtain fitter, cobbler, leather
worker, shoe machinist, shoe repairer, hosiery cutter, dressmaker, fabric cutter, tailor, tailoress, clothing
manufacturer, embroiderer, hand sewer, sail maker, upholstery cutter
Group 622: Hairdressers and Related Services (10,793 GBP/year)
•Job titles: barber, colourist, hair stylist, hairdresser, beautician, beauty therapist, nail technician, tattooist
Group 713: Sales Supervisors (18,383 GBP/year)
•Job titles: sales supervisor, section manager, shop supervisor, retail supervisor, retail team leader
Group 813: Assemblers and Routine Operatives (22,491 GBP/year)
•Job titles: assembler, line operator, solderer, quality assurance inspector, quality auditor, quality
controller, quality inspector, test engineer, weightbridge operator, type technician
Group 913: Elementary Process Plant Occupations (17,902 GBP/year)
•Job titles: factory cleaner, hygiene operator, industrial cleaner, paint filler, packaging operator, material
handler, packer
doi:10.1371/journal.pone.0138717.t001
Same Twitter data
set as in the job
classification task
Use an income
mapping from
SOC to create
real-valued target
data for the
regression task
User income regression: data
Income prediction
[Histogram: number of users (0 to 1,000) by yearly income (£), from 10k to 100k.]
We approach the task as regression.
+ 5,191 Twitter users
mapped to their
occupations, then
mapped to an
average income in
GBP (£) using the
SOC taxonomy
+ ~11 million tweets
+ Download the data
User income regression performance
Income inference error (Mean Absolute Error) using
GP regression or a linear ensemble for all features
Feature category | MAE
Profile | £11,291
Demo | £10,110
Emotions | £10,980
Shallow | £11,456
Topics | £9,621
All features | £9,535
User income regression insights (I)
• Neutral sentiment increases with income, while both positive and negative sentiment
decrease. This uncovers that lower income users express more subjectivity online;
• Anger and fear emotions are more present in users with higher income while sadness, sur- [...]
User income regression insights (II)
Relating income and user attributes
Linear vs GP fit
[Fig 3: Linear and non-linear (GP) fit for Profile features. Variation of income as a function of user profile features. Linear fit in red, non-linear Gaussian Process fit in black. Brackets show the GP lengthscales: the lower the value, the more important the feature is for prediction. doi:10.1371/journal.pone.0138717.g003]
User income regression insights (III)
Relating income and emotion
Linear vs GP fit
[Figure: income (28,000 to 42,000) against feature value, one panel per emotion feature, with the value of l shown in brackets:
e1: positive (l=46.27), e2: neutral (l=57.64), e3: negative (l=76.34),
e4: joy (l=36.37), e5: sadness (l=67.05), e6: disgust (l=116.66),
e7: anger (l=95.50), e8: surprise (l=83.61), e9: fear (l=31.74)]
User income regression insights (IV)
Relating income and topics of discussion
Linear vs GP fit
[Figure: income (30,000 to 50,000) against feature value, one panel per topic:
Topic 107 (Justice), Topic 124 (Corporate 1), Topic 139 (Politics),
Topic 163 (NGOs), Topic 196 (Web analytics/Surveys), Topic 99 (Swearing)]
Defining a user SES classification task
Profile description on Twitter → Occupation → SOC category (1) → NS-SEC (2)
(1) Standard Occupational Classification job groups
(2) National Statistics Socio-Economic Classification:
map from the job groups in the SOC to a
socioeconomic status (SES): upper, middle or lower
UK Twitter user data set for SES classification
+ 1,342 UK Twitter user profiles
+ 2 million tweets
+ Date interval: Feb. 1, 2014 to March 21, 2015
+ Labelled with a socioeconomic status (SES),
using the occupational class proxy from SOC and
NS-SEC: upper, middle, or lower
+ 1,291 user features following the previous
paradigms, i.e. quantifying behaviour, impact,
profile info, text in tweets and topics from tweets
+ Download the data set
SES classification performance
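… using a Gaussian Process classifier
Classification | Accuracy (%) | Precision (%) | Recall (%) | F1
2 classes | 82.05 (2.4) | 82.2 (2.4) | 81.97 (2.6) | .821 (.03)
3 classes | 75.09 (3.3) | 72.04 (4.4) | 70.76 (5.7) | .714 (.05)
Confusion matrices (T: true class, O: output class, P: precision, R: recall; the bottom-right cell is the overall accuracy)
3-class classification
   | T1 | T2 | T3 | P
O1 | 606 | 84 | 53 | 81.6%
O2 | 49 | 186 | 45 | 66.4%
O3 | 55 | 48 | 216 | 67.7%
R | 85.4% | 58.5% | 68.8% | 75.1%
2-class classification (middle & lower merged)
   | T1 | T2 | P
O1 | 584 | 115 | 83.5%
O2 | 126 | 517 | 80.4%
R | 82.3% | 81.8% | 82.0%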
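The P and R margins of the 3-class confusion matrix can be re-derived from its counts. The sketch below assumes rows are the classifier's outputs (O1 to O3) and columns the true classes (T1 to T3), which is the reading under which the reported precision values match; the function name is hypothetical.

```python
import numpy as np

def precision_recall_accuracy(confusion):
    """Rows: predicted (output) class; columns: true class."""
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)
    precision = correct / confusion.sum(axis=1)  # per predicted class
    recall = correct / confusion.sum(axis=0)     # per true class
    accuracy = correct.sum() / confusion.sum()
    return precision, recall, accuracy

# Counts from the 3-class SES confusion matrix on the slide.
three_class = [[606, 84, 53],
               [49, 186, 45],
               [55, 48, 216]]
precision, recall, accuracy = precision_recall_accuracy(three_class)
print(np.round(100 * precision, 1))  # [81.6 66.4 67.7]
print(np.round(100 * recall, 1))     # [85.4 58.5 68.8]
print(round(100 * accuracy, 1))      # 75.1
```

The recall for the first class comes out at 85.4%, and the total of 1,342 counts equals the number of user profiles in the data set.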
Conclusions — Mining socio-political and
socio-economic signals from social media
collective emotion
voting intention
occupational class
income
socio-economic status
Further thoughts
+ User-generated content is a valuable asset
+ Nonlinear models tend to perform better given
the multimodality of the feature space
+ Deeper representations of text tend to improve
performance
+ Qualitative analysis is important
> Evaluation
> Interesting insights
Some of the future research challenges
+ Work closer with domain experts
+ Better understanding of online media biases,
e.g. demographics, external influence etc.
+ Generalisation, defining limitations, more
rigorous evaluation frameworks
+ Methodological improvements
+ Ethical concerns
Acknowledgements
Currently funded by
All collaborators (in alphabetical order)
in research mentioned today
Nikolaos Aletras (Amazon)
Yoram Bachrach (Microsoft Research)
Trevor Cohn (University of Melbourne)
Ingemar J. Cox (UCL)
Nello Cristianini (University of Bristol)
Thomas Lansdall-Welfare (University of Bristol)
Daniel Preotiuc-Pietro (Penn)
Svitlana Volkova (PNNL)
Bin Zou (UCL)
Thank you!
Any questions?
Slides can be downloaded from
lampos.net/talks
@lampos | lampos.net
References
Argyriou, Evgeniou & Pontil. Convex Multi-Task Feature Learning (Machine Learning, 2008)
Bernstein. Language and social class (Br J Sociol, 1960)
Bouma. Normalized (pointwise) mutual information in collocation extraction (GSCL, 2009)
Labov. The Social Stratification of English in New York City (Cambridge Univ Press, 1972; 2006, 2nd ed.)
Lampos. Detecting Events and Patterns in Large-Scale User Generated Textual Streams with Statistical Learning
Methods (Ph.D. Thesis, University of Bristol, 2012)
Lampos, Aletras, Geyti, Zou & Cox. Inferring the Socioeconomic Status of Social Media Users based on Behaviour
and Language (ECIR, 2016)
Lampos, Preotiuc-Pietro, Aletras & Cohn. Predicting and Characterising User Impact on Twitter (EACL, 2014)
Lampos, Preotiuc-Pietro & Cohn. A user-centric model of voting intention from Social Media (ACL, 2013)
Lansdall-Welfare, Lampos & Cristianini. Effects of the Recession on Public Mood in the UK (WWW, 2012)
Mairal, Jenatton, Obozinski & Bach. Network Flow Algorithms for Structured Sparsity (NIPS, 2010)
Mikolov, Chen, Corrado & Dean. Efficient estimation of word representations in vector space (ICLR, 2013)
Pennebaker, Booth & Francis. Linguistic Inquiry and Word Count: LIWC2007 (Tech. Report, 2001, 2007)
Preotiuc-Pietro, Lampos & Aletras. An analysis of the user occupational class through Twitter content (ACL, 2015)
Preotiuc-Pietro, Volkova, Lampos, Bachrach & Aletras. Studying User Income through Language, Behaviour and
Affect in Social Media (PLoS ONE, 2015)
Rasmussen & Williams. Gaussian Processes for Machine Learning (MIT Press, 2006)
Strapparava & Valitutti. WordNet-Affect: An affective extension of WordNet (LREC, 2004)
Volkova, Bachrach, Armstrong & Sharma. Inferring Latent User Properties from Texts Published in Social Media
(AAAI, 2015)
von Luxburg. A tutorial on spectral clustering (Stat Comput, 2007)
Zou & Hastie. Regularization and variable selection via the elastic net (J R Stat Soc Series B Stat Methodol, 2005)

Mining socio-political and socio-economic signals from social media content

  • 1. Mining socio-political and socio-economic signals from social media content Vasileios Lampos Department of Computer Science University College London Summer School on “Big Data & Networks in Social Sciences” University of Warwick, Sept. 21-23, 2016 @lampos | lampos.net
  • 2. Structure of the presentation 1. Introductory remarks 2. Collective inference tasks
 — Mining emotions
 — Modelling voting intention 3. Personalised inference tasks
 — Occupational class
 — Income
 — Socioeconomic status 4. Concluding remarks
  • 3. Context and motivation How can we use online user-generated content to enhance our understanding about our world? the Internet, the World Wide Web, connectivity numerous web products feeding from user activity user-generated content, publicly available, esp. on social media platforms (e.g. Twitter) large-scale digitised data, ‘Big Data’, ‘Data Science’
  • 4. Context and motivation How can we use online user-generated content to enhance our understanding about our world? the Internet, the World Wide Web, connectivity numerous web products feeding from user activity user-generated content, publicly available, esp. on social media platforms (e.g. Twitter) large-scale digitised data, ‘Big Data’, ‘Data Science’
  • 6. About Twitter > 140 characters per published status (tweet) > users can follow and be followed > embedded usage of topics (using #hashtags) > user interaction (re-tweets, @mentions, likes) > real-time nature > biased demographics (13-15% of UK’s population, age bias etc.) > information is noisy and not always accurate
  • 7. Inferring collective information 
 from user-generated content Lampos (Ph.D. Thesis, 2012) Lansdall-Welfare, Lampos & Cristianini (WWW 2012) Lampos, Preotiuc-Pietro & Cohn (ACL 2013) mood / emotions voting intention
  • 8. Emotion taxonomies and quantification ‘Emotional’ keywords, representing + anger, e.g. angry, irritate + fear, e.g. fearful, afraid + joy, e.g. cheerful, enthusiastic + sadness, e.g. depressed, gloomy + plus other emotions > WordNet Affect > Linguistic Inquiry and Word Count (LIWC) (Strapparava & Valitutti, 2004; Pennebaker et al., 2001, 2007) Simply — but maybe not good enough! — we compute the mean keyword frequency score per emotion
  • 9. Emotion taxonomies and quantification > WordNet Affect > Linguistic Inquiry and Word Count (LIWC) (Strapparava & Valitutti, 2004; Pennebaker et al., 2001, 2007) Simply — but maybe not good enough! — we compute the mean keyword frequency score per emotion ‘Emotional’ keywords, representing + anger, e.g. angry, irritate + fear, e.g. fearful, afraid + joy, e.g. cheerful, enthusiastic + sadness, e.g. depressed, gloomy + plus other emotions
  • 10. Circadian emotion patterns from Twitter (UK) Winter Summer Aggregated Data SadnessScore 3 6 9 12 15 18 21 24 -0.1 0 0.1 3 6 9 12 15 18 21 24 -0.1 0 0.1 JoyScore 3 6 9 12 15 18 21 24 -0.1 0 0.1 3 6 9 12 15 18 21 24 -0.1 0 0.1 Hourly Intervals Hourly Intervals re 1. Plots representing the variation over a 24-hour period of the emotional valen ear, sadness, joy and anger. The red line represents days in the winter, while the green one sents days in the summer. The average circadian pattern was extracted by aggregating the two nal data sets. Faded colourings represent the SE of the sample mean. 24h emotion patterns for ‘joy’ and ‘sadness’ for summer and winter with 95% confidence intervals
  • 11. ‘Joy’ time series based on Twitter (UK) 27august2012 time, in a manner that is partly predictable (or at least interpretable). We were reassured to find there was a periodic peak of joy around It is interesting that the growth in anger seems to have started before the riots them- selves, but this does not mean that we could Figure 2. Plot of the time series representing levels of joy estimator over a period of 2½ years. Notice the peaks corresponding to Christmas and New Year, Valentine’s day and the Royal Wedding Jul 09 Jan 10 Jul 10 Jan 11 Jul 11 Jan 12 −2 0 2 4 6 8 10 933 Day Time Series for Joy in Twitter Content Date NormalisedEmotionalValence * RIOTS * CUTS * XMAS * XMAS * XMAS * roy.wed. * halloween * halloween * halloween * valentine* valentine * easter * easter raw joy signal 14−day smoothed joy Clear peaking pattern during XMAS or other annual celebrations (Valentine’s Day, Easter)
  • 12. Recession, riots, and Twitter emotions (UK) Jul 09 Jan 10 Jul 10 Jan 11 Jul 11 Jan 12 −1 −0.5 0 0.5 1 1.5 Rate of Mood Change by Day using the Difference in 50−day Mean Date Differenceinmean Anger Fear Date of Budget Cuts Date of Riots Riots (UK)Budget Cuts (UK) Difference in mean mood score 50 days prior and after each date; peaks indicate increase in mood change
  • 13. Inferring voting intention: data sets. United Kingdom: + 3 political parties (Conservatives, Labour, Lib Dem) + 42,000 Twitter users distributed proportionally to the UK’s regional population figures + 60 million tweets, 80,976 1-grams + 240 polls from 30 Apr. 2010 to 13 Feb. 2012. Austria: + 4 political parties (SPÖ, ÖVP, FPÖ, GRÜ) + 1,100 active Twitter users selected by political scientists + 800,000 tweets, 22,917 1-grams + 98 polls from 25 Jan. to 25 Dec. 2012
  • 14. Regularised text regression + observations: $\mathbf{x}_i \in \mathbb{R}^m$, $i \in \{1, \dots, n\}$, forming $\mathbf{X}$ + responses: $y_i \in \mathbb{R}$, $i \in \{1, \dots, n\}$, forming $\mathbf{y}$ + weights, bias: $w_j, \beta \in \mathbb{R}$, $j \in \{1, \dots, m\}$, forming $\mathbf{w}^{*} = [\mathbf{w}; \beta]$ + model: $f(\mathbf{x}_i) = \mathbf{x}_i^{\mathsf{T}} \mathbf{w} + \beta$ + Elastic Net (Zou & Hastie, 2005): $\operatorname*{argmin}_{\mathbf{w}, \beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta - \sum_{j=1}^{m} x_{ij} w_j \right)^2 + \lambda_1 \sum_{j=1}^{m} |w_j| + \lambda_2 \sum_{j=1}^{m} w_j^2 \right\}$, where the $\lambda_1$ term is the L1-norm penalty and the $\lambda_2$ term is the L2-norm penalty
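To make the elastic net objective concrete, the sketch below evaluates it for given weights. It only computes the value of the objective (squared error plus the two penalties); it is not a solver, and the function and argument names are illustrative.

```python
import numpy as np

def elastic_net_objective(X, y, w, beta, lam1, lam2):
    """Squared error plus L1 and squared-L2 penalties on the weights.

    X: (n, m) observations, y: (n,) responses, w: (m,) weights,
    beta: scalar bias (not penalised), lam1 / lam2: penalty strengths.
    """
    residuals = y - (X @ w + beta)
    l1_penalty = lam1 * np.abs(w).sum()
    l2_penalty = lam2 * (w @ w)
    return float(residuals @ residuals + l1_penalty + l2_penalty)
```

The L1 term drives many word weights to exactly zero (feature selection), while the L2 term stabilises the solution when 1-gram frequencies are correlated.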
  • 16. Bilinear (users+text) regularised regression + users: $p \in \mathbb{Z}^{+}$ + observations: $\mathbf{Q}_i \in \mathbb{R}^{p \times m}$, $i \in \{1, \dots, n\}$, forming $\mathbf{X}$ + responses: $y_i \in \mathbb{R}$, $i \in \{1, \dots, n\}$, forming $\mathbf{y}$ + weights, bias: $u_k, w_j, \beta \in \mathbb{R}$, $k \in \{1, \dots, p\}$, $j \in \{1, \dots, m\}$, forming $\mathbf{u}, \mathbf{w}, \beta$ + model: $f(\mathbf{Q}_i) = \mathbf{u}^{\mathsf{T}} \mathbf{Q}_i \mathbf{w} + \beta$ [Diagram: $\mathbf{u}^{\mathsf{T}} \times \mathbf{Q}_i \times \mathbf{w} + \beta$]
  • 17. Bilinear elastic net (BEN) + model: $f(\mathbf{Q}_i) = \mathbf{u}^{\mathsf{T}} \mathbf{Q}_i \mathbf{w} + \beta$ + objective: $\operatorname*{argmin}_{\mathbf{u}, \mathbf{w}, \beta} \left\{ \sum_{i=1}^{n} \left( \mathbf{u}^{\mathsf{T}} \mathbf{Q}_i \mathbf{w} + \beta - y_i \right)^2 + \psi(\mathbf{u}, \theta_u) + \psi(\mathbf{w}, \theta_w) \right\}$, where $\psi(\mathbf{x}, \lambda_1, \lambda_2) = \lambda_1 \|\mathbf{x}\|_{\ell_1} + \lambda_2 \|\mathbf{x}\|_{\ell_2}^{2}$ [Diagram: $\mathbf{u}^{\mathsf{T}} \times \mathbf{Q}_i \times \mathbf{w} + \beta$; embedded slide footer: v.lampos@ucl.ac.uk, slides: http://bit.ly/1GrxI8j]
  • 18. Training bilinear elastic net (BEN) + BEN’s objective function: $\operatorname*{argmin}_{\mathbf{u}, \mathbf{w}, \beta} \left\{ \sum_{i=1}^{n} \left( \mathbf{u}^{\mathsf{T}} \mathbf{Q}_i \mathbf{w} + \beta - y_i \right)^2 + \psi(\mathbf{u}, \theta_u) + \psi(\mathbf{w}, \theta_w) \right\}$ with $\psi(\mathbf{x}, \lambda_1, \lambda_2) = \lambda_1 \|\mathbf{x}\|_{\ell_1} + \lambda_2 \|\mathbf{x}\|_{\ell_2}^{2}$ + Biconvex problem: fix u, learn w and vice versa + iterate through convex optimisation tasks until convergence + Large-scale solvers in SPAMS (Mairal et al., 2010) [Figure: global objective function during training (red) and corresponding prediction error (RMSE) on held-out data (blue); x-axis: step (2 to 30).]
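The alternating scheme can be sketched as follows. To keep each sub-problem in closed form, this illustration deliberately swaps the elastic net penalty for a ridge (squared L2 only) penalty and uses plain NumPy rather than SPAMS, so it is a simplified stand-in for BEN, not the solver used in the original work. All function names are hypothetical.

```python
import numpy as np

def ridge_with_bias(Z, y, lam):
    """Closed-form ridge regression with an unpenalised bias term."""
    A = np.hstack([Z, np.ones((Z.shape[0], 1))])
    D = np.eye(A.shape[1])
    D[-1, -1] = 0.0  # do not penalise the bias
    theta = np.linalg.solve(A.T @ A + lam * D, A.T @ y)
    return theta[:-1], theta[-1]

def objective(Q, y, u, w, beta, lam):
    """Squared error of u^T Q_i w + beta plus ridge penalties on u and w."""
    pred = np.einsum("k,ikj,j->i", u, Q, w) + beta
    return float(((pred - y) ** 2).sum() + lam * (u @ u) + lam * (w @ w))

def fit_bilinear(Q, y, lam=1.0, steps=10):
    """Alternate between the two convex sub-problems of a bilinear model.

    Q has shape (n, p, m): n observations, p users, m words.
    """
    n, p, m = Q.shape
    u, w, beta = np.ones(p) / p, np.zeros(m), 0.0
    history = []
    for _ in range(steps):
        # Fix u: each observation collapses to the m-vector u^T Q_i.
        w, beta = ridge_with_bias(np.einsum("k,ikj->ij", u, Q), y, lam)
        history.append(objective(Q, y, u, w, beta, lam))
        # Fix w: each observation collapses to the p-vector Q_i w.
        u, beta = ridge_with_bias(np.einsum("ikj,j->ik", Q, w), y, lam)
        history.append(objective(Q, y, u, w, beta, lam))
    return u, w, beta, history
```

Because each half-step solves its sub-problem exactly, the recorded objective never increases, which mirrors the decreasing red curve in the training plot.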
  • 20. Bilinear and multi-task regression + tasks: $\tau \in \mathbb{Z}^{+}$ + users: $p \in \mathbb{Z}^{+}$ + observations: $\mathbf{Q}_i \in \mathbb{R}^{p \times m}$, $i \in \{1, \dots, n\}$, forming $\mathbf{X}$ + responses: $\mathbf{y}_i \in \mathbb{R}^{\tau}$, $i \in \{1, \dots, n\}$, forming $\mathbf{Y}$ + weights, bias: $\mathbf{u}_k, \mathbf{w}_j, \boldsymbol{\beta} \in \mathbb{R}^{\tau}$, $k \in \{1, \dots, p\}$, $j \in \{1, \dots, m\}$, forming $\mathbf{U}, \mathbf{W}, \boldsymbol{\beta}$ + model: $f(\mathbf{Q}_i) = \operatorname{tr}\left( \mathbf{U}^{\mathsf{T}} \mathbf{Q}_i \mathbf{W} \right) + \boldsymbol{\beta}$ [Diagram: $\mathbf{U}^{\mathsf{T}} \times \mathbf{Q}_i \times \mathbf{W}$]
  • 21. Bilinear Group L2,1 (BGL) + a nonzero weighted feature (user or word) is encouraged to be nonzero for all tasks, but with potentially different weights + intuitive for political preference inference + model: $f(\mathbf{Q}_i) = \operatorname{tr}\left( \mathbf{U}^{\mathsf{T}} \mathbf{Q}_i \mathbf{W} \right) + \boldsymbol{\beta}$ + objective: $\operatorname*{argmin}_{\mathbf{U}, \mathbf{W}, \boldsymbol{\beta}} \left\{ \sum_{t=1}^{\tau} \sum_{i=1}^{n} \left( \mathbf{u}_t^{\mathsf{T}} \mathbf{Q}_i \mathbf{w}_t + \beta_t - y_{ti} \right)^2 + \lambda_u \sum_{k=1}^{p} \|\mathbf{U}_k\|_2 + \lambda_w \sum_{j=1}^{m} \|\mathbf{W}_j\|_2 \right\}$ [Diagram: $\mathbf{U}^{\mathsf{T}} \times \mathbf{Q}_i \times \mathbf{W}$]
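The two ingredients of the BGL objective can be sketched in a few lines. The snippet assumes `U` is stored as a (p, τ) matrix and `W` as an (m, τ) matrix, so that row k of `U` holds user k's weights across all tasks; this layout and the function names are assumptions for illustration.

```python
import numpy as np

def l21_norm(M):
    """Sum of the Euclidean norms of the rows of M (one row per feature)."""
    return float(np.linalg.norm(M, axis=1).sum())

def bilinear_multitask_predict(U, Q_i, W, beta):
    """Per-task predictions u_t^T Q_i w_t + beta_t.

    These are the diagonal entries of U^T Q_i W, shifted by the task biases.
    """
    return np.einsum("kt,kj,jt->t", U, Q_i, W) + beta
```

Penalising row norms rather than individual entries is what ties the tasks together: a user or word is either dropped for every party (its whole row shrinks to zero) or kept for all of them with task-specific weights.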
  • 22. Voting intention inference performance [Bar chart: Root Mean Squared Error (y-axis, 0 to 3) for the UK and Austria; methods: Mean poll, Last poll, Elastic Net (words), BEN, BGL; bar labels in extraction order: 1.439, 1.478, 1.699, 1.573, 1.442, 3.067, 1.47, 1.723, 1.851, 1.69]
  • 24. Voting intention comparative plots (UK) [Figure: three panels (YouGov, BEN, BGL); x-axis: time (5 to 45); y-axis: voting intention % (0 to 40); series: CON, LAB, LIB]
  • 25. Voting intention comparative plots (Austria) [Figure: three panels (Polls, BEN, BGL); x-axis: time (5 to 45); y-axis: voting intention % (0 to 30); series: SPÖ, ÖVP, FPÖ, GRÜ]
  • 26. Qualitative insights
    Party | Tweet | Score | User type
    SPÖ (centre) | Inflation rate in Austria slightly down in July from 2.2 to 2.1%. Accommodation, Water, Energy more expensive. | 0.745 | Journalist
    ÖVP (centre right) | Can really recommend the book “Res Publica” by Johannes #Voggenhuber! Food for thought and so on #Europe #Democracy | -2.323 | Normal user
    FPÖ (far right) | Campaign of the Viennese SPÖ on “Living together” plays right into the hands of right-wing populists | -3.44 | Human rights
    GRÜ (centre left) | Protest songs against the closing-down of the bachelor course of International Development: <link> #ID_remains #UniBurns #UniRage | 1.45 | Student Union
  • 27. Inferring user-level information from user-generated content: occupational class, income, socio-economic status (SES). Preotiuc-Pietro, Lampos & Aletras (ACL 2015); Preotiuc-Pietro, Volkova, Lampos, Bachrach & Aletras (PLOS ONE, 2015); Lampos, Aletras, Geyti, Zou & Cox (ECIR 2016)
  • 28. Linguistic expression and demographics “Socioeconomic variables are influencing language use.” (Bernstein, 1960; Labov, 1972/2006) + Validate this hypothesis on a broader, larger data set using social media + Applications > research, as in computational social science, health, and psychology > commercial
  • 29. Standard Occupational Classification (SOC), provided by the Office for National Statistics (UK): 9 major groups, 25 sub-major groups, 90 minor groups, 369 unit groups
    Major Group 1 (C1): Managers, Directors and Senior Officials
      Sub-major Group 11: Corporate Managers and Directors
        Minor Group 111: Chief Executives and Senior Officials
          Unit Group 1115: Chief Executives and Senior Officials (job: chief executive, bank manager)
          Unit Group 1116: Elected Officers and Representatives
        Minor Group 112: Production Managers and Directors
        Minor Group 113: Functional Managers and Directors
        Minor Group 115: Financial Institution Managers and Directors
        Minor Group 116: Managers and Directors in Transport and Logistics
        Minor Group 117: Senior Officers in Protective Services
        Minor Group 118: Health and Social Services Managers and Directors
        Minor Group 119: Managers and Directors in Retail and Wholesale
      Sub-major Group 12: Other Managers and Proprietors
    Major Group 2 (C2): Professional Occupations (job: mechanical engineer, pediatrist)
    Major Group 3 (C3): Associate Professional and Technical Occupations (job: system administrator, dispensing optician)
    Major Group 4 (C4): Administrative and Secretarial Occupations (job: legal clerk, company secretary)
    Major Group 5 (C5): Skilled Trades Occupations (job: electrical fitter, tailor)
    Major Group 6 (C6): Caring, Leisure and Other Service Occupations (job: nursery assistant, hairdresser)
    Major Group 7 (C7): Sales and Customer Service Occupations (job: sales assistant, telephonist)
    Major Group 8 (C8): Process, Plant and Machine Operatives (job: factory worker, van driver)
    Major Group 9 (C9): Elementary Occupations (job: shelf stacker, bartender)
  • 30. Standard Occupational Classification (SOC) C1 — Managers, Directors & Senior Officials (chief executive, bank manager) C2 — Professional Occupations (postdoc, pediatrist) C3 — Associate Professional & Technical (system administrator, dispensing optician) C4 — Administrative & Secretarial (legal clerk, secretary) C5 — Skilled Trades (electrical fitter, tailor) C6 — Caring, Leisure, Other Service (nursery assistant, hairdresser) C7 — Sales & Customer Service (sales assistant, telephonist) C8 — Process, Plant and Machine Operatives (factory worker, van driver) C9 — Elementary (shelf stacker, bartender) The 9 major occupational classes (C1-9)
  • 31. Forming a Twitter user data set + 5,191 Twitter users mapped to their occupations, then mapped to one of the 9 SOC categories + 10 million tweets + Download the data set [Bar chart: % of users per SOC category (y-axis, 0 to 35) for C1 to C9]
  • 32. Twitter user attributes (18 in total), similarly to our paper for user impact estimation (Lampos et al., 2014)
    number of: followers; friends; followers/friends (ratio); times listed; tweets; favourites (likes); unique @-mentions; tweets/day (avg.); retweets/tweet (avg.)
    proportion of: retweets done; non-duplicate tweets; retweeted tweets; hashtags; tweets with hashtags; tweets with @-mentions; @-replies; tweets with links; tweets in English
  • 33. Twitter user discussion topics (I) Topics: word clusters (#: 30, 50, 100, 200) + SVD on the graph Laplacian of the word-by-word similarity matrix using normalised PMI, i.e. a form of spectral clustering (Bouma, 2009; von Luxburg, 2007) + Word2vec (skip-gram with negative sampling) to learn word embeddings; pairwise cosine similarity on the embeddings to derive a word-by-word similarity matrix; then spectral clustering on the similarity matrix (Mikolov et al., 2013)
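The similarity measure behind the first clustering variant can be sketched as below. The formula is the standard normalised PMI (Bouma, 2009); how co-occurrence is counted (for example, two words appearing in the same tweet) is an assumption here, and the function name is illustrative.

```python
import math

def npmi(count_xy, count_x, count_y, total):
    """Normalised pointwise mutual information in [-1, 1].

    count_xy: contexts containing both words; count_x, count_y: contexts
    containing each word; total: number of contexts.
    """
    if count_xy == 0:
        return -1.0  # limiting value when two words never co-occur
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    if p_xy == 1.0:
        return 1.0  # both words occur in every context
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
```

Evaluating this for every word pair yields the word-by-word similarity matrix on which the spectral clustering step operates.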
  • 34. Twitter user discussion topics (II)
    Topic | Most central words | Most frequent words
    Arts | archival, stencil, canvas, minimalist | art, design, print
    Health | chemotherapy, diagnosis, disease | risk, cancer, mental, stress
    Beauty Care | exfoliating, cleanser, hydrating | beauty, natural, dry, skin
    Higher Education | undergraduate, doctoral, academic, students, curriculum | students, research, board, student, college, education, library
    Football | bardsley, etherington, gallas | van, foster, cole, winger
    Corporate | consortium, institutional, firm’s | patent, industry, reports
    Elongated Words | yaaayy, wooooo, woooo, yayyyyy, yaaaaay, yayayaya, yayy | wait, till, til, yay, ahhh, hoo, woo, woot, whoop, woohoo
    Politics | religious, colonialism, christianity, judaism, persecution, fascism, marxism | human, culture, justice, religion, democracy
  • 35. A few words about Gaussian Processes (Rasmussen & Williams, 2006) Why do we use Gaussian Processes? + Kernelised, models nonlinearities + Interpretability (Automatic Relevance Determination) + Performance. Say $\mathbf{x} \in \mathbb{R}^d$ and we want to learn $f : \mathbb{R}^d \to \mathbb{R}$. Formally: sets of random variables, any finite number of which have a multivariate Gaussian distribution. $f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$, where $m(\cdot)$ is the mean function drawn on inputs (here 0) and $k(\cdot, \cdot)$ is the covariance function (kernel) drawn on pairs of inputs. Usually, the Squared Exponential (SE) kernel (a.k.a. RBF or Gaussian) is used to encourage smooth functions. For the multi-dimensional pair of inputs $(\mathbf{x}, \mathbf{x}')$, this is: $k_{\text{ard}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left[ -\sum_{i=1}^{d} \frac{(x_i - x'_i)^2}{2 l_i^2} \right]$, where $l_i$ are lengthscale parameters.
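The SE kernel with per-dimension lengthscales can be written directly from the formula above. This is a plain NumPy sketch for illustration (the name `se_ard_kernel` is made up here); toolkits such as GPML or GPy provide their own implementations.

```python
import numpy as np

def se_ard_kernel(X1, X2, sigma2, lengthscales):
    """k(x, x') = sigma2 * exp(-sum_i (x_i - x'_i)^2 / (2 * l_i^2)).

    X1: (a, d) and X2: (b, d) input matrices; returns the (a, b) kernel matrix.
    """
    l = np.asarray(lengthscales, dtype=float)
    diff = (X1[:, None, :] - X2[None, :, :]) / l  # scaled pairwise differences
    return sigma2 * np.exp(-0.5 * (diff ** 2).sum(axis=-1))
```

A large lengthscale on a dimension makes the kernel nearly insensitive to differences along it, which is why small lengthscales signal relevant features.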
  • 36. More information about Gaussian Processes + Book: “Gaussian Processes for Machine Learning”
 http://www.gaussianprocess.org/gpml/ + Video-lecture: “Gaussian Process Basics”
 http://videolectures.net/gpip06_mackay_gpb/ + Tutorial tailored to statistical NLP tasks: “Gaussian Processes for Natural Language Processing”
 http://people.eng.unimelb.edu.au/tcohn/tutorial.html + Software I — GPML for Octave or MATLAB
 http://www.gaussianprocess.org/gpml/code + Software II — GPy for Python
 http://sheffieldml.github.io/GPy/
  • 37. Gaussian Process classifier + Squared-exponential ARD covariance function, $k_{\text{ard}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left[ -\sum_{i=1}^{d} \frac{(x_i - x'_i)^2}{2 l_i^2} \right]$: quantifies the relevance of each user feature, i.e. the relevance of feature i is inversely proportional to the length-scale hyper-parameter $l_i$ + 9-class classification using one vs. all + GP hyper-parameter learning with Expectation Propagation + Inference using FITC (500 inducing points). [Paper excerpt: the $l_i$ are lengthscale parameters learnt only using training data by performing gradient ascent on the type-II marginal likelihood. Intuitively, the lengthscale parameter $l_i$ controls the variation along the i-th input dimension, i.e. a low value makes the output very sensitive to input data, thus making that input more useful for the prediction. If the lengthscales are learnt separately for each input dimension, the kernel is named SE with Automatic Relevance Determination (ARD) (Neal, 1996).]
  • 38. Occupation classification performance: accuracy (%); most frequent class baseline: 34.4%
    Method | User Attributes | Topics (SVD) | Topics (word2vec)
    Logistic Regression | 34 | 44.2 | 46.9
    SVM (RBF) | 31.5 | 47.9 | 51.7
    Gaussian Process (SE-ARD) | 34.2 | 48.2 | 52.7
  • 41. Occupation classification insights (I) CDF of the topic “Higher Education” (#21): topic more prevalent in the upper classes (C2, which includes education professionals, and C1), and less so in the lower classes. [Figure: one CDF line per class C1 to C9; x-axis: topic proportion (0.001, 0.01, 0.05); y-axis: user probability (0 to 1). Topic more prevalent → CDF line closer to bottom-right corner.]
  • 42. Occupation classification insights (II) CDF of the topic “Arts” (#116): topic more prevalent in C5 (which includes artists) and the upper classes. [Figure: one CDF line per class C1 to C9; x-axis: topic proportion (0.001, 0.01, 0.05); y-axis: user probability (0 to 1). Topic more prevalent → CDF line closer to bottom-right corner.]
  • 44. Occupation classification insights (III) CDF of the topic “Elongated Words” (#164): topic more prevalent in the lower classes, and less so in the upper classes. [Figure: one CDF line per class C1 to C9; x-axis: topic proportion (0.001, 0.01, 0.05); y-axis: user probability (0 to 1). Topic more prevalent → CDF line closer to bottom-right corner.]
  • 46. Occupation classification insights (IV) Topic distribution distance (Jensen-Shannon divergence) for the different occupational classes (1-9). [Figure 3: Jensen-Shannon divergence in the topic distributions between the different occupational classes (C 1–9); both axes: occupational class.]
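The distance used in this figure can be computed as sketched below. The function takes two topic distributions (for example, the average topic proportions of two occupational classes) and uses base-2 logarithms so that the result lies in [0, 1]; the choice of base and the function name are assumptions, since the slide does not state them.

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base 2)."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike the KL divergence, this measure is symmetric and finite even when one class never uses a topic, which makes it suitable for a class-by-class distance matrix.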
  • 49. Occupation classification insights (V) Topic scores for occupational class supersets. [Bar chart: Classes 1-2 vs Classes 6-9 for the topics Health, Beauty Care, Education, Football*, Corporate, Elongated Words, Politics; bar labels in extraction order: 1.06, 3.78, 1.41, 1.04, 2.56, 2.24, 2.13, 2.14, 1.9, 5.15, 1.08, 6.04, 1.4, 4.45; * times 2 for visualisation purposes]
  • 50. Additional ‘perceived’ user features + Previously used features: Profile features, Shallow profile features, and Topics + Based on the work of Volkova et al. (2015), we also incorporated: > Inferred Psycho-Demographic features (15), e.g. gender, age, education level, religion, life satisfaction, excitement, anxiety > Emotions (9), e.g. positive / negative sentiment, joy, anger, fear, disgust, sadness, surprise
  • 51. Defining the user income regression task + Same Twitter data set as in the job classification task + Use an income mapping from SOC to create real-valued target data for the regression task
    Table 1. Subset of the SOC classification hierarchy (doi:10.1371/journal.pone.0138717.t001)
    Group 112: Production Managers and Directors (50,952 GBP/year). Job titles: engineering manager, managing director, production manager, construction manager, quarry manager, operations manager
    Group 241: Conservation and Environment Professionals (53,679 GBP/year). Job titles: conservation officer, ecologist, energy conservation officer, heritage manager, marine conservationist, energy manager, environmental consultant, environmental engineer, environmental protection officer, environmental scientist, landfill engineer
    Group 312: Draughtspersons and Related Architectural Technicians (29,167 GBP/year). Job titles: architectural assistant, architectural technician, construction planner, planning enforcement officer, cartographer, draughtsman, CAD operator
    Group 411: Administrative Occupations: Government and Related Organisations (20,373 GBP/year). Job titles: administrative assistant, civil servant, government clerk, revenue officer, benefits assistant, trade union official, research association secretary
    Group 541: Textiles and Garments Trades (18,986 GBP/year). Job titles: knitter, weaver, carpet weaver, curtain maker, upholsterer, curtain fitter, cobbler, leather worker, shoe machinist, shoe repairer, hosiery cutter, dressmaker, fabric cutter, tailor, tailoress, clothing manufacturer, embroiderer, hand sewer, sail maker, upholstery cutter
    Group 622: Hairdressers and Related Services (10,793 GBP/year). Job titles: barber, colourist, hair stylist, hairdresser, beautician, beauty therapist, nail technician, tattooist
    Group 713: Sales Supervisors (18,383 GBP/year). Job titles: sales supervisor, section manager, shop supervisor, retail supervisor, retail team leader
    Group 813: Assemblers and Routine Operatives (22,491 GBP/year). Job titles: assembler, line operator, solderer, quality assurance inspector, quality auditor, quality controller, quality inspector, test engineer, weighbridge operator, type technician
    Group 913: Elementary Process Plant Occupations (17,902 GBP/year). Job titles: factory cleaner, hygiene operator, industrial cleaner, paint filler, packaging operator, material handler, packer
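Turning occupations into regression targets amounts to a lookup from SOC group to income. The sketch below uses only the nine groups and incomes listed on this slide; the input format (a dictionary from user to SOC minor group) and the function name are hypothetical.

```python
# Yearly income (GBP) per SOC minor group, copied from the subset on this slide.
SOC_INCOME_GBP = {
    112: 50952, 241: 53679, 312: 29167, 411: 20373, 541: 18986,
    622: 10793, 713: 18383, 813: 22491, 913: 17902,
}

def income_targets(user_groups, income_by_group=SOC_INCOME_GBP):
    """Map {user: SOC minor group} to {user: income}, skipping unknown groups."""
    return {
        user: float(income_by_group[group])
        for user, group in user_groups.items()
        if group in income_by_group
    }
```

Every user in the same group receives the same target value, so the regression learns occupation-level income rather than an individual's actual salary.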
  • 52. User income regression: data. We approach the task as regression. + 5,191 Twitter users mapped to their occupations, then mapped to an average income in GBP (£) using the SOC taxonomy + ~11 million tweets + Download the data [Histogram: number of users (y-axis, 0 to 1000) by yearly income in £ (x-axis: 10k, 30k, 50k, 100k)]
  • 53. User income regression performance: income inference error (Mean Absolute Error) using GP regression or a linear ensemble for all features. [Bar chart: MAE (y-axis, £9,000 to £11,500) per feature category; categories: Profile, Demo, Emotions, Shallow, Topics, All features; bar labels in extraction order: £9,535, £9,621, £11,456, £10,980, £10,110, £11,291]
  • 54. User income regression insights (I) [Figure from “Studying User Income in Social Media”]
  • 55. User income regression insights (II) Relating income and user attributes: linear vs GP fit. [Fig 3: Linear and non-linear (GP) fit for Profile features. Variation of income as a function of user profile features. Linear fit in red, non-linear Gaussian Process fit in black. Brackets show the GP lengthscales: the lower the value, the more important the feature is for prediction. doi:10.1371/journal.pone.0138717.g003] Paper excerpt: + Neutral sentiment increases with income, while both positive and negative sentiment decrease. This uncovers that lower income users express more subjectivity online + Anger and fear emotions are more present in users with higher income while sadness, sur…
  • 56. User income regression insights (III) Relating income and emotion: linear vs GP fit. GP lengthscales (l) per feature: e1 positive (l=46.27); e2 neutral (l=57.64); e3 negative (l=76.34); e4 joy (l=36.37); e5 sadness (l=67.05); e6 disgust (l=116.66); e7 anger (l=95.50); e8 surprise (l=83.61); e9 fear (l=31.74). [Figure: nine panels, one per feature; x-axis: feature value; y-axis: income (28,000 to 42,000)]
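Since a lower lengthscale means a more important feature, the values on this slide can be turned into a relevance ranking. The sketch below uses the nine lengthscales exactly as listed above and scores each feature by the inverse of its lengthscale; the function name is illustrative.

```python
# GP lengthscales per emotion feature, as reported on this slide.
LENGTHSCALES = {
    "positive": 46.27, "neutral": 57.64, "negative": 76.34,
    "joy": 36.37, "sadness": 67.05, "disgust": 116.66,
    "anger": 95.50, "surprise": 83.61, "fear": 31.74,
}

def rank_by_relevance(lengthscales):
    """Sort features from most to least relevant (relevance = 1 / lengthscale)."""
    return sorted(lengthscales, key=lambda name: 1.0 / lengthscales[name], reverse=True)
```

Under this reading, fear and joy are the emotion features the GP relies on most for income, and disgust the least.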
  • 57. User income regression insights (IV) Relating income and topics of discussion: linear vs GP fit. [Figure: six panels: Topic 107 (Justice), Topic 124 (Corporate 1), Topic 139 (Politics), Topic 163 (NGOs), Topic 196 (Web analytics/Surveys), Topic 99 (Swearing); x-axis: feature value; y-axis: income (30,000 to 50,000)]
  • 58. Defining a user SES classification task: Profile description on Twitter → Occupation → SOC category [1] → NS-SEC [2]. [1] Standard Occupational Classification job groups. [2] National Statistics Socio-Economic Classification: map from the job groups in the SOC to a socioeconomic status (SES): upper, middle or lower
  • 59. UK Twitter user data set for SES classification + 1,342 UK Twitter user profiles + 2 million tweets + Date interval: Feb. 1, 2014 to March 21, 2015 + Labelled with a socioeconomic status (SES), using the occupational class proxy from SOC and NS-SEC: upper, middle, or lower + 1,291 user features following the previous paradigms, i.e. quantifying behaviour, impact, profile info, text in tweets and topics from tweets + Download the data set
  • 60. SES classification performance, using a Gaussian Process classifier
    Classification | Accuracy (%) | Precision (%) | Recall (%) | F1
    2 classes | 82.05 (2.4) | 82.2 (2.4) | 81.97 (2.6) | .821 (.03)
    3 classes | 75.09 (3.3) | 72.04 (4.4) | 70.76 (5.7) | .714 (.05)
    3-class classification (P: precision, R: recall, bottom-right cell: accuracy)
       | T1 | T2 | T3 | P
    O1 | 606 | 84 | 53 | 81.6%
    O2 | 49 | 186 | 45 | 66.4%
    O3 | 55 | 48 | 216 | 67.7%
    R | 85.4% | 58.5% | 68.8% | 75.1%
    2-class classification, middle & lower merged
       | T1 | T2 | P
    O1 | 584 | 115 | 83.5%
    O2 | 126 | 517 | 80.4%
    R | 82.3% | 81.8% | 82.0%
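The per-class figures in the 3-class confusion matrix can be recomputed from its counts, as sketched below. The counts are those shown on the slide; treating rows (O1 to O3) as classifier outputs and columns (T1 to T3) as true classes is an inference from the fact that the P column equals row-wise precision. Recomputing this way gives 85.4% for the first recall value.

```python
import numpy as np

# Rows: O1-O3, columns: T1-T3, counts as shown on the slide.
CONFUSION_3 = np.array([[606, 84, 53],
                        [49, 186, 45],
                        [55, 48, 216]])

def precision_recall_accuracy(confusion):
    """Per-class precision and recall (in %) plus overall accuracy (in %)."""
    correct = np.diag(confusion)
    precision = 100 * correct / confusion.sum(axis=1)  # normalise each row
    recall = 100 * correct / confusion.sum(axis=0)     # normalise each column
    accuracy = 100 * correct.sum() / confusion.sum()
    return precision, recall, accuracy
```

The matrix sums to 1,342 users, matching the size of the SES data set described on the previous slide.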
  • 61. Conclusions: mining socio-political and socio-economic signals from social media + collective emotion + voting intention + occupational class + income + socio-economic status
  • 62. Further thoughts + User-generated content is a valuable asset + Nonlinear models tend to perform better given the multimodality of the feature space + Deeper representations of text tend to improve performance + Qualitative analysis is important > Evaluation > Interesting insights
  • 63. Some of the future research challenges + Work closer with domain experts + Better understanding of online media biases, e.g. demographics, external influence etc. + Generalisation, defining limitations, more rigorous evaluation frameworks + Methodological improvements + Ethical concerns
  • 64. Acknowledgements. Currently funded by [funder logos]. All collaborators (in alphabetical order) in research mentioned today: Nikolaos Aletras (Amazon); Yoram Bachrach (Microsoft Research); Trevor Cohn (University of Melbourne); Ingemar J. Cox (UCL); Nello Cristianini (University of Bristol); Thomas Lansdall-Welfare (University of Bristol); Daniel Preotiuc-Pietro (Penn); Svitlana Volkova (PNNL); Bin Zou (UCL)
  • 65. Thank you! Any questions? Slides can be downloaded from lampos.net/talks @lampos | lampos.net
  • 66. References
    Argyriou, Evgeniou & Pontil. Convex Multi-Task Feature Learning (Machine Learning, 2008)
    Bernstein. Language and social class (Br J Sociol, 1960)
    Bouma. Normalized (pointwise) mutual information in collocation extraction (GSCL, 2009)
    Labov. The Social Stratification of English in New York City (Cambridge Univ Press, 1972; 2006, 2nd ed.)
    Lampos. Detecting Events and Patterns in Large-Scale User Generated Textual Streams with Statistical Learning Methods (Ph.D. Thesis, University of Bristol, 2012)
    Lampos, Aletras, Geyti, Zou & Cox. Inferring the Socioeconomic Status of Social Media Users based on Behaviour and Language (ECIR, 2016)
    Lampos, Preotiuc-Pietro, Aletras & Cohn. Predicting and Characterising User Impact on Twitter (EACL, 2014)
    Lampos, Preotiuc-Pietro & Cohn. A user-centric model of voting intention from Social Media (ACL, 2013)
    Lansdall-Welfare, Lampos & Cristianini. Effects of the Recession on Public Mood in the UK (WWW, 2012)
    Mairal, Jenatton, Obozinski & Bach. Network Flow Algorithms for Structured Sparsity (NIPS, 2010)
    Mikolov, Chen, Corrado & Dean. Efficient estimation of word representations in vector space (ICLR, 2013)
    Pennebaker, Booth & Francis. Linguistic Inquiry and Word Count: LIWC2007 (Tech. Report, 2001, 2007)
    Preotiuc-Pietro, Lampos & Aletras. An analysis of the user occupational class through Twitter content (ACL, 2015)
    Preotiuc-Pietro, Volkova, Lampos, Bachrach & Aletras. Studying User Income through Language, Behaviour and Affect in Social Media (PLoS ONE, 2015)
    Rasmussen & Williams. Gaussian Processes for Machine Learning (MIT Press, 2006)
    Strapparava & Valitutti. WordNet-Affect: An affective extension of WordNet (LREC, 2004)
    Volkova, Bachrach, Armstrong & Sharma. Inferring Latent User Properties from Texts Published in Social Media (AAAI, 2015)
    von Luxburg. A tutorial on spectral clustering (Stat Comput, 2007)
    Zou & Hastie. Regularization and variable selection via the elastic net (J R Stat Soc Series B Stat Methodol, 2005)