8th Extended Semantic Web Conference (ESWC 2011), Heraklion, Greece
At this event, the OU team presented their work on predicting
discussions on the Social Semantic Web by (a) identifying seed
posts, then (b) predicting the level of discussion that such
posts will generate. This analysis helps policy makers predict
which discussions and users will attract higher levels of
attention within the community.
1. Predicting Discussions on
the Social Semantic Web
Matthew Rowe, Sofia Angeletou and
Harith Alani
Knowledge Media Institute, The Open
University, Milton Keynes, United Kingdom
2. Mass of Social Data
Social content is now published at a staggering
rate….
Predicting Discussions on the Social Semantic Web 1
3. Social Data
Publication Rates
• ~600 Tweets per second [1]
• ~700 Facebook status updates per second [1]
• Spinn3r dataset collected from Jan – Feb 2011
[2]
– 133 million blog posts
– 5.7 million forum posts
– 231 million social media posts
[1] http://searchengineland.com/by-the-numbers-twitter-vs-facebook-vs-google-buzz-36709
[2] http://icwsm.org/data/index.php
4. The New
Information Era
5. But…. Analysis is
Limited
• Market Analysts
– What are people saying about my products?
• Opinion Mining
– How are people perceiving a given subject or
topic?
• eGovernment Policy Makers
– How is a policy or law received by the public?
– How can I maximise feedback to my content?
6. Attention
Economics
• Given all this data…
How do we decide on what information
to focus on?
How do we know what posts will
evolve into discussions?
• Attention Economics (Goldhaber, 1997)
• Need to understand key indicators of high-
attention discussions
7. Discussions on
Twitter
• Twitter is used as a medium to:
– Share opinions and ideas
– Engage in discussions
• Discussing events
• Debating topics
• Identifying online discussions enables:
– Up-to-date public opinion
– Observation of topics of interest
– Gauging the popularity of government policies
– Fine-grained customer support
8. Predicting
Discussions
• Pre-empt discussions on the Social Web:
1. Identifying seed posts
• i.e. posts that start a discussion
• Will a given post start a discussion?
• What are the key features of seed posts?
2. Predicting discussion activity levels
• What is the level of discussion that a seed post
will generate?
• What are the key factors of lengthy discussions?
9. The Need for
Semantics
• For predictions we require statistical features
– User features
– Content features
• Features provided using differing schemas by
different platforms
– How to overcome heterogeneity?
• Currently, no ontologies capture such features
10. Behaviour Ontology
www.purl.org/NET/oubo/0.23/
11. Features

The likelihood of posts eliciting replies depends upon popularity, a highly subjective term influenced by external factors. Properties influencing popularity include user attributes - describing the reputation of the user - and attributes of a post's content - generally referred to as content features. In Table 1 we define user and content features and study their influence on the discussion "continuation".

Table 1. User and Content Features

User Features
• InDegree: number of followers of U
• OutDegree: number of users U follows
• ListDegree: number of lists U appears on; lists group users by topic
• PostCount: total number of posts the user has ever posted
• UserAge: number of minutes from the user join date
• PostRate: posting frequency of the user, PostCount / UserAge

Content Features
• Post length: length of the post in characters
• Complexity: cumulative entropy of the n unique words in post p of total word length λ, with p_i the frequency of each word: (1/λ) Σ_{i ∈ [1,n]} p_i (log λ − log p_i)
• Uppercase count: number of uppercase words
• Readability: Gunning fog index using average sentence length (ASL) and the percentage of complex words (PCW) [7]: 0.4 (ASL + PCW)
• Verb count: number of verbs
• Noun count: number of nouns
• Adjective count: number of adjectives
• Referral count: number of @user mentions
• Time in the day: normalised time in the day, measured in minutes
• Informativeness: terminological novelty of the post wrt other posts; the cumulative TF-IDF value of each term t in post p: Σ_{t ∈ p} tfidf(t, p)
• Polarity: cumulation of polar term weights in p (using the SentiWordNet 3 lexicon), normalised by the polar term count: (Po + Ne) / |terms|
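As a rough illustration, the Complexity and Readability features defined in Table 1 might be computed as follows. This is a sketch, not the authors' code: the tokenisation and the complex-word test (3+ vowel groups as a syllable proxy) are simplifying assumptions.

```python
import math
import re

def complexity(post: str) -> float:
    """Cumulative entropy of the unique words in a post:
    (1/lam) * sum_i p_i * (log(lam) - log(p_i)), where lam is the
    total word count and p_i the frequency of the i-th unique word."""
    words = re.findall(r"[a-z']+", post.lower())
    lam = len(words)
    if lam == 0:
        return 0.0
    freqs = {w: words.count(w) for w in set(words)}
    return sum(p * (math.log(lam) - math.log(p))
               for p in freqs.values()) / lam

def gunning_fog(post: str) -> float:
    """Gunning fog index: 0.4 * (ASL + PCW), where ASL is the average
    sentence length and PCW the percentage of complex words (crudely
    approximated here as words with 3+ vowel groups)."""
    sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", post)
    if not sentences or not words:
        return 0.0
    asl = len(words) / len(sentences)
    complex_words = [w for w in words
                     if len(re.findall(r"[aeiouy]+", w.lower())) >= 3]
    pcw = 100.0 * len(complex_words) / len(words)
    return 0.4 * (asl + pcw)
```

A post consisting of one repeated word has zero complexity, while a post of distinct words reaches the maximum entropy for its length.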
12. Identifying Seed Posts

• Experiments
– Haiti and Union Address datasets
– Divided each dataset up using a 70/20/10 split for training/validation/testing
• Evaluated a binary classification task
– Is this post a seed post or not?
• Precision, Recall, F1 and Area under ROC
• Tested: user, content, user+content features
– Tested Perceptron, SVM, Naïve Bayes and J48

4.2 Experiments

Experiments are intended to test the performance of different classification models. Replies can have different motives which do not necessarily designate a response to the initial message. Therefore, we only investigate explicit replies to messages. To gather our discussions, and our seed posts, we iteratively move up the reply chain - i.e., from reply to parent post - until we reach the seed post in the discussion. We define this process as dataset enrichment; it is performed by querying Twitter's REST API using the in_reply_to_id of the parent post, and moving one step at a time up the reply chain. This same approach has been employed successfully in work by [12] to gather a large-scale conversation dataset from Twitter.

Table 2. Statistics of the datasets

Dataset       | Users  | Tweets | Seeds | Non-Seeds | Replies
Haiti         | 44,497 | 65,022 | 1,405 | 60,686    | 2,931
Union Address | 66,300 | 80,272 | 7,228 | 55,169    | 17,875

Table 2 shows the statistics of our collected datasets. One can observe the difference in conversational tweets between the two corpora, where the Haiti dataset contains fewer seed posts as a percentage than the Union Address dataset, and therefore fewer replies. However, as we explain in a later section, this does not correlate with a higher discussion volume in the former dataset. We convert the collected datasets from their proprietary JSON formats into triples, annotated using concepts from our behaviour ontology; this enables our features to be derived by querying our datasets using basic SPARQL queries.
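The dataset-enrichment step (walking from a reply up the chain to its seed post) can be sketched as below. `fetch_post` is a hypothetical placeholder for a lookup against Twitter's REST API; posts are modelled as plain dicts with an `in_reply_to_id` field.

```python
def build_reply_chain(post, fetch_post):
    """Walk up the reply chain one step at a time, from a reply to its
    parent, until a post with no in_reply_to_id (the seed post) is
    reached. Returns the chain ordered seed-first.

    `post` is a dict with an 'in_reply_to_id' key; `fetch_post` maps a
    post id to such a dict (e.g. via a REST API call)."""
    chain = [post]
    while post.get("in_reply_to_id") is not None:
        post = fetch_post(post["in_reply_to_id"])
        chain.append(post)
    chain.reverse()  # seed post first, final reply last
    return chain
```

In practice each `fetch_post` call is one API request, so enriching a long discussion costs one request per step up the chain.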
13. Identifying Seed Posts

We selected the best performing model based on the F1 score, and used this model together with the best performing features, using a ranking heuristic, to classify posts contained within our test split. We first report on the results obtained from our model selection phase, before moving onto our results from using the best model with the top-k features.

Table 3. Results from the classification of seed posts using varying feature sets and classification models

(a) Haiti Dataset

Features | Model | P     | R     | F1    | ROC
User     | Perc  | 0.794 | 0.528 | 0.634 | 0.727
User     | SVM   | 0.843 | 0.159 | 0.267 | 0.566
User     | NB    | 0.948 | 0.269 | 0.420 | 0.785
User     | J48   | 0.906 | 0.679 | 0.776 | 0.822
Content  | Perc  | 0.875 | 0.077 | 0.142 | 0.606
Content  | SVM   | 0.552 | 0.727 | 0.627 | 0.589
Content  | NB    | 0.721 | 0.638 | 0.677 | 0.769
Content  | J48   | 0.685 | 0.705 | 0.695 | 0.711
All      | Perc  | 0.794 | 0.528 | 0.634 | 0.726
All      | SVM   | 0.483 | 0.996 | 0.651 | 0.502
All      | NB    | 0.962 | 0.280 | 0.434 | 0.852
All      | J48   | 0.824 | 0.775 | 0.798 | 0.836

(b) Union Address Dataset

Features | Model | P     | R     | F1    | ROC
User     | Perc  | 0.658 | 0.697 | 0.677 | 0.673
User     | SVM   | 0.510 | 0.946 | 0.663 | 0.512
User     | NB    | 0.844 | 0.086 | 0.157 | 0.707
User     | J48   | 0.851 | 0.722 | 0.782 | 0.830
Content  | Perc  | 0.467 | 0.698 | 0.560 | 0.457
Content  | SVM   | 0.650 | 0.589 | 0.618 | 0.638
Content  | NB    | 0.762 | 0.212 | 0.332 | 0.649
Content  | J48   | 0.740 | 0.533 | 0.619 | 0.736
All      | Perc  | 0.630 | 0.762 | 0.690 | 0.672
All      | SVM   | 0.499 | 0.990 | 0.664 | 0.506
All      | NB    | 0.874 | 0.212 | 0.341 | 0.737
All      | J48   | 0.890 | 0.810 | 0.848 | 0.877

4.3 Results

Our findings from Table 3 demonstrate the effectiveness of using solely user features for identifying seed posts in both the Haiti and Union Address datasets.
14. Identifying Seed Posts

• What are the most important features?

We compared the two feature rankings using Pearson correlation, which we found to be 0.674, indicating a good correlation between the two lists and their respective ranks.

Table 4. Features ranked by Information Gain Ratio wrt the Seed Post class label. The feature name is paired with its IG in brackets.

Rank | Haiti                           | Union Address
1    | user-list-degree (0.275)        | user-list-degree (0.319)
2    | user-in-degree (0.221)          | content-time-in-day (0.152)
3    | content-informativeness (0.154) | user-in-degree (0.133)
4    | user-num-posts (0.111)          | user-num-posts (0.104)
5    | content-time-in-day (0.089)     | user-post-rate (0.075)
6    | user-post-rate (0.075)          | user-out-degree (0.056)
7    | content-polarity (0.064)        | content-referral-count (0.030)
8    | user-out-degree (0.040)         | user-age (0.015)
9    | content-referral-count (0.038)  | content-polarity (0.015)
10   | content-length (0.020)          | content-length (0.010)
11   | content-readability (0.018)     | content-complexity (0.004)
12   | user-age (0.015)                | content-noun-count (0.002)
13   | content-uppercase-count (0.012) | content-readability (0.001)
14   | content-noun-count (0.010)      | content-verb-count (0.001)
15   | content-adj-count (0.005)       | content-adj-count (0.0)
16   | content-complexity (0.0)        | content-informativeness (0.0)
17   | content-verb-count (0.0)        | content-uppercase-count (0.0)
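The Information Gain Ratio scores in Table 4 measure how much knowing a (discretised) feature reduces uncertainty about the seed/non-seed label, normalised by the feature's own entropy. A minimal sketch of the generic computation (not the authors' implementation, which presumably used a toolkit such as Weka):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(values).values())

def info_gain_ratio(feature_values, labels):
    """Information Gain Ratio of a discretised feature wrt class
    labels: IG(feature) / H(feature)."""
    n = len(labels)
    # Expected entropy of the labels after splitting on the feature
    split_entropy = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        split_entropy += len(subset) / n * entropy(subset)
    ig = entropy(labels) - split_entropy
    iv = entropy(feature_values)  # intrinsic value of the split
    return ig / iv if iv > 0 else 0.0
```

A feature that perfectly separates seeds from non-seeds scores 1.0; one that is independent of the label scores 0.0, matching the zero-IG rows at the bottom of Table 4.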
15. Identifying Seed Posts

• What is the correlation between seed posts and features?

Fig. 3. Contributions of the top-5 features to identifying Non-seeds (N) and Seeds (S). Upper plots are for the Haiti dataset and the lower plots are for the Union Address dataset.
16. Identifying Seed
Posts
• Can we identify seed posts using the top-k
features?
17. Predicting
Discussion Activity
• From identified seed posts:
– Can we predict the level of discussion activity?
– How much activity will a post generate?
• [Wang & Groth, 2010] learn a regression
model and report on coefficients
– Identifying relationship between features
• We do something different:
– Predict the volume of the discussion
18. Predicting
Discussion Activity
19. Predicting Discussion Activity

• Compare rankings
– Ground truth vs predicted
• Experiments
– Using Haiti and Union Address datasets
– Evaluation measure: Normalised Discounted Cumulative Gain
• Assessing nDCG@k where k = {1, 5, 10, 20, 50, 100}
• Tested Support Vector Regression with:
– user, content, user+content features

Using the predicted values for each post we can then form a ranking, which is comparable to a ground truth ranking within our data. This provides a ranking of discussion activity levels, where posts are ordered by their expected volume. This approach also enables contextual predictions, where a post can be compared with existing posts that have produced lengthy debates.

5.1 Experiments

Datasets. For our experiments we used the same datasets as in the previous section: tweets collected from the Haiti crisis and the Union Address speech. We maintain the same splits as before - training/validation/testing with a 70/20/10 split - but without using the test split. Instead we train the regression models using the seed posts in the training split and then test the prediction accuracy using the seed posts in the validation split; seed posts in the validation set are identified using the J48 classifier trained using both user and content features. Table 5 describes our datasets for clarification.

Table 5. Discussion volumes and distributions of our datasets

Dataset       | Train Size | Test Size | Test Vol Mean | Test Vol SD
Haiti         | 980        | 210       | 1.664         | 3.017
Union Address | 5,067      | 1,161     | 1.761         | 2.342
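The nDCG@k measure compares the predicted ranking of seed posts against the ideal ranking by true discussion volume. A minimal sketch using the standard log2 discount (which may differ in detail from the authors' exact formulation):

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain of a ranked list of gains, truncated
    at rank k (log2 discount; rank 1 is effectively undiscounted)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(predicted_scores, true_volumes, k):
    """nDCG@k: rank posts by predicted score, take their true
    discussion volumes as gains, and normalise by the DCG of the
    ideal (volume-sorted) ranking."""
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked_gains = [true_volumes[i] for i in order]
    ideal_gains = sorted(true_volumes, reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0
```

A regression model whose predicted ordering matches the true volume ordering scores 1.0; misplacing high-volume posts low in the ranking pulls the score below 1.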
20. Predicting Discussion Activity

Table 6 lists the coefficients learnt for the user features. Although different coefficients are learnt for each dataset, for major features with greater weights the signs remain the same. In the case of the list-degree of the user, which yielded a high IGR during classification, there is a similar positive association with the discussion volume - the same is also true for the in-degree and the out-degree of the user. This indicates that a constant increase in the combination of a user's in-degree, out-degree, and list-degree will lead to increased discussion volumes. Out-degree plays an important role by enabling the seed post author to see posted responses - given that the larger the out-degree, the greater the reception of information from other users.

Table 6. Coefficients of user features learnt using Support Vector Regression over the two datasets. Coefficients are rounded to 4 dp.

Dataset | user-num-posts | user-out-degree | user-in-degree | user-list-degree | user-age | user-post-rate
Haiti   | -0.0019        | +0.0010         | +0.0016        | +0.0046          | +0.0001  | +0.0001
Union   | -0.0025        | +0.0114         | +0.0025        | +0.0154          | -0.0003  | -0.0002

6 Conclusions
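To make the role of the coefficients concrete: under a linear SVR model, a seed post's predicted discussion volume is a weighted sum of its user feature values. The sketch below plugs in the Haiti coefficients from Table 6; the model's bias term is not reported on the slide, so it is assumed to be 0 here, which makes the output illustrative only.

```python
# Haiti user-feature coefficients from Table 6 (rounded to 4 dp)
HAITI_WEIGHTS = {
    "user-num-posts":   -0.0019,
    "user-out-degree":   0.0010,
    "user-in-degree":    0.0016,
    "user-list-degree":  0.0046,
    "user-age":          0.0001,
    "user-post-rate":    0.0001,
}

def predicted_volume(features, weights, bias=0.0):
    """Linear SVR prediction: bias + sum of weight * feature value.
    The bias is an assumption; only the weights come from Table 6."""
    return bias + sum(weights[name] * value
                      for name, value in features.items())
```

For example, 100 extra units of list-degree add 100 * 0.0046 = 0.46 to the predicted volume - the largest positive effect among these features, consistent with list-degree topping the IGR ranking.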
21. Findings
• User reputation and standing are crucial for:
– eliciting a response
– starting a discussion
• Greater broadcast capability = greater
likelihood of response
– More listeners = more discussion
• Activity levels influenced by out-degree
– Allow the poster to see response from
‘respected’ peers
22. Conclusions
• Pre-empt discussions to empower
– Market analysts
– Opinion mining
– eGovernment policy makers
• Behaviour ontology
– Captures impact across platforms
• Approach accurately predicts:
– Which posts will yield a reply, and
– The level of discussion activity
23. Current and Future
Work
• Experiments over a forum dataset
– Content features >> user features
– Different platform dynamics
• Extend experiments to a random Twitter
dataset
• Extension to behaviour ontology
– Captures concentration
– i.e. focus of a user on specific topics
• Categorising users by role
– Based on observed behaviour
Speaker notes:
• Large-scale uptake of smart phones; provision of a range of applications
• How can I track discussions? What features are correlated with discussions?
• User features more important than content features (significant at alpha = 0.01); user reputation is therefore more important than the quality of content!
• Over the training split; Pearson correlation: 0.674 = fair agreement
• Ran experiments using J48 with all features, testing on the test split
• Fitted a Kolmogorov-Smirnov goodness-of-fit test
• For Haiti: content features are more important than user features at higher ranks; user features show greater performance as ranks are increased. Union: user features outperform content for all ranks apart from the top. This indicates the differences in the dynamics of the data.