Factors Impacting Attention in Online Forums Vary by Community
1. Ignorance isn't Bliss: An Empirical Analysis of
Attention Patterns in Online Communities
Claudia Wagner, Matthew Rowe, Markus Strohmaier and Harith Alani
Amsterdam, 16.4.2012
3. Motivation
Which factors impact how much attention a post gets?
We use the number of replies as a proxy measurement of attention
4. Research Questions
Which factors impact the attention level a post
gets in certain community forums?
How do these factors differ between individual
community forums?
5. Methodology
Empirical study of attention patterns in 20
randomly selected forums
Two-stage approach
Differentiate between threadstarter posts that got at
least one reply (seed posts) and threadstarter posts
which got no replies at all (non-seed posts)
Predict the level of attention that seed posts will
generate - i.e. the number of replies
8. Feature Engineering
Aim
Identify the features that impact upon seeding a
discussion
Identify features associated with seed posts that
generate the most attention
Five Feature Groups
9. Five Feature Groups
User Features
user account age, post count, in-degree, out-degree, post rate
Content Features
post length, complexity, readability, link count, time in day,
informativeness, polarity
Title Features
Length, question marks, linguistic dimensions (LIWC)
Focus Features
Forum entropy, forum likelihood, topic entropy, topic likelihood, topic
distance
Community Features
Topical community fit, topical community distance, evolution score,
inequity score
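As an illustration of one focus feature, forum entropy can be read as the Shannon entropy of a user's posts across forums. That reading is an assumption, since the slides do not give the formula; the function name and sample data below are hypothetical. A minimal sketch:

```python
import math
from collections import Counter


def forum_entropy(forum_ids):
    """Shannon entropy (in bits) of a user's posts across forums.

    Assumption: low entropy means the user concentrates on few forums,
    high entropy means the user's posts are spread over many forums.
    """
    counts = Counter(forum_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) summed over all forums the user posted in
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical users: one posts only in Golf, one spreads over four forums.
focused = forum_entropy(["golf"] * 10)
spread = forum_entropy(["golf", "jobs", "astronomy", "spanish"])
```

Under this reading, the focused user gets an entropy of 0 and the user who posts evenly in four forums gets the maximum for four forums, 2 bits.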
10. Feature Computation
For each threadstarter post published in one of the
20 randomly selected forums in 2006 we
computed our 28 features
Features computed over the 6-month window before each post
Fit LDA model with standard parameters:
T=50, beta=0.01, alpha=50/T
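The speaker notes state that features were computed over a 6-month window preceding each post. A minimal sketch of two windowed user features (post count and post rate) could look as follows. The 182-day approximation of six months, the per-day rate, and all names are assumptions for illustration, not the authors' implementation.

```python
from datetime import datetime, timedelta

# Assumption: "6 months" approximated as 182 days.
WINDOW = timedelta(days=182)


def user_window_features(post_time, author_history):
    """Count the author's earlier posts inside the window before post_time
    and derive a posts-per-day rate over that window."""
    start = post_time - WINDOW
    in_window = [t for t in author_history if start <= t < post_time]
    post_count = len(in_window)
    return {"post_count": post_count, "post_rate": post_count / WINDOW.days}


# Hypothetical author history; only the two 2006 posts fall inside the window.
history = [datetime(2005, 6, 1), datetime(2006, 1, 10), datetime(2006, 2, 20)]
features = user_window_features(datetime(2006, 3, 1), history)
```

Only posts strictly before the threadstarter post are counted, so the feature never uses information from the future.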
11. Seed Post Identification
Experiment
Identify Posts which got replies (Binary
Classification Task)
Split data of each forum into train and test data
(80/20)
Train a logistic regression classifier with each feature
group in isolation and all features combined
Compare performance by using F1 score and the
Matthews correlation coefficient (MCC)
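Both evaluation measures named above can be computed from the confusion matrix of a binary classifier. The sketch below uses the standard definitions of F1 and MCC; the label vectors are made-up illustration data, not results from the study.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = seed post."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(y_true, y_pred):
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical test-set labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
f1 = f1_score(y_true, y_pred)
corr = mcc(y_true, y_pred)
```

Unlike F1, MCC also takes true negatives into account, which is why it remains informative when seed and non-seed posts are of very different sizes.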
12. Seed Post Identification
Results
For these 9 forums our classifiers outperform the random baseline:
Astronomy & Space: a classifier trained with content features alone
performs best
Spanish: a classifier trained with title features alone performs best
13. Seed Post Identification
Feature Impact
Analyze impact of individual features rather than
groups
Interpret statistically significant coefficients of the best
performing feature group learned by the logistic
regression model
Rank the features of the best performing feature
group using the Information Gain Ratio (IGR) as a
ranking criterion
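The ranking criterion mentioned above can be sketched for a discretised feature as information gain divided by the feature's own entropy (split information). This is the textbook definition of the gain ratio; how the authors discretised continuous features is not stated in the slides, so the binning in the example is an assumption.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in Counter(values).values())


def information_gain_ratio(feature_values, labels):
    """Information gain of the feature about the labels, normalised by the
    entropy of the feature itself (split information)."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical data: a binned title length that separates seed (1) from
# non-seed (0) posts perfectly, and a feature that carries no information.
labels = [1, 1, 0, 0]
igr_informative = information_gain_ratio(["short", "short", "long", "long"], labels)
igr_useless = information_gain_ratio(["a", "b", "a", "b"], labels)
```

Features would then be ranked by descending gain ratio within the best-performing feature group.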
14. Seed Post Identification
Observations
In the Spanish community, title length is the most
important feature (IGR=0.558, coef=-0.326)
Posts with long titles are less likely to get replies
In the Bank & Insurance forum short but complex
posts which are authored by newbies are most
likely to get replies
Content length coef=-0.017, p<0.05
Topic distance coef=2.890, p<0.01
Complexity has highest IGR (IGR=0.354)
15. Seed Post Identification
Observations
Number of links has a negative impact in the
Work & Jobs and Golf forums, but a positive impact in the
Astronomy & Space forum
Purpose of community
Links have a positive impact in content and
information driven communities
Links have a negative impact in other communities
16. Seed Post Identification
Observations
Some communities require posts to fit to the topics
they usually discuss (e.g., Golf) while others are
more open to diverse topics (e.g., Work & Jobs)
Specificity of community’s subject
Subject of Work & Jobs forum is very general: high
topical community distance has a positive impact
Subject of Golf forum is very specific: high
topical community distance has a negative impact
17. Activity Level Prediction
Experiment
Identify the features that were correlated with
lengthy discussions
Rank posts according to their attention level
Evaluate our predicted rank using normalized
Discounted Cumulative Gain (nDCG) at varying
rank positions i.e. top-k where k={1, 5, 10, 20, 50,
100}
nDCG = DCG of the predicted ranking divided by
DCG of the actual ranking
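The definition above can be sketched in a few lines. The sketch assumes that a post's gain is its reply count and that positions are discounted by log2(rank + 1), which is the common formulation; the slides do not state which variant was used, and the scores and reply counts are made up.

```python
import math


def dcg(gains, k):
    """Discounted cumulative gain of the first k items (rank 1 is undiscounted)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(predicted_scores, reply_counts, k):
    """DCG of the predicted ranking divided by DCG of the actual ranking."""
    order = sorted(range(len(reply_counts)),
                   key=lambda i: predicted_scores[i], reverse=True)
    predicted = [reply_counts[i] for i in order]
    ideal = sorted(reply_counts, reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(predicted, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical seed posts with their true reply counts.
reply_counts = [12, 3, 7, 1]
perfect = ndcg_at_k([0.9, 0.2, 0.5, 0.1], reply_counts, k=2)
reversed_rank = ndcg_at_k([0.1, 0.5, 0.2, 0.9], reply_counts, k=2)
```

A model that orders the posts exactly by their reply counts reaches 1 at every k, while a model that places low-attention posts at the top is penalised most heavily at small k.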
18. Activity Level Prediction
Results
AVERAGED NORMALISED DISCOUNTED CUMULATIVE GAIN
A value of 1 indicates that the predicted ranking of posts perfectly matched their
real ranking.
19. Activity Level Prediction
Results
For the Astronomy & Space community content features were best
for identifying seed posts and are also best for ranking posts
according to the attention level they will generate.
20. Activity Level Prediction
Results
Golf forum (343)
Combination of all features worked best for identifying seed posts.
Focus features alone are best for ranking posts.
21. Activity Level Prediction
Results
Bank & Insurance forum (544)
Combination of all features worked best for identifying seed posts.
Community features alone are best for ranking posts.
22. Activity Level Prediction
Summary
Factors that impact discussion initiation often
differ from the factors that impact discussion
length
e.g. for the Golf community
Seed Posts = all features
Activity level = focus features
23. Activity Level Prediction
Summary
Factors that are associated with lengthy
discussion tend to be different for different
communities
Title length is the only feature which has a
small but significant positive impact across several
communities on the number of replies a post gets
Work & Jobs forum: title length coef=0.034, p<0.01
Satellite forum: title length coef=0.030, p<0.05
24. Conclusions (1)
Different community forums exhibit interesting
differences in terms of how attention is generated
Most attention patterns which we identified are
local and community-specific
“Global” patterns may highly depend on
composition of dataset
25. Conclusions (2)
Same features that have a positive impact on
the start of discussions in one community can
have a negative impact in another community
Example: number of links
Negative impact in most communities
Positive impact in information and content driven
communities
26. Conclusions (3)
Purpose of community and specificity of
community’s subject may impact their reply
behavior
Communities which have a supportive purpose are
most likely driven by different factors than
communities with an informational purpose.
Communities around very specific topics require posts
to fit to the topical focus. Communities around more
general topics do not have this requirement.
27. Limitations & Future Work
Correlation versus Causality
We cannot answer the "what would have happened if"
question with our approach
Controlled experiments where platform is manipulated
Most attention patterns are local. But how local?
Can we automatically identify the context in which
attention patterns may hold?
28. Attention patterns tend to be local and community-specific.
Ignoring communities’ idiosyncrasies isn’t bliss.
Experimental Setup
THANK YOU
claudia.wagner@joanneum.at
http://claudiawagner.info
src: http://adobeairstream.com/green/a-natural-predicament-sustainability-in-the-21st-century/
Editor's Notes
We randomly selected 20 forums that did not have low activity levels. One can see that the set of forums which we selected is very diverse and includes communities around very specific topics such as Golf or Astronomy & Space, communities around geographical locations such as Ripp of Ireland, and communities around very general topics such as Work & Jobs.
Since we were interested in exploring different factors, we had to develop feature groups that represent the factors which may impact users' reply behavior. We created 5 groups of features, each capturing a group of factors that may potentially impact users' communication behavior in certain community forums.
For example, if user features are important in a forum for predicting which posts will get replies, then in this forum it is more important who says something than what is said. That means discussion would be driven by social factors rather than topical factors.
On the other hand, if content features are most important in a forum, then posts need to show certain content characteristics in order to get replies.
Focus features are in a sense also user features, but they describe the topical and forum focus of a user. For some forums it might be necessary that a user has a strong topical focus (i.e. is likely to be an expert) in order to stimulate discussions, while in other forums novices might be more likely to get replies.
Community features describe relations between a post or its author and the community, e.g. a post might only get replies if it fits the interests of the community, or a user might be more likely to get replies if they have contributed to the community a lot (inequity theory).
We computed those features for every threadstarter post published in 2006 by using a 6-month window previous to when the post was published.
MCC is a balanced measure of the quality of binary classification and can be used even if the classes are of very different sizes. The MCC measure returns a value between -1 and +1, where 0 is no better than random prediction. The F1 score is frequently used by the IR community, while the MCC is more common in the ML community.
For 11 forums our classifier did not outperform (but only matched) the performance of the baseline. We assume that this happens because most of these 11 forums are rather inactive. Another potential explanation is that the discussion behaviour of these communities is in part rather random and/or driven by other, external factors which we could not take into account in our study. For example, the discussion behaviour of the communities around specific locations or regions might be impacted by spatial properties of users, while the discussion behaviour of the community around the Television forum seems to be mainly driven by external events (e.g. start of a new series). In most cases a combination of all features achieves the highest performance.
Besides the overall classification performance, we were also interested in analyzing the impact of individual features.
When analyzing the individual features we made a couple of interesting observations, such as
nDCG would be 1 if we predicted the real ranking position of each post. The measure penalizes elements that appear lower down although they should be higher up.
Best results for the Spanish forum. Worst results for 544 (Banking & Insurance & Pensions).
This indicates that it is important that a post’s content has certain characteristics (e.g. contains only few links) and fits the topical interests of the community in order to start a discussion. But afterwards it is important that the author of a post has a certain topical and/or forum focus in order to stimulate a lengthy discussion in this forum.
This indicates that for starting lengthy discussions in this forum it is important that the author of a post has a topical and/or forum focus.
This indicates that in this forum posts which fit the topical interests of the community have the potential to start lengthy discussions.
To summarize, our second experiment shows that
So let me start concluding my talk. What we learned from our empirical study was that...