Online social communities such as discussion boards and message boards are evolving fast, bringing together people with similar interests. From a social and anthropological standpoint, they are more interesting to study than Online Social Networks because they connect people from different backgrounds and histories, most often with no offline links.
Sociology offers various theories about how users behave in online forums. In this paper, we study the applicability of one such theory, “Participation Inequality”, to financial message boards. We consider each user's activity, network interaction structure, and posting content, and employ Machine Learning techniques to identify and cluster users exhibiting similar behavior and to infer their roles.
4. Objective
• Identify the roles users assume in these message board forums.
• Validate the “90-9-1 Rule for Participation Inequality” in the message board community.
5. Dataset
• Free message boards for US-listed stocks
• Time Period: January 2001 to June 2015
• Total Message Boards: 6,278
• Total Users: 52,558
• Total Posts: 5,624,024
6. Dataset Analysis
• Share of posts that initiate a thread: 30%
• 19% of users did not initiate any post.
• 80% of users initiated fewer than 20 posts.
7. Dataset Analysis
• Number of boards a user participated in:
• 56% of users are active on only 1 board.
• 90% of users are active on fewer than 20 boards.
8. Dataset Analysis
• Average response time of replies a user makes:
Average Response Time of Replies  Fraction of Users (%)
<1min  0.18
[1min,1h)  23.06
[1h,1d)  53.13
[1d,1w)  15.77
≥1w  7.86
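The response-time buckets on this slide can be reproduced from per-user averages. The following is a minimal Python sketch, not the authors' code: it assumes each user's average reply delay is already available in seconds, and the names `EDGES`, `bucket`, `bucket_shares` and the sample users are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Bucket edges in seconds: 1 minute, 1 hour, 1 day, 1 week.
EDGES = [60, 3600, 86400, 604800]
LABELS = ["<1min", "[1min,1h)", "[1h,1d)", "[1d,1w)", ">=1w"]

def bucket(avg_seconds):
    """Return the label of the half-open interval containing avg_seconds."""
    return LABELS[bisect_right(EDGES, avg_seconds)]

def bucket_shares(avg_response_seconds):
    """Map each bucket label to the percentage of users falling in it."""
    counts = Counter(bucket(s) for s in avg_response_seconds.values())
    total = len(avg_response_seconds)
    return {label: 100.0 * counts[label] / total for label in LABELS}

# Hypothetical per-user average reply delays, in seconds.
example = {"u1": 30, "u2": 60, "u3": 7200, "u4": 90000, "u5": 700000}
shares = bucket_shares(example)
```

`bisect_right` places a value equal to an edge into the bucket that starts at that edge, which matches the half-open intervals used on the slide.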
9. Dataset Analysis
• Number of posts across boards:
• 80% of posts were made on fewer than 200 boards.
• 1,000 of the 6,278 boards account for 90% of posts.
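One way to obtain concentration figures like these is to sort boards by post count and walk down the list until the target share is covered. This is a hedged sketch; `boards_needed` and the sample counts are hypothetical and not taken from the dataset.

```python
def boards_needed(posts_per_board, percent):
    """Smallest number of boards (busiest first) that cover `percent` % of all posts."""
    total = sum(posts_per_board)
    running = 0
    ranked = sorted(posts_per_board, reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return rank
    return len(posts_per_board)

# Hypothetical post counts for five boards.
counts = [500, 300, 100, 60, 40]
top80 = boards_needed(counts, 80)
top90 = boards_needed(counts, 90)
```

With these made-up counts the two busiest boards already cover 80% of posts, the same kind of skew the slide reports at a much larger scale.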
11. Features
1. Number of threads a user initiated over time
2. Number of replies a user made over time
3. Number of users a user replies to
4. Number of users who reply to a user
5. Number of boards a user is active on
6. Number of followers
7. Replier share: AVG[proportion of replies a user gets on a board]
8. Reply share: AVG[proportion of replies a user makes on a board]
9. Average response time
10. Volume of content the user posted
11. Number of links the user posted
Feature groups: Activity of User, User Network Structure, Content Related
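Features 7 and 8 can be read in more than one way. The sketch below assumes one interpretation: for every board where the user appears, take the user's fraction of that board's replies, then average over those boards. The record layout `(board, replier, target)`, the function name `average_share`, and the sample data are all hypothetical.

```python
from collections import defaultdict

def average_share(replies, user, position):
    """Average, over boards where `user` appears, of the user's fraction of replies.

    position=1 counts replies the user made (reply share);
    position=2 counts replies the user received (replier share).
    """
    total = defaultdict(int)  # replies per board
    mine = defaultdict(int)   # replies per board involving `user`
    for record in replies:
        board = record[0]
        total[board] += 1
        if record[position] == user:
            mine[board] += 1
    if not mine:
        return 0.0
    return sum(mine[b] / total[b] for b in mine) / len(mine)

# Hypothetical reply log: (board, replier, target).
replies = [
    ("AAPL", "ann", "bob"),
    ("AAPL", "ann", "cat"),
    ("AAPL", "bob", "ann"),
    ("AAPL", "cat", "ann"),
    ("MSFT", "ann", "bob"),
    ("MSFT", "bob", "cat"),
]
ann_reply_share = average_share(replies, "ann", 1)
bob_replier_share = average_share(replies, "bob", 2)
```

Averaging per board rather than pooling all replies keeps one very busy board from dominating the feature, which is presumably the intent of the AVG[...] notation.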
14. Feature Selection
• Step 1 – Feature Extraction
• Run Principal Component Analysis (PCA).
• Run K-means on the projected data and use the cluster assignments as labels.
• Step 2 – Feature importance using a Random Forest classifier
17. Elbow Plot
• Plot the Within Group Sum of Squares versus K, and look for the “elbow point” in the plot.
• The first clusters add much information (explain a lot of variance), but at some point the marginal gain drops, giving an angle in the graph.
• Choose the number of clusters after the last big drop.
• This “elbow” cannot always be unambiguously identified.
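The two quantities behind the elbow plot can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the study: `within_group_ss` assumes cluster labels are already available, and `pick_elbow` encodes "last big drop" with a hypothetical threshold `big` (a drop counts as big if it exceeds that fraction of the K=1 value). The sample values are made up.

```python
def within_group_ss(points, labels):
    """Sum of squared distances from each point to the mean of its group."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    total = 0.0
    for members in groups.values():
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(p, center)) for p in members
        )
    return total

def pick_elbow(wss_by_k, big=0.1):
    """Return the K reached by the last drop larger than `big` times the K=1 value.

    wss_by_k[0] is the value for K=1, wss_by_k[1] for K=2, and so on.
    """
    elbow = 1
    for i in range(1, len(wss_by_k)):
        if wss_by_k[i - 1] - wss_by_k[i] > big * wss_by_k[0]:
            elbow = i + 1
    return elbow

# Hypothetical within-group sums of squares for K = 1..6.
wss = [1000, 400, 150, 130, 120, 115]
k = pick_elbow(wss)
```

Because the threshold is arbitrary, a different `big` can move the chosen K, which is the ambiguity the last bullet warns about.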
18. Silhouette Coefficient
a(i) is the average dissimilarity of i with all other data within the same cluster.
b(i) is the lowest average dissimilarity of i to any other cluster, of which i is not a member.
s(i) = (b(i) - a(i)) / max{a(i), b(i)}
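The definitions of a(i) and b(i) combine into the silhouette coefficient s(i) = (b(i) - a(i)) / max(a(i), b(i)), which lies between -1 and 1. A minimal sketch follows, assuming Euclidean distance as the dissimilarity and at least two clusters; the function name and sample points are hypothetical.

```python
from math import dist

def silhouette(points, labels, i):
    """Silhouette coefficient of point i: (b - a) / max(a, b)."""
    own = labels[i]
    sums, counts = {}, {}
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j == i:
            continue  # a point is not compared with itself
        sums[lab] = sums.get(lab, 0.0) + dist(points[i], p)
        counts[lab] = counts.get(lab, 0) + 1
    if own not in counts:
        return 0.0  # common convention for a cluster containing only i
    a = sums[own] / counts[own]
    b = min(sums[lab] / counts[lab] for lab in counts if lab != own)
    return (b - a) / max(a, b)

# Hypothetical data: two tight, well-separated pairs.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = silhouette(points, [0, 0, 1, 1], 0)
bad = silhouette(points, [0, 1, 1, 1], 1)
```

A value near 1 means the point sits comfortably in its cluster, while a negative value, as for the deliberately mislabelled point in `bad`, suggests it belongs elsewhere.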
19. Feature Selection
• Train a Random Forest classifier using all the features and the labels assigned by K-means.
• Feature importance is defined as the total decrease in node impurity (weighted by the probability of reaching that node, which is approximated by the proportion of samples reaching that node), averaged over all trees of the ensemble.
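The importance definition above can be made concrete with a small worked sketch. It assumes Gini impurity as the impurity measure (the slide only says "node impurity") and represents each tree as a hypothetical list of `(feature, weighted_decrease)` pairs; the feature names and numbers are invented for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples reaching it."""
    n = len(parent)
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(parent) - children)

def feature_importance(trees, feature):
    """Average over trees of the summed decreases at nodes that split on `feature`."""
    per_tree = [sum(d for f, d in splits if f == feature) for splits in trees]
    return sum(per_tree) / len(per_tree)

# A node reached by 4 of 8 training samples, split perfectly into two classes.
decrease = weighted_decrease([0, 0, 1, 1], [0, 0], [1, 1], n_total=8)

# Hypothetical two-tree ensemble.
trees = [[("replies", 0.25), ("boards", 0.05)], [("replies", 0.15)]]
replies_importance = feature_importance(trees, "replies")
```

A split that leaves both children as mixed as the parent contributes zero, so features that never separate the K-means labels end up with low importance.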
20. Clustering Users
• Applied K-Means clustering with K=4.
• Ran 10 times with different seeds.
• 300 iterations per run.
Clusters User Count % of Users
Cluster 1 47295 91.7
Cluster 2 360 0.73
Cluster 3 3322 6.44
Cluster 4 581 1.13
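The run configuration above (fixed K, 10 seeded runs, 300 iterations per run) can be sketched with a plain-Python Lloyd's algorithm. This is not the authors' implementation: treating 300 as an upper bound and keeping the run with the lowest within-cluster sum of squares are both assumptions, since the slide does not say how the 10 runs were combined. All names and the sample points are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, k, seed, max_iter=300):
    """One K-means run: returns (within-cluster sum of squares, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct data points as initial centers
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return wss, labels

def kmeans(points, k, n_runs=10, max_iter=300):
    """Repeat Lloyd's algorithm with seeds 0..n_runs-1 and keep the lowest-WSS run."""
    return min(lloyd(points, k, seed, max_iter) for seed in range(n_runs))

# Hypothetical 2-D feature vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
wss, labels = kmeans(points, 2)
```

Restarting from several seeds matters because a single run can settle in a poor local optimum depending on where the initial centers fall.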
21. Cluster Analysis
Initiation of posts by users of each cluster
Clusters  Post Initiation Share (%)  Initiation Per User
Cluster 1  30  10.9
Cluster 2  22  1066.6
Cluster 3  45  228.2
Cluster 4  3  98.3
26. Role Inference
• Cluster 1: Lurkers
• Posts initiated per user and replies made per user are both very low.
• Cluster 2: Super Users
• Very active. Contribute most to the boards. Engage with a lot of users.
• Cluster 3: Contributors
• Account for 45% of total post initiations and 46% of total replies made. Have a low response time, meaning they respond very fast. Backbone of the forum.
• Cluster 4: Taciturns
• Keep to themselves. Initiate very few posts but reply often, mostly to users in their own cluster.
28. Conclusion
• Users take up different roles in online communities, and clusters of users can be identified by their behavioral patterns.
• Participation Inequality exists on financial message boards as well.
Are they:
reinforcing opinions (echo chambers),
showing confirmation bias (selective filtering: people filter out information that fails to support their opinions), etc.?
Why these features?
Sociologists have studied different kinds of roles.
Assign label to each feature. Reference paper for that.
Do PCA
Do K-means to get the labels and select best K
Do Decision Tree to get feature importance
Do K-means again to get final labels
Infer roles from the graphs
7 users moved from cluster 2 to cluster 3; the rest stayed the same.