User Behavior Modeling on
Financial Message Boards
Pritha D.N
Sahaj Biyani
December 9, 2015
Introduction
Investors Hub
Objective
• To identify the roles users assume on these message boards.
• To validate the “90-9-1 Rule for Participation Inequality” in the message board community.
Dataset
• Free message boards for US-listed stocks
• Time Period: January 2001 - June 2015
• Total Message Boards: 6,278
• Total Users: 52,558
• Total Posts: 5,624,024
Dataset Analysis
• Percentage of initiated posts: 30%
• 19% of users did not initiate any post.
• 80% of users initiated fewer than 20 posts.
Dataset Analysis
• Number of boards a user participated in:
• 56% of users are active on only 1 board.
• 90% of users are active on fewer than 20 boards.
Dataset Analysis
• Average response time of replies a user makes:

Average Response Time of Replies    Fraction of Users (%)
< 1 min                             0.18
[1 min, 1 h)                        23.06
[1 h, 1 d)                          53.13
[1 d, 1 w)                          15.77
≥ 1 w                               7.86
Dataset Analysis
• Number of posts across boards:
• 80% of posts were made on fewer than 200 boards.
• 1,000 of the 6,278 boards account for 90% of posts made.
Features
1. Number of threads a user initiated over time
2. Number of replies a user made over time
3. Number of users a user replies to
4. Number of users who reply to a user
5. Number of boards a user is active on
6. Number of followers
7. Replier share: AVG[proportion of replies a user gets on a board]
8. Reply share: AVG[proportion of replies a user makes on a board]
9. Average response time
10. Volume of content the user posted
11. Number of links the user has posted
Feature categories: Activity of User, User Network Structure, Content Related
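As a rough illustration of how the count-based features (1 to 5) could be derived, here is a minimal sketch. The record layout `(author, board, replied_to_author)` and the sample users are hypothetical, not the authors' schema, and the time dimension ("over time") is ignored for brevity.

```python
from collections import defaultdict

# Hypothetical post records: (author, board, replied_to_author or None).
# The layout is an assumption for illustration only.
posts = [
    ("alice", "AAPL", None),     # alice starts a thread
    ("bob", "AAPL", "alice"),    # bob replies to alice
    ("carol", "AAPL", "alice"),  # carol replies to alice
    ("alice", "TSLA", "bob"),    # alice replies to bob
    ("bob", "TSLA", None),       # bob starts a thread
]

def user_features(posts):
    """Compute features 1-5 (threads, replies, reply partners, boards) per user."""
    stats = defaultdict(lambda: {
        "threads": 0, "replies": 0,
        "replies_to": set(), "replied_by": set(), "boards": set(),
    })
    for author, board, parent in posts:
        s = stats[author]
        s["boards"].add(board)
        if parent is None:
            s["threads"] += 1
        else:
            s["replies"] += 1
            s["replies_to"].add(parent)
            stats[parent]["replied_by"].add(author)
    return {
        user: {
            "threads_initiated": s["threads"],
            "replies_made": s["replies"],
            "users_replied_to": len(s["replies_to"]),
            "users_replying": len(s["replied_by"]),
            "boards_active": len(s["boards"]),
        }
        for user, s in stats.items()
    }

features = user_features(posts)
print(features["alice"])
```

Sets are used for reply partners and boards so that repeated interactions with the same user or board are counted once, which matches the "number of users" and "number of boards" wording of the features.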
Methodology
• Data Preprocessing
• Feature Selection/Extraction
• Clustering
• Role Inference
Data Preprocessing
• We use min-max normalization.
• Each feature is rescaled to the range [0, 1].
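Min-max normalization maps each value x of a feature to (x - min) / (max - min). A minimal sketch for a single feature column follows; the sample values and the handling of a constant column are assumptions for illustration.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature has no spread; mapping it to 0.0 is an assumed convention.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical "replies made" counts for four users
replies_made = [0, 5, 20, 100]
normalized = min_max_normalize(replies_made)
print(normalized)  # [0.0, 0.05, 0.2, 1.0]
```

Because the scaling depends on the minimum and maximum, a few extremely active users compress everyone else toward 0, which is worth keeping in mind for heavy-tailed activity counts like these.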
Feature Selection
• Step 1 – Feature Extraction
• Run Principal Component Analysis (PCA)
• Run K-means on the projected data and extract the cluster labels
• Step 2 – Feature importance using a Random Forest classifier
Feature Extraction using PCA

Principal Component    % Variance    Cumulative % Variance
1                      62.16         62.16
2                      15.07         77.23
3                      7.95          85.18
4                      5.74          90.92
5                      3.57          94.49
6                      1.67          96.16
7                      1.48          97.64
8                      0.68          98.32
9                      0.59          98.91
10                     0.55          99.46
11                     0.54          100.00

[Figure: Scree Plot]
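The cumulative column is the running sum of the per-component variance percentages. The sketch below reproduces it and shows how many leading components are needed to reach a given share of variance. The 90% and 95% thresholds are illustrative assumptions; the slides do not state how many components were retained.

```python
from itertools import accumulate

# % variance per principal component, as listed in the table
variance_pct = [62.16, 15.07, 7.95, 5.74, 3.57, 1.67, 1.48, 0.68, 0.59, 0.55, 0.54]
cumulative = [round(c, 2) for c in accumulate(variance_pct)]

def components_needed(cumulative, threshold):
    """Smallest number of leading components whose cumulative % variance reaches threshold."""
    for n, total in enumerate(cumulative, start=1):
        if total >= threshold:
            return n
    return len(cumulative)

print(cumulative)
print(components_needed(cumulative, 90.0))  # 4
print(components_needed(cumulative, 95.0))  # 6
```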
Choosing the number of clusters
Elbow Plot
• Plot the Within Group Sum of Squares versus K, and look at the “elbow point” in the plot.
• The first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph.
• Choose the number after the last big drop.
• This "elbow" cannot always be unambiguously identified.
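To make the elbow idea concrete, here is a small sketch that computes the Within Group Sum of Squares for K = 1 to 6 on synthetic 2-D data with three well-separated groups. The data, the k-means++ style seeding, and the helper names are assumptions for illustration; this is not the authors' code or dataset.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def init_centroids(points, k, rng):
    """k-means++ style seeding (an assumed choice, not stated in the slides)."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def kmeans_wcss(points, k, seed, max_iter=300):
    """Run Lloyd's algorithm once and return the within-group sum of squares."""
    rng = random.Random(seed)
    centroids = init_centroids(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Synthetic data: three tight blobs of 30 points each in the unit square
data_rng = random.Random(0)
blob_centers = [(0.1, 0.1), (0.5, 0.9), (0.9, 0.2)]
points = [
    (cx + data_rng.uniform(-0.05, 0.05), cy + data_rng.uniform(-0.05, 0.05))
    for cx, cy in blob_centers
    for _ in range(30)
]

# Best of 10 seeded runs per K, mirroring a "run several times, keep the best" setup
wcss = {k: min(kmeans_wcss(points, k, seed) for seed in range(10)) for k in range(1, 7)}
for k, value in wcss.items():
    print(k, round(value, 3))
```

On data like this the values fall steeply up to K = 3 and flatten afterwards, so K = 3 would be read off as the elbow. On real data the bend is usually less sharp, which is why a second criterion is useful.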
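Silhouette Coefficient
a(i) is the average dissimilarity of point i to all other data within the same cluster.
b(i) is the lowest average dissimilarity of point i to any other cluster, of which i is not a member.
The silhouette coefficient of point i is s(i) = (b(i) - a(i)) / max{a(i), b(i)}.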
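A minimal sketch of the silhouette computation, using Euclidean distance as the dissimilarity (an assumption; the slides do not name the measure). The toy points and labels are made up for illustration.

```python
from math import dist

def silhouette_scores(points, labels):
    """Return s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point.

    Assumes at least two clusters and that not all points coincide.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention: a singleton cluster scores 0
            continue
        # a(i): mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores

# Two tight, well-separated toy clusters
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
scores = silhouette_scores(points, labels)
mean_score = sum(scores) / len(scores)
print(round(mean_score, 2))  # 0.9
```

Values near 1 indicate points that sit well inside their own cluster; averaging s(i) over all points for each candidate K gives a single number to compare alongside the elbow plot.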
Feature Selection
• Train a Random Forest classifier using all the features and the labels assigned by K-means.
• Feature importance is defined as the total decrease in node impurity (weighted by the probability of reaching that node, which is approximated by the proportion of samples reaching that node) averaged over all trees of the ensemble.
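The weighting in that definition can be illustrated on a single tree node. The sketch below uses Gini impurity (an assumption; the slides do not name the impurity criterion) and computes the weighted impurity decrease of one split. A feature's importance would then sum these terms over all nodes that split on it and average over the trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples reaching the node."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)

# Root node: all 8 samples reach it, and the split separates the two classes perfectly
root = weighted_impurity_decrease(
    parent=[0, 0, 0, 0, 1, 1, 1, 1], left=[0, 0, 0, 0], right=[1, 1, 1, 1], n_total=8
)

# Deeper node: only 4 of 8 samples reach it, and the split is imperfect
inner = weighted_impurity_decrease(
    parent=[0, 0, 1, 1], left=[0, 0, 1], right=[1], n_total=8
)

print(root, inner)
```

The deeper, less clean split contributes far less than the root split, which is the effect of weighting by the proportion of samples that reach a node.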
Clustering Users
• Applied K-means clustering with K = 4.
• Ran 10 times with different seeds.
• 300 iterations per run.

Cluster      User Count    % of Users
Cluster 1    47295         91.7
Cluster 2    360           0.73
Cluster 3    3322          6.44
Cluster 4    581           1.13
Cluster Analysis
Initiation of posts by users of each cluster

Cluster      Post Initiation Share    Initiation Per User
Cluster 1    30%                      10.9
Cluster 2    22%                      1066.6
Cluster 3    45%                      228.2
Cluster 4    3%                       98.3
Cluster Analysis
Replies by users of each cluster

Cluster      Reply Share    Reply Per User
Cluster 1    22%            17.5
Cluster 2    28%            2946.3
Cluster 3    46%            534.9
Cluster 4    4%             255.9
Cluster Analysis
[Figure: Inter-Cluster Reply %, one bar per cluster (Cluster 1 to Cluster 4); legend order: Cluster 1, Cluster 4, Cluster 2, Cluster 3. Data labels in extraction order: 34, 22, 24, 18, 20, 27, 11, 25, 41, 49, 22, 44, 4, 2, 42, 2]
Cluster Analysis
Feature 3: Number of users a user replies to
Cluster Analysis
Feature 4: Number of users who reply to a user
Role Inference
• Cluster 1: Lurkers
• Posts initiated per user and replies made per user are both very low.
• Cluster 2: Super Users
• Very active. Contribute the most per user. Engage with many users.
• Cluster 3: Contributors
• Account for 45% of total post initiations and 46% of total replies made. Respond very fast (low average response time). Backbone of the forum.
• Cluster 4: Taciturns
• Keep to themselves. Initiate very few posts but reply often, mostly to users in their own cluster.
Participation Inequality

Role            % of Users    Content Contributed (%)
Lurkers         91.73         24
Super-Users     0.73          26
Contributors    6.44          46
Taciturns       1.13          4
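One way to read these numbers is as a ratio of content share to user share per role, where 1.0 would mean perfectly proportional participation. The sketch below is plain arithmetic on the reported percentages; the ratio itself is an illustrative measure, not one used in the slides.

```python
# (role, % of users, % of content contributed), as reported on the slide
roles = [
    ("Lurkers", 91.73, 24),
    ("Super-Users", 0.73, 26),
    ("Contributors", 6.44, 46),
    ("Taciturns", 1.13, 4),
]

def contribution_ratio(user_pct, content_pct):
    """Content share divided by user share; 1.0 means proportional participation."""
    return content_pct / user_pct

ratios = {role: round(contribution_ratio(u, c), 2) for role, u, c in roles}

# Combined share of the two most active roles
active = [(u, c) for role, u, c in roles if role in ("Super-Users", "Contributors")]
active_user_pct = round(sum(u for u, _ in active), 2)
active_content_pct = sum(c for _, c in active)

print(ratios)
print(active_user_pct, active_content_pct)  # 7.17 72
```

On these figures, about 7% of users (Super-Users and Contributors) produce 72% of the content, while the roughly 92% classed as Lurkers produce 24%.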
Conclusion
• Users take up different roles in online communities, and clusters of users can be identified by their behavioral patterns.
• Participation inequality exists on financial message boards.
Thank You!


Editor's Notes

  1. Are they reinforcing opinions (echo chambers), showing confirmation bias (selectively filtering out information that fails to support their opinions), etc.?
  2. Why these features? Sociologists have studied different kinds of roles. Assign a label to each feature. Reference paper for that.
  3. Do PCA; do K-means to get the labels and select the best K; do Decision Tree to get feature importance; do K-means again to get the final labels; infer roles from the graphs.
  4. 7 users moved from cluster 2 to cluster 3; the rest stayed the same.