Learning to Classify Users in Online Interaction Networks

Presentation given at ICCSS 2015, Helsinki, Finland. It illustrates an approach for classifying users of OSNs solely based on their interactions with other users.

  • Topics
    Political/social attitudes
    News stories
    Geographical area
    User types/roles

    Useful for news search/discovery
    Potential privacy issues
  • Different kinds of user classification:
    topic-oriented (e.g., interest/expertise)
    role-based/behavioral (e.g., bot/spammer)
    geographical location
    Useful for advertising, user recommendation, expert search, etc.
    For personal accounts, user classification raises privacy concerns
    Challenges: multi-linguality, brevity, informal language
  • http://irevolution.net/2014/04/03/using-aidr-to-collect-and-analyze-tweets-from-chile-earthquake/
  • Learning to Classify Users in Online Interaction Networks

    1. Learning to Classify Users in Online Interaction Networks
        Georgios Rizos, Symeon Papadopoulos, and Yiannis Kompatsiaris
        Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI)
        ICCSS 2015, June 10, 2015, Helsinki, Finland
    2. User Classification
        Twitter Handle | Labels
        @nytimes | usa, press, new york
        @HuffPostBiz | finance
        @BBCBreaking | press, journalist, tv
        @StKonrath | journalist
        Examples from the SNOW 2014 dataset
    3. User Classification in (and outside) OSNs
        Diagram labels: OSN; online activities; log files; APIs; Behaviour Observation; Profiling/Classification
    4. Network-based User Classification
        • People with similar interests tend to connect (homophily)
        • Knowing about one’s connections could reveal information about them
        • Knowing about the whole network structure could reveal even more…
    5. Related Work: User Classification
        Graph-based semi-supervised learning:
        • Label propagation (Zhu and Ghahramani, 2002)
        • Local and global consistency (Zhou et al., 2004)
        • Empirical evaluation of many graph kernels (Fouss et al., 2012)
        Other approaches to user classification:
        • Hybrid feature engineering for inferring user behaviors (Pennacchiotti et al., 2011; Wagner et al., 2013)
        • Crowdsourcing Twitter list keywords for popular users (Ghosh et al., 2012)
        • Content-based, graph-regularized NMF for spammer detection (Hu et al., 2013)
    6. Related Work: Graph Feature Extraction
        First attempts at using community detection:
        • EdgeCluster: Edge-centric k-means (Tang and Liu, 2009)
        • MROC: Binary tree community hierarchy (Wang et al., 2013)
        Low-rank matrix representation methods:
        • Laplacian Eigenmaps: k eigenvectors of the graph Laplacian (Belkin and Niyogi, 2003; Tang and Liu, 2011)
        • Random-Walk Modularity Maximization: Does not suffer from the resolution limit of ModMax (Devooght et al., 2014)
        • Deepwalk: Deep representation learning (Perozzi et al., 2014)
    7. Overview of Framework
        Diagram labels: Online social interactions (retweets, mentions, etc.); Social interaction user graph; ARCTE; Partial/Sparse Annotation; Unsupervised graph feature representation; Supervised graph feature representation; Feature Weighting; User Label Learning; Classified Users
    8. Network Features using ARCTE
        • Based on user-centric community detection.
        • For each user, we extract two types of user-centric communities.
        • Base user-centric community: $c_v = N(v) \cup \{v\}$
        • Extended user-centric community: consider a vector $p_v$ that contains the similarity values between the seed user $v$ and all other users.
            – By truncating it appropriately, we keep a community of the users most similar to the seed $v$.
            – We keep the fewest possible users such that the seed user’s direct neighbors are still included.
        • Denote the set of detected communities by $C$. We form the feature matrix $X$ as follows:
          $x_{vi} = \begin{cases} 1, & \text{if } v \in c_i \\ 0, & \text{otherwise} \end{cases}, \quad \forall c_i \in C$
    9. ARCTE: Toy Example
    10. Fast Approximate User-centric PageRank
        • Given a seed user $v$, we calculate the user-centric PageRank vector (i.e. the stationary distribution of a random walk that always restarts at $v$).
        • The vector is localized and sparse; i.e. we neither propagate nor store trivial values.
        • Instead of approximating the PageRank vector, we approximate cumulative PageRank differences, which yields a better approximation with fewer iterations.
        • We alternate between two update rules:
            – Cumulative PR diff: $p^{(t+1)} = p^{(t)} + (1 - \rho)\, r^{(t-1)} W_u$ (instead of PR: $p^{(t+1)} = p^{(t)} + r^{(t)} I_u$ (Andersen et al., 2006))
            – Residual distribution: $r^{(t+1)} = r^{(t)} - r^{(t)} I_u + (1 - \rho)\, r^{(t)} W_u$
          where $\rho$ is the restart probability, $W_u$ is the $u$-th row of $W = D^{-1} A$, and $I_u$ is the $u$-th row of $I$.
        • Finally, we divide each element of $p$ by the degree of the corresponding user in order to get approximate, user-centric, regularized commute times.
    11. Community Weighting
        • We perform a supervised community weighting step to boost the importance of highly predictive communities.
        • For each community $i$ we calculate a weight: $w_i = \chi^2_i \times \mathrm{ivf}(i)$
        • The first factor is based on supervised chi-squared weighting, which quantifies the correlation of every feature-label pair.
            – PSNR aggregation across labels: $\chi^2_i = \dfrac{\max(\chi^2_{i,l}) - \min(\chi^2_{i,l})}{\text{within-label variability}}$
        • The second factor is the unsupervised inverse vertex frequency.
            – Consider idf with vertices as terms and communities as documents.
        • We multiply each column of $X$ by the corresponding weight.
    12. Evaluation: Dataset Description
        Dataset | Labels | Vertices | Vertex Type | Edges | Edge Type
        SNOW2014 Graph (Papadopoulos et al., 2014) | 90 | 533,874 | Twitter Account | 949,661 | Mentions + Retweets
        IRMV-PoliticsUK (Greene & Cunningham, 2013) | 5 | 419 | Twitter Account | 11,349 | Mentions + Retweets
        ASU-YouTube (Mislove et al., 2007) | 47 | 1,134,890 | YouTube Channel | 2,987,624 | Subscriptions
        ASU-Flickr (Tang and Liu, 2009) | 195 | 80,513 | Flickr Account | 5,899,882 | Contacts
        Ground truth generation:
        • SNOW2014 Graph: Twitter list aggregation & post-processing
        • IRMV-PoliticsUK: Manual annotation
        • ASU-YouTube: User membership to group
        • ASU-Flickr: User subscription to interest group
    13. Evaluation: SNOW 2014 dataset
        SNOW2014 Graph (534K vertices, 950K edges): Twitter mentions + retweets
        Ground truth based on Twitter list processing
    14. Evaluation: Insight Politics UK
        Insight-Multiview-PoliticsUK (419 vertices, 11K edges): mentions + retweets
        Ground truth based on manual annotation
    15. Evaluation: ASU-YouTube
        ASU-YouTube (1.1M vertices, 3M edges): YouTube subscriptions
        Ground truth based on membership to groups
    16. Evaluation: ASU-Flickr
        ASU-Flickr (80K vertices, 5.9M edges): Flickr contacts
        Ground truth based on membership to Flickr groups
    17. Evaluation: Community Weighting
    18. Conclusion
        • Key ideas:
            – new user feature representation based on user-centric communities
            – community weighting based on sparse annotations
            – consistently good performance both on interaction (mention/retweet) and affiliation (follow/subscribe) graphs
        • Future Work:
            – integration of additional signals (content)
            – investigating feasibility on other classification problems, e.g. spammer detection
    19. Thank you!
        • Resources:
            Slides: http://www.slideshare.net/sympapadopoulos/learning-to-classify-users-in-online-interaction-networks
            Code: https://github.com/MKLab-ITI/reveal-user-classification
                  https://github.com/MKLab-ITI/reveal-user-annotation
        • Get in touch:
            @sympapadopoulos / papadop@iti.gr
            @georgios_rizos / georgerizos@iti.gr
    20. References (1/3)
        • Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6), 1373-1396.
        • Tang, L., & Liu, H. (2011). Leveraging social media networks for classification. Data Mining and Knowledge Discovery, 23(3), 447-478.
        • Devooght, R., Mantrach, A., Kivimäki, I., Bersini, H., Jaimes, A., & Saerens, M. (2014, April). Random walks based modularity: application to semi-supervised learning. In Proceedings of the 23rd international conference on World wide web (pp. 213-224). International World Wide Web Conferences Steering Committee.
        • Perozzi, B., Al-Rfou, R., & Skiena, S. (2014, August). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 701-710). ACM.
        • Tang, L., & Liu, H. (2009, November). Scalable learning of collective behavior based on sparse social dimensions. In Proceedings of the 18th ACM conference on Information and knowledge management (pp. 1107-1116). ACM.
        • Wang, X., Tang, L., Liu, H., & Wang, L. (2013). Learning with multi-resolution overlapping communities. Knowledge and information systems, 36(2), 517-535.
    21. References (2/3)
        • Zhu, X., & Ghahramani, Z. (2002). Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University.
        • Zhou, D., Bousquet, O., Lal, T. N., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. Advances in neural information processing systems, 16(16), 321-328.
        • Fouss, F., Francoisse, K., Yen, L., Pirotte, A., & Saerens, M. (2012). An experimental investigation of kernels on graphs for collaborative recommendation and semisupervised classification. Neural Networks, 31, 53-72.
        • Pennacchiotti, M., & Popescu, A. M. (2011, August). Democrats, republicans and starbucks afficionados: user classification in twitter. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 430-438). ACM.
        • Ghosh, S., Sharma, N., Benevenuto, F., Ganguly, N., & Gummadi, K. (2012, August). Cognos: crowdsourcing search for topic experts in microblogs. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (pp. 575-590). ACM.
        • Hu, X., Tang, J., Zhang, Y., & Liu, H. (2013, August). Social spammer detection in microblogging. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence (pp. 2633-2639). AAAI Press.
        • Wagner, C., Asur, S., & Hailpern, J. (2013, September). Religious politicians and creative photographers: Automatic user categorization in twitter. In Social Computing (SocialCom), 2013 International Conference on (pp. 303-310). IEEE.
    22. References (3/3)
        • Andersen, R., Chung, F., & Lang, K. (2006, October). Local graph partitioning using pagerank vectors. In Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on (pp. 475-486). IEEE.
        • Papadopoulos, S., Corney, D., & Aiello, L. M. (2014). SNOW 2014 Data Challenge: Assessing the Performance of News Topic Detection Methods in Social Media. In SNOW-DC@WWW (pp. 1-8).
        • Greene, D., & Cunningham, P. (2013, May). Producing a unified graph representation from multiple social network views. In Proceedings of the 5th Annual ACM Web Science Conference (pp. 118-121). ACM.
        • Mislove, A., Marcon, M., Gummadi, K. P., Druschel, P., & Bhattacharjee, B. (2007, October). Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM conference on Internet measurement (pp. 29-42). ACM.
        • Tang, L., & Liu, H. (2009, June). Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 817-826). ACM.
    23. Auxiliary Slides
    24. Classifying Users using Network Structure
        • We apply user-centric community detection to the problem of graph-based user classification. We name our approach ARCTE.
        • Improved approximate, user-centric PageRank calculation for better local graph exploration.
        • Supervised community weighting step that boosts the importance of highly predictive communities in the feature representation.
        • Extensive comparative study of numerous state-of-the-art network feature extraction methods on several social interaction datasets.
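
The community definitions and the feature matrix on slide 8 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation (their code is linked on slide 19). The function names, the toy graph and the similarity values are all assumptions made for the example; the graph is taken to be undirected and stored as a dict of neighbour sets.

```python
def base_community(adjacency, seed):
    """Base user-centric community: the seed plus its direct neighbours."""
    return set(adjacency[seed]) | {seed}


def extended_community(similarity, adjacency, seed):
    """Shortest prefix of the similarity ranking that covers all neighbours.

    `similarity` maps users to their similarity with the seed (a hypothetical
    input; on the slides it comes from the user-centric PageRank step).
    """
    missing = set(adjacency[seed]) - {seed}
    community = {seed}
    ranked = sorted(similarity, key=similarity.get, reverse=True)
    for user in ranked:
        if not missing:
            break  # every direct neighbour is already covered
        community.add(user)
        missing.discard(user)
    # Neighbours without a similarity value are added explicitly.
    return community | missing


def membership_matrix(users, communities):
    """Binary matrix X with X[v][i] = 1 iff user v belongs to community i."""
    return [[1 if v in c else 0 for c in communities] for v in users]


# Toy undirected graph (an assumption for illustration only).
toy_graph = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
users = sorted(toy_graph)
communities = [base_community(toy_graph, v) for v in users]
X = membership_matrix(users, communities)

# Hypothetical similarity values of every user to the seed "a".
similarity_to_a = {"a": 0.5, "b": 0.3, "d": 0.2, "c": 0.1}
extended_a = extended_community(similarity_to_a, toy_graph, "a")
```

A dense list of lists is only practical for toy inputs; for graphs of the size listed on slide 12, a sparse matrix format would be the natural choice.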

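The local computation on slide 10 can be illustrated with a push-style sketch. It follows the baseline "PR" rule that the slide attributes to Andersen et al. (2006) together with the residual rule, not the authors' cumulative-difference variant. Three points are assumptions of this sketch rather than statements from the slides: the residual term in both rules is read as the scalar residual mass at the current user u, a user is processed only while its residual is at least `epsilon` times its degree, and users are visited in queue order.

```python
from collections import deque


def approximate_pagerank(adjacency, seed, rho=0.1, epsilon=1e-4):
    """Sparse, push-style approximation of user-centric PageRank.

    `epsilon` and the queue-based schedule are assumptions of this sketch.
    """
    p = {}             # sparse PageRank estimate
    r = {seed: 1.0}    # sparse residual; all mass starts at the seed
    queue = deque([seed])
    queued = {seed}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        r_u = r.get(u, 0.0)
        degree = len(adjacency[u])
        if degree == 0 or r_u < epsilon * degree:
            continue
        # PR rule: p <- p + r_u * I_u
        p[u] = p.get(u, 0.0) + r_u
        # Residual rule: r <- r - r_u * I_u + (1 - rho) * r_u * W_u
        r[u] = 0.0
        share = (1.0 - rho) * r_u / degree
        for w in adjacency[u]:
            r[w] = r.get(w, 0.0) + share
            if w not in queued and r[w] >= epsilon * len(adjacency[w]):
                queue.append(w)
                queued.add(w)
    return p


def degree_normalised(p, adjacency):
    """Divide each score by the user's degree (last bullet of slide 10)."""
    return {v: score / len(adjacency[v]) for v, score in p.items()}


# Toy path graph a - b - c (an assumption for illustration only).
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
p = approximate_pagerank(toy_graph, "a")
scores = degree_normalised(p, toy_graph)
```

Only users that actually receive a push appear in `p`, which is what keeps the vector localized and sparse. Each push removes a `rho` fraction of the pushed residual from circulation, so the loop terminates for any positive `epsilon`.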
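
Part of the community weighting on slide 11 can be sketched as well: the chi-squared statistic of a single feature-label pair and the final column scaling of X. The sketch uses the standard chi-squared statistic of a 2×2 contingency table, which is an assumption about how each pair is scored. The aggregation across labels and the inverse vertex frequency factor are deliberately left out, because the slide does not specify them precisely enough to reproduce. All names and the toy data are hypothetical.

```python
def chi2_pair(feature, label):
    """Chi-squared statistic of one binary feature against one binary label."""
    a = sum(1 for f, y in zip(feature, label) if f and y)          # both present
    b = sum(1 for f, y in zip(feature, label) if f and not y)      # feature only
    c = sum(1 for f, y in zip(feature, label) if not f and y)      # label only
    d = sum(1 for f, y in zip(feature, label) if not f and not y)  # neither
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # constant feature or constant label carries no signal
    return n * (a * d - b * c) ** 2 / denominator


def scale_columns(X, weights):
    """Multiply each column of X by the corresponding community weight."""
    return [[x * w for x, w in zip(row, weights)] for row in X]


# Toy data: four users, two communities, one label (illustration only).
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
columns = list(zip(*X))
weights = [chi2_pair(column, y) for column in columns]
X_weighted = scale_columns(X, weights)
```

In this toy case the first community coincides with the label and receives a weight of 4.0, while the second is independent of the label and is zeroed out.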