Learning to Classify Users in Online
Georgios Rizos, Symeon Papadopoulos, and Yiannis Kompatsiaris
Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI)
ICCSS 2015, June 10, 2015, Helsinki, Finland
Twitter Handle Labels
@nytimes usa, press,
Examples from SNOW 2014 dataset
User Classification in (and outside) OSNs
Network-based User Classification
• People with similar interests tend to connect
• Knowing about one’s connections
could reveal information
• Knowing about
the whole network
structure could reveal
Related Work: User Classification
Graph-based semi-supervised learning:
• Label propagation (Zhu and Ghahramani, 2002)
• Local and global consistency (Zhou et al., 2004)
• Empirical evaluation of many graph kernels (Fouss et al., 2012)
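To make the graph-based semi-supervised baseline concrete, here is a minimal sketch of iterative label propagation. It is a simplified variant in the spirit of Zhu and Ghahramani (2002), not their exact formulation: unlabeled users repeatedly average their neighbors' label scores while labeled users stay clamped. The function name, the adjacency-dict input and the fixed iteration count are assumptions made for illustration.

```python
def label_propagation(adj, seeds, n_labels, iterations=50):
    """adj: dict user -> list of neighbors; seeds: dict user -> known label."""
    # scores[v][l] is the current belief that user v carries label l.
    scores = {v: [0.0] * n_labels for v in adj}
    for v, label in seeds.items():
        scores[v][label] = 1.0
    for _ in range(iterations):
        updated = {}
        for v in adj:
            if v in seeds or not adj[v]:
                # Labeled users are clamped; isolated users keep their scores.
                updated[v] = scores[v]
            else:
                # Unlabeled users take the mean of their neighbors' scores.
                updated[v] = [
                    sum(scores[u][l] for u in adj[v]) / len(adj[v])
                    for l in range(n_labels)
                ]
        scores = updated
    # Predict the highest-scoring label for every user.
    return {v: max(range(n_labels), key=lambda l: scores[v][l]) for v in adj}
```

On a four-user path graph with the two end users labeled differently, each inner user takes the label of its nearer labeled end.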
Other approaches to user classification:
• Hybrid feature engineering for inferring user behaviors
(Pennacchiotti and Popescu, 2011; Wagner et al., 2013)
• Crowdsourcing Twitter list keywords for popular users
(Ghosh et al., 2012)
• Content-based, graph-regularized NMF for spammer detection
(Hu et al., 2013)
Related Work: Graph Feature Extraction
First attempts at using community detection:
• EdgeCluster: Edge centric k-means (Tang and Liu, 2009)
• MROC: Binary tree community hierarchy (Wang et al., 2013)
Low-rank matrix representation methods:
• Laplacian Eigenmaps: k eigenvectors of the graph Laplacian
(Belkin and Niyogi, 2003; Tang and Liu, 2011)
• Random-Walk Modularity Maximization: Does not suffer from
the resolution limit of ModMax (Devooght et al., 2014)
• Deepwalk: Deep representation learning (Perozzi et al., 2014)
Overview of Framework
Online social interactions
(retweets, mentions, etc.)
Network Features using ARCTE
• Based on user-centric community detection.
• For each user, we extract two types of user-centric communities:
• Base user-centric community: $c_v = N(v) \cup \{v\}$, i.e. the
direct neighbors $N(v)$ of $v$ together with $v$ itself
• Extended user-centric community: Consider a vector $p_v$ that
contains similarity values between the seed user $v$ and all
other users.
– By truncating appropriately, we can keep a community of the most
similar users to the seed 𝑣.
– We keep the fewest possible users such that we still include the seed
user’s direct neighbors.
• Denote the set of detected communities by $C$. We form the
feature matrix $X$ as follows:
$x_{vi} = \begin{cases} 1, & \text{if } v \in c_i \\ 0, & \text{otherwise} \end{cases} \quad \forall c_i \in C$
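The community definitions and the feature matrix above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the ARCTE implementation: the function names, the adjacency-dict graph and the similarity dict (standing in for the similarity vector of a seed user) are all assumptions.

```python
def base_community(adj, v):
    """Base user-centric community: the neighbors of v plus v itself."""
    return set(adj[v]) | {v}


def extended_community(similarity, adj, v):
    """Smallest set of most-similar users that still contains all of N(v).

    similarity: dict user -> similarity to the seed v (assumed to cover
    every neighbor of v).
    """
    ranked = sorted(similarity, key=lambda u: -similarity[u])  # most similar first
    position = {u: i for i, u in enumerate(ranked)}
    # Truncate right after the lowest-ranked direct neighbor of the seed.
    cut = max((position[u] for u in adj[v]), default=-1) + 1
    return set(ranked[:cut]) | {v}


def feature_matrix(communities, users):
    """Binary features: row v has a 1 in column i iff v is in community i."""
    return {v: [1 if v in c else 0 for c in communities] for v in users}
```

For example, on the path graph `{0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}` with seed 1 and similarities `{0: 0.2, 1: 1.0, 2: 0.3, 3: 0.25}`, the extended community also picks up user 3, because user 3 ranks above the neighbor 0 and the cut must still include 0.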
ARCTE: Toy Example
Fast Approximate User-centric PageRank
• Given a seed user $v$, we calculate the user-centric PageRank
vector (i.e. the stationary distribution of a random walk whose
restart distribution places probability 1 at $v$).
• Localized, sparse vector; i.e. we neither propagate nor store
• Instead of approximating the PageRank vector, we
approximate cumulative PageRank differences, which gives a
better approximation in fewer iterations.
• We alternate between two update rules:
– Cumulative PR diff: $p^{(t+1)} = p^{(t)} + (1 - \rho)\, r^{(t-1)} W_u$
(instead of PR: $p^{(t+1)} = p^{(t)} + r^{(t)} I_u$ (Andersen et al., 2006))
– Residual distribution: $r^{(t+1)} = r^{(t)} - r^{(t)} I_u + (1 - \rho)\, r^{(t)} W_u$
where $\rho$ is the restart probability,
$W_u$ is the $u$-th row of $W = D^{-1} A$ and $I_u$ is the $u$-th row of $I$
• Finally, we divide each element of $p$ by the degree of the
corresponding vertex in order to get approximate, user-centric,
regularized commute-times.
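For orientation, the sketch below shows a plain push-style approximation of the user-centric PageRank vector in the spirit of Andersen et al. (2006), followed by the degree normalization. It is a simplified, non-lazy variant and does not implement the cumulative-difference update described above. The function names, the tolerance `epsilon`, the default `rho` and the choice to add `rho` times the residual mass to the estimate are assumptions.

```python
from collections import deque


def approximate_pagerank(adj, seed, rho=0.2, epsilon=1e-4):
    """Sparse push approximation; adj maps each user to a neighbor list."""
    p = {}           # sparse PageRank estimate
    r = {seed: 1.0}  # sparse residual; all mass starts at the seed
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        mass = r.get(u, 0.0)
        degree = len(adj[u])
        # Only push from vertices whose residual is large for their degree.
        if degree == 0 or mass < epsilon * degree:
            continue
        p[u] = p.get(u, 0.0) + rho * mass
        r[u] = 0.0
        share = (1.0 - rho) * mass / degree
        for w in adj[u]:
            r[w] = r.get(w, 0.0) + share
            if r[w] >= epsilon * len(adj[w]):
                queue.append(w)
    return p, r


def degree_normalize(p, adj):
    """Divide each PageRank entry by the degree of its vertex."""
    return {u: value / len(adj[u]) for u, value in p.items()}
```

Only vertices that receive enough residual mass are ever touched, so both `p` and `r` stay sparse and localized around the seed.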
• We perform a supervised community weighting step to
boost the importance of highly predictive communities.
• For each community we calculate a weight:
$w_i = \chi^2_i \times \mathrm{ivf}(i)$
• The first factor is based on supervised chi-squared weighting,
which quantifies the dependence between each community feature
and each label.
– PSNR aggregation across labels: $\chi^2_{i,l} - \min(\chi^2_{i,l})$
• The second factor is unsupervised inverse vertex frequency.
– Consider idf with vertices as terms and communities as documents.
• We multiply each column of 𝑋 with the corresponding weight.
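The weighting step can be illustrated with a small pure-Python sketch. Several choices here are assumptions and not taken from the slides: the chi-squared statistic is the Pearson statistic of a 2x2 contingency table, aggregation across labels uses a simple maximum as a stand-in for the PSNR-style aggregation, and the inverse vertex frequency is read as log(number of users / community size). All function names are hypothetical.

```python
import math


def chi2_binary(feature, label):
    """Pearson chi-squared statistic of a 2x2 contingency table."""
    n = len(feature)
    a = sum(1 for f, y in zip(feature, label) if f and y)
    b = sum(1 for f, y in zip(feature, label) if f and not y)
    c = sum(1 for f, y in zip(feature, label) if not f and y)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def community_weights(X, Y, labeled):
    """X: users x communities, Y: users x labels (lists of 0/1 rows)."""
    n_users, n_communities, n_labels = len(X), len(X[0]), len(Y[0])
    weights = []
    for i in range(n_communities):
        column = [X[v][i] for v in labeled]
        # Assumption: aggregate across labels with max, not PSNR.
        chi2 = max(
            chi2_binary(column, [Y[v][l] for v in labeled])
            for l in range(n_labels)
        )
        size = sum(X[v][i] for v in range(n_users))
        # Assumption: ivf(i) = log(|V| / |c_i|).
        ivf = math.log(n_users / size) if size else 0.0
        weights.append(chi2 * ivf)
    return weights


def reweight(X, weights):
    """Multiply each column of X by its community weight."""
    return [[x * w for x, w in zip(row, weights)] for row in X]
```

Only the labeled users contribute to the chi-squared factor, while the inverse vertex frequency is computed over all users, so the second factor needs no annotations.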
Evaluation: Dataset Description
Dataset (source)                              Labels  Vertices   Vertex Type      Edges      Edge Type
SNOW2014 Graph (Papadopoulos et al., 2014)    90      533,874    Twitter Account  949,661    Mentions + Retweets
IRMV-PoliticsUK (Greene & Cunningham, 2013)   5       419        Twitter Account  11,349     Mentions + Retweets
ASU-YouTube (Mislove et al., 2007)            47      1,134,890  YouTube Account  approx. 3M Subscriptions
ASU-Flickr (Tang and Liu, 2009)               195     80,513     Flickr Account   5,899,882  Contacts
Ground truth generation:
• SNOW2014 Graph: Twitter list aggregation & post-processing
• IRMV-PoliticsUK: Manual annotation
• ASU-YouTube: User membership to group
• ASU-Flickr: User subscription to interest group
Evaluation: SNOW 2014 dataset
SNOW2014 Graph (534K, 950K): Twitter mentions + retweets
ground truth based on Twitter list processing
Evaluation: Insight Politics UK
Insight-Multiview-PoliticsUK (419, 11K): mentions + retweets
ground truth based on manual annotation
ASU-YouTube (1.1M, 3M): YouTube subscriptions
ground truth based on membership to groups
ASU-Flickr (80K, 5.9M): Flickr contacts
ground truth based on membership to Flickr groups
Evaluation: Community Weighting
• Key ideas:
– new user feature representation based on user-centric
– community weighting based on sparse annotations
– consistently good performance both on interaction
(mention/retweet) and affiliation (follow/subscribe)
• Future Work:
– integration of additional signals (content)
– investigating feasibility on other classification problems,
e.g. spammer detection
• Get in touch:
@sympapadopoulos / firstname.lastname@example.org
@georgios_rizos / email@example.com
• Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction
and data representation. Neural computation, 15(6), 1373-1396.
• Tang, L., & Liu, H. (2011). Leveraging social media networks for classification. Data
Mining and Knowledge Discovery, 23(3), 447-478.
• Devooght, R., Mantrach, A., Kivimäki, I., Bersini, H., Jaimes, A., & Saerens, M.
(2014, April). Random walks based modularity: application to semi-supervised
learning. In Proceedings of the 23rd international conference on World wide web
(pp. 213-224). International World Wide Web Conferences Steering Committee.
• Perozzi, B., Al-Rfou, R., & Skiena, S. (2014, August). Deepwalk: Online learning of
social representations. In Proceedings of the 20th ACM SIGKDD international
conference on Knowledge discovery and data mining (pp. 701-710). ACM.
• Tang, L., & Liu, H. (2009, November). Scalable learning of collective behavior based
on sparse social dimensions. In Proceedings of the 18th ACM conference on
Information and knowledge management (pp. 1107-1116). ACM.
• Wang, X., Tang, L., Liu, H., & Wang, L. (2013). Learning with multi-resolution
overlapping communities. Knowledge and information systems, 36(2), 517-535.
• Zhu, X., & Ghahramani, Z. (2002). Learning from labeled and unlabeled data with label
propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University.
• Zhou, D., Bousquet, O., Lal, T. N., Weston, J., & Schölkopf, B. (2004). Learning with local and
global consistency. Advances in neural information processing systems, 16(16), 321-328.
• Fouss, F., Francoisse, K., Yen, L., Pirotte, A., & Saerens, M. (2012). An experimental
investigation of kernels on graphs for collaborative recommendation and semisupervised
classification. Neural Networks, 31, 53-72.
• Pennacchiotti, M., & Popescu, A. M. (2011, August). Democrats, republicans and starbucks
afficionados: user classification in twitter. In Proceedings of the 17th ACM SIGKDD
international conference on Knowledge discovery and data mining (pp. 430-438). ACM.
• Ghosh, S., Sharma, N., Benevenuto, F., Ganguly, N., & Gummadi, K. (2012, August). Cognos:
crowdsourcing search for topic experts in microblogs. In Proceedings of the 35th
international ACM SIGIR conference on Research and development in information retrieval
(pp. 575-590). ACM.
• Hu, X., Tang, J., Zhang, Y., & Liu, H. (2013, August). Social spammer detection in
microblogging. In Proceedings of the Twenty-Third international joint conference on Artificial
Intelligence (pp. 2633-2639). AAAI Press.
• Wagner, C., Asur, S., & Hailpern, J. (2013, September). Religious politicians and creative
photographers: Automatic user categorization in twitter. In Social Computing (SocialCom),
2013 International Conference on (pp. 303-310). IEEE.
• Andersen, R., Chung, F., & Lang, K. (2006, October). Local graph
partitioning using pagerank vectors. In Foundations of Computer Science,
2006. FOCS'06. 47th Annual IEEE Symposium on (pp. 475-486). IEEE.
• Papadopoulos, S., Corney, D., & Aiello, L. M. (2014). SNOW 2014 Data
Challenge: Assessing the Performance of News Topic Detection Methods
in Social Media. In SNOW-DC@ WWW (pp. 1-8).
• Greene, D., & Cunningham, P. (2013, May). Producing a unified graph
representation from multiple social network views. In Proceedings of the
5th Annual ACM Web Science Conference (pp. 118-121). ACM.
• Mislove, A., Marcon, M., Gummadi, K. P., Druschel, P., & Bhattacharjee, B.
(2007, October). Measurement and analysis of online social networks. In
Proceedings of the 7th ACM SIGCOMM conference on Internet
measurement (pp. 29-42). ACM.
• Tang, L., & Liu, H. (2009, June). Relational learning via latent social
dimensions. In Proceedings of the 15th ACM SIGKDD international
conference on Knowledge discovery and data mining (pp. 817-826). ACM.
Classifying Users using Network Structure
• User-centric community detection applied to the problem
of graph-based user classification. We name our
• Improved approximate, user-centric PageRank
calculation for better local graph exploration.
• Supervised community weighting step that boosts
the importance of highly predictive communities in
the feature representation.
• Extensive comparative study of numerous state-of-
the-art network feature extraction methods on
several social interaction datasets.
Useful for news search/discovery
Potential privacy issues
Different kinds of user classification:
topic-oriented (e.g., interest/expertise)
role-based/behavioral (e.g., bot/spammer)
Useful for advertising,
expert search, etc.
For personal accounts,
user classification raises