H. Purohit, Y. Ruan, A. Joshi, S. Parthasarathy, A. Sheth. Understanding User-Community Engagement by Multi-faceted Features: A Case Study on Twitter. in SoME 2011 (Workshop on Social Media Engagement, in conjunction with WWW 2011), March 29, 2011.
Paper: http://knoesis.org/library/resource.php?id=1095
More on Social Media @ Kno.e.sis at http://knoesis.org/research/semweb/projects/socialmedia/
Understanding User-Community Engagement by Multi-faceted Features: A Case Study on Twitter
1. Understanding User-Community Engagement by Multi-faceted Features: A Case Study on Twitter. March 29, 2011, SoME 2011 (in conjunction with WWW 2011). Hemant Purohit (1), Yiye Ruan (2), Amruta Joshi (2), Srinivasan Parthasarathy (2), Amit Sheth (1). (1) Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis), Wright State University, USA; (2) Dept. of Computer Science & Engineering, Ohio State University, USA
2. Outline: (What) User-Community Engagement; (Why) Motivation; (How) Problem Formalization: Approach, Terminology, Definition; Analysis Framework: People-Content-Network Analysis (PCNA); Experiments: Datasets and Event Categorization, Features, Results; Insights; Conclusion & Future Work
3. User-Community Engagement: Multiple topics surrounding events are discussed on social media. Each topic constitutes a community of users discussing it, e.g., the Japan Earthquake community. How do we understand the phenomenon of user participation (engagement) in topic discussions? Image: http://itcilo.wordpress.com
4. Motivation: User Engagement Analysis. Business: How do communities form during a product launch? What factors attract users to engage in these communities and thereby spread the message further? Crisis Management: Effective communication: how quickly can we disseminate information between resource providers and people in need of resources?
5. Problem formalization: Approach. User engagement has been studied in many forms: Community Formation & Detection, Information Propagation, Link Prediction, etc. It involves a three-dimensional dynamic: Content: the topic of interest; People: the participants who engage in discussion about the topic; and Network: the community structure formed around the topic discussion. Rather than limiting ourselves to one dimension, we propose a multidimensional approach. Case study on Twitter.
6. Earlier Approaches: Content (topic of interest) OR People (participants of the discussion) OR Network (community around the topic). Images: tupper-lake.com/.../uploads/Community.jpg http://www.iconarchive.com/show/people-icons-by-aha-soft/user-icon.html
7. Our Approach: Content (topic of interest) AND People (participants of the discussion) AND Network (community around the topic). Images: tupper-lake.com/.../uploads/Community.jpg http://www.iconarchive.com/show/people-icons-by-aha-soft/user-icon.html
8. Problem formalization: Terminology. Event-Oriented Community: an implicit group of social network users who have joined the discussion (by posting messages) on a topic about an event. Slice: the collection of messages relevant to the topic of discussion, posted during a fixed-length time window. Snapshot: the state of the network at a certain point in time, at which user profile and connection information are crawled. Active Window: freshness matters! Active Community
9. Problem formalization: Definition. A binary classification problem for user-community link prediction. User Engagement Prediction Problem: Given 1) an event-oriented community C formed around a topic of discussion and 2) a Twitter user U ε C, predict whether U will be engaged in C (by composing a new tweet or retweeting an existing tweet that contains keywords or hashtags related to C's underlying event) in a future slice. If so, U is said to be a positive record; otherwise, it is a negative record.
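The positive/negative labelling described in this definition can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `is_engaged` and `label_records`, the whitespace tokenization, and the sample tweets are all assumptions made for the example.

```python
# Hypothetical labelling of records for the engagement prediction task.
# Names, tokenization, and sample data are assumptions, not from the slides.

def is_engaged(user_tweets, event_terms):
    """True if any tweet in the future slice contains an event keyword or hashtag."""
    terms = {t.lower() for t in event_terms}
    for text in user_tweets:
        # Crude tokenization: split on whitespace, trim common punctuation.
        tokens = {tok.strip(".,!?:;\"'").lower() for tok in text.split()}
        if tokens & terms:
            return True
    return False


def label_records(future_slice, event_terms):
    """Map each user to 1 (positive record) or 0 (negative record)."""
    return {user: int(is_engaged(tweets, event_terms))
            for user, tweets in future_slice.items()}


future_slice = {
    "alice": ["RT @bob: Watching the #EmmyAwards tonight!"],
    "carol": ["Coffee first, then work."],
}
labels = label_records(future_slice, ["#EmmyAwards", "emmys"])
```

Here a retweet counts the same as an original tweet, matching the definition's "composing a new tweet or retweeting an existing tweet".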
11. Experiments: Dataset & Event categorization. Study on Twitter data. Events have various characteristics, and we hypothesize that user engagement for them is affected by different variables. No standard event categorization is available, so we categorize the events by observing data over time, as follows: Global (G) vs. Local (L) [e.g., Japan Earthquake vs. Iowa State Fair]; Deterministic (D) vs. Unexpected (U) [e.g., Emmy Awards vs. Japan Earthquake]; Compact (C) vs. Loose (Ls) [e.g., ISWC conference vs. Japan Earthquake]; Transient (T) vs. Lasting (Lt) [e.g., President’s Speech vs. Egypt Revolution]
12. Events (labeled as per our categorization):
ClevelandShowPremiere: Second season premiere of the animated TV series Cleveland Show. September 26. Global, loose, deterministic, transient.
DiscoveryBuildingCrisis: Hostage crisis at the headquarters of Discovery Channel, Maryland. September 1. Local, loose, unexpected, transient.
EmmyAwards: 62nd Prime-time Emmy Awards. August 29. Global, loose, deterministic, lasting.
GoogleInstantSearch: Launch of Google Instant in the United States. September 8. Global, loose, unexpected, transient.
HeismanTrophy: Reggie Bush’s announcement to forfeit the 2005 Heisman Trophy. September 14. Local, compact, unexpected, lasting.
IowaStateFair: Iowa State Fair. August 12-22. Local, loose, deterministic, lasting.
JewishNewYear: Jewish New Year 5771. September 8-10. Global, compact, deterministic, transient.
LindsayLohanHearing: Lindsay Lohan’s hearing on probation revocation and verdict. September 24. Local, loose, deterministic, transient.
LinuxCon: Annual convention organized by the Linux Foundation. August 10-12. Global, compact, deterministic, lasting.
LondonTubeStrike: London tube strike. September 6. Local, loose, deterministic, transient.
RichCroninDeath: Death of singer and songwriter Rich Cronin. September 8. Local, loose, unexpected, transient.
ScottPilgrimRelease: Release of the movie Scott Pilgrim vs. the World. August 13. Global, loose, deterministic, lasting.
SESSanFrancisco: Search Engine Strategies 2010 at San Francisco. August 16-20. Global, compact, deterministic, lasting.
StuxnetWorm: Confirmation of the Stuxnet worm attack on the Iranian nuclear program. September 24. Global, loose, unexpected, lasting.
13. Experiments: Features. Organized in the PCNA framework: Node/Author features (P), Content features (C), Community features (N). Extracted for each potential community member (U) in each slice, where U belongs to the union of the follower lists of all active community members. Figure labels: Whole Topic Community; Active Community; Potential new Member (U); Followee of (U). Edge drawn between A and B if B follows A.
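The candidate set described on this slide (the union of the follower lists of all active community members) could be built roughly as follows. This is a hedged sketch: `candidate_members`, the `followers` mapping, and the `exclude_active` flag are hypothetical names, and the slides do not say whether already-active members are removed from the candidate pool, so that step is optional here.

```python
# Sketch of candidate selection; names and the exclude_active option are
# assumptions for illustration and are not taken from the slides.

def candidate_members(followers, active_members, exclude_active=False):
    """Union of follower sets of all active community members.

    followers: dict mapping a user to the set of users who follow them.
    active_members: iterable of users in the active community.
    """
    candidates = set()
    for member in active_members:
        candidates |= followers.get(member, set())
    if exclude_active:
        # Assumption: optionally drop users who are already active.
        candidates -= set(active_members)
    return candidates


followers = {
    "A": {"B", "C"},       # B and C follow A
    "D": {"C", "E", "A"},  # C, E and A follow D
}
pool = candidate_members(followers, ["A", "D"])
new_only = candidate_members(followers, ["A", "D"], exclude_active=True)
```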
14.
15. Experiments: Features (cont.). Content features [characteristics of tweets posted by active friends of U]: keywords: number of event-relevant keywords; hashtags: number of event-relevant hashtags; retweet: number of retweets; mention: number of mentions; url: number of relevancy-adjusted hyperlinks (an irrelevant hyperlink is given the number -1); subjectivity: subjectivity scores for words and emoticons; Linguistic Cues (LIWC [1] analysis): features for language usage, with the top-3 transformed features extracted using Principal Component Analysis (PCA). [1] http://www.liwc.net
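The count-based content features listed on this slide might be computed along the following lines. Everything here is an illustrative assumption: the function name `content_features`, the regular expressions, the "starts with RT @" retweet test, and the caller-supplied `is_relevant` predicate. The slide only states that an irrelevant hyperlink is given -1; scoring a relevant one as +1 is an assumption of this sketch. Subjectivity and LIWC features are omitted.

```python
import re

# Hypothetical extraction of count-based content features.
# Regexes, the retweet test, and the +1 score for relevant URLs are assumptions.

URL_RE = re.compile(r"https?://\S+")


def content_features(tweets, keywords, hashtags, is_relevant):
    kw = {k.lower() for k in keywords}
    ht = {h.lower() for h in hashtags}
    feats = {"keywords": 0, "hashtags": 0, "retweet": 0, "mention": 0, "url": 0}
    for text in tweets:
        lowered = text.lower()
        urls = URL_RE.findall(lowered)
        body = URL_RE.sub(" ", lowered)  # keep URL text out of keyword counts
        feats["keywords"] += sum(1 for w in re.findall(r"[a-z0-9']+", body) if w in kw)
        feats["hashtags"] += sum(1 for h in re.findall(r"#\w+", body) if h in ht)
        feats["mention"] += len(re.findall(r"@\w+", body))
        feats["retweet"] += 1 if body.startswith("rt @") else 0
        # Irrelevant links score -1 (from the slide); +1 for relevant is assumed.
        feats["url"] += sum(1 if is_relevant(u) else -1 for u in urls)
    return feats


tweets = [
    "RT @bob: Emmys red carpet is live #EmmyAwards http://example.com/emmys",
    "Lunch with @carol http://example.com/menu",
]
feats = content_features(tweets, {"emmys"}, {"#EmmyAwards"},
                         is_relevant=lambda url: "emmys" in url)
```

Note that the user named in an "RT @user" prefix is also counted as a mention in this sketch; a real pipeline might treat the two separately.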
16. Wait a minute! Not all content has been viewed! Novelty and Attention: a user is likely to see new or recent content (tweets) and then join the community; apply temporal weighting on the features. Dataset imbalance: too many negative records! Alleviated by the SMOTE method: over-sampling of positive records and under-sampling of negative ones. Not all users are active! Apply weighting on activity level based on last activity [1]. [1] Future work
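The slide does not specify the form of the temporal weighting. One common choice, shown here purely as an assumption, is exponential decay with a half-life, so that a tweet's contribution to a feature halves every `half_life_hours`. The function names and the 6-hour default are hypothetical.

```python
# Exponential-decay temporal weighting: an assumed form, not stated in the slides.

def temporal_weight(age_hours, half_life_hours=6.0):
    """Weight in (0, 1]; 1.0 for brand-new content, 0.5 after one half-life."""
    return 0.5 ** (age_hours / half_life_hours)


def weighted_count(events, now, half_life_hours=6.0):
    """Sum feature values, discounting older observations.

    events: iterable of (timestamp_hours, value) pairs.
    now: current time in the same hour units.
    """
    return sum(value * temporal_weight(now - ts, half_life_hours)
               for ts, value in events)


# Two retweets seen by a candidate user: one just now, one six hours ago.
score = weighted_count([(6.0, 1), (0.0, 1)], now=6.0)
```

With these inputs the recent retweet contributes 1.0 and the older one 0.5, so fresher content dominates the feature value.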
17. Experiments. We run the following experiment groups: allFeatures (All): contains all three feature groups; onlyContent (Con.): contains only content features; onlyAuthor (Aut.): contains only author features; onlyCommunity (Com.): contains only community features. SVM classifier: LibSVM, RBF kernel, gamma=8, c=32
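To give a feel for the gamma=8 setting, the RBF kernel used by LibSVM, K(x, y) = exp(-gamma * ||x - y||^2), can be evaluated directly. The snippet below is a plain-Python sketch of that formula only; it does not train a classifier, and the function name `rbf_kernel` and the sample vectors are chosen for illustration.

```python
import math

# Plain-Python evaluation of the RBF kernel formula; illustrative only.

def rbf_kernel(x, y, gamma=8.0):
    """K(x, y) = exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


same = rbf_kernel([0.2, 0.4], [0.2, 0.4])  # identical points
near = rbf_kernel([0.0], [0.5])            # squared distance 0.25
```

With gamma=8, two points at distance 0.5 already have similarity exp(-2), roughly 0.135, so the kernel is fairly local. How the features were scaled before training is not stated in the slides.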
18. Experiments: Results. Event-Type Summary of Prediction Accuracy (%). Statistically significant results are in bold.
19. Insights. The performance of onlyCommunity classifiers is the worst: the latent nature of network features makes them difficult for a user to perceive directly. The onlyContent classifiers give the best performance among the single feature groups: some users end up participating in a discussion after observing information on the public timeline, and these ad-hoc users are hard to observe via network analysis alone. Content is engaging by its quality and nature (information sharing, a call for action, or crowdsourcing). For example, a link to an image or video (evidential content) about Reggie Bush's surrender of the Heisman Trophy in September 2010 is likely to provoke many more thoughts in a user's mind and lead them to engage in the discussion.
20. Insights (Cont.). onlyAuthor classifiers perform comparably to onlyContent classifiers for some of the topics: this reflects the impact of the effective presence of influential people in the discussion group. Insufficiency in content features, reflected by low average connectivity, can be compensated by author features (e.g., Rich Cronin Death). Statistical significance testing shows that allFeatures classifiers perform better than or equivalently to any single feature group classifier for 12 out of 14 topics. The advantage of using all features is dominant where the degree of randomness in individual dimensions can be very high (e.g., Discovery Building attack).
21. Insights (Cont.). No significant correlation between the selection of feature groups and the event types lasting vs. transient: possibly due to a shift in event characteristics over time. The advantage of allFeatures over other feature groups is generally stronger on unexpected topics than on deterministic ones: the degree of randomness is high in discussions surrounding unexpected events.
22. Conclusion & Future Work. No single dimension (People, Content, Network) can be expected to perform well in all types of topic discussions; hence there is a strong need to study the dynamics of user engagement using the PCNA framework. Future work: experiments with a more refined event-type taxonomy and user engagement factors, considering shifts in event characteristics over time; semantic analysis of content to enhance content features; experiments on other social networks: forums, DBLP.
23. Questions? Paper at: http://knoesis.org/library/resource.php?id=1095 More on Social Media @ Kno.e.sis at http://knoesis.org/research/semweb/projects/socialmedia/
Editor's Notes
Active window: motivated by the concept of ‘novelty and attention’ in the design of today’s social media systems (the reason is information overload: there is too much for a user to keep up with).
Observe the data of the active window. Find the follower base of the active community members as the samples for the prediction problem. Analyze features for them and classify whether the follower U is going to join the community C.
Global (G) vs. Local (L): scale (how many people will it attract) [e.g., Japan Earthquake vs. Iowa State Fair]. Deterministic (D) vs. Unexpected (U): whether the event is expected [e.g., Emmy Awards vs. Japan Earthquake]. Compact (C) vs. Loose (Ls): whether people already know each other and so are tightly connected [e.g., ISWC conference vs. Japan Earthquake]. Transient (T) vs. Lasting (Lt): whether the event lasts for just a short time or a long time [e.g., President’s Speech vs. Egypt Revolution].
Essentially 3 major types of features. For the community surrounding the event: size, growth rate. For users: users that you follow, mention, or retweet from; users that share a large overlap of friends with you; users that share similar profiles. For tweet content [only from ‘followees’ of new users joining the community]: url, mentions, RT, #; linguistic analysis features.
Using style of writing for author; hashtag usage affects attention.