Eswc2013 audience short

  1. The Wisdom of the Audience: An Empirical Study of Social Semantics in Twitter Streams
     Claudia Wagner, Philipp Singer, Lisa Posch and Markus Strohmaier
     10th Extended Semantic Web Conference, Montpellier, 29.5.2013
  2. Problem
     #music #fashion
     Authors make their messages as informative as required but do not provide more information than necessary (Maxim of Quantity, Grice (1975)).
     [src: http://www.techweekeurope.co.uk/wp-content/uploads/2012/07/Twitter.jpg]
  3. Research Questions
     RQ 1: To what extent is the background knowledge of audiences useful for analyzing the semantics of social media messages?
     RQ 2: What are the characteristics of an audience which possesses useful background knowledge for interpreting the meaning of a stream's messages, and which types of streams tend to have useful audiences?
     [src: http://www.teachthought.com/twitter-hashtags-for-teacher/]
  4. Methodology
     Message classification task
       Use hashtags as ground truth
       Laniado and Mika (2010) showed that around half of all hashtags can be associated with Freebase concepts
     Compare the real audience with a random audience: how well can an audience predict the hashtag of a tweet?
       The audience which is better at guessing the hashtag of a Twitter message is better at interpreting the meaning of the message
     Null hypothesis: if the audience of a stream does not possess more knowledge about the semantics of the stream's messages than a randomly selected baseline audience, the results from both classification models should not differ significantly
  5. Methodology
     Train different multiclass classifiers on the background knowledge of the audience
       Logistic Regression, Stochastic Gradient Descent, Multinomial Naive Bayes and Linear SVM
     Compare different approaches for estimating the background knowledge
       Different audience and content selection approaches
       Different methods for estimating the background knowledge
     Test how well each model can predict the hashtag of future messages
       Weighted Macro F1
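The slide names Weighted Macro F1 as the evaluation measure but does not spell it out. The following is a minimal sketch, assuming the usual definition in which per-class F1 scores are averaged with each class weighted by its number of true instances; the function name and the toy labels are illustrative and not taken from the study.

```python
from collections import Counter


def weighted_macro_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its share of true labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives and number of predictions for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Hypothetical gold hashtags and predictions for four messages.
y_true = ["#nfl", "#nfl", "#iran", "#wow"]
y_pred = ["#nfl", "#iran", "#iran", "#nfl"]
score = weighted_macro_f1(y_true, y_pred)
```

Classes that are never predicted correctly contribute an F1 of zero, so a model that ignores rare hashtags is penalised in proportion to how often those hashtags occur.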
  6. Dataset
     Diverse sample of hashtags
     Romero et al. (2011) identified eight categories of hashtags on a large data sample
       celebrity, games, idioms, movies/TV, music, political, sports, and technology
     We randomly drew from each category ten hashtags which were still in use
  7. Dataset
     Technology: #blackbery, #iphone, #google
     Idioms:     #omgfacts, #factsaboutme, #iwish
     Sports:     #football, #nfl, #yankees
     Politics:   #climate, #iran, #teaparty
     Games:      #gaming, #mafiawars, #wow
     Music:      #lastfm, #eurovision, #nowplaying
     Celebrity:  #bsb, #michaeljackson, #rogis
     Movies:     #avatar, #tv, #glennbeck
  8. Dataset
     [Figure: crawl timeline with t0 = 3/4/2012, t1 = 4/1/2012, t2 = 4/29/2012. Labels at each time point: stream tweets, crawl of social structure, crawl of audience tweets. Interval label: 1 week]

                        t1           t2           t3
     Stream Tweets      94,634       94,984       95,105
     Stream Authors     53,593       54,099       53,750
     Friends            7,312,792    7,896,758    8,390,143
     Audience Tweets    29,144,641   29,126,487   28,513,876
  9. Audience Selection
     [Figure: stream authors A, B, C and their audience users, ranked 1, 2, 3]
     Stream:
       Team bc tryouts tomo #football
       What we learned this week: Chelsea are working in reverse and Avram is coming #football #soccer
       Weekend pleeeease hurrrrry #sanmarcos #football
       Holy #ProBowl Im spent for the rest of the day. #football
       Fifa warns Indonesia to clean up its football or face sanctions #Indonesia #Football
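The figure on this slide shows authors linked to ranked audience users but gives no ranking rule. The sketch below assumes one plausible reading: an audience user ranks higher the more stream authors they are connected to. The function name, the input mapping and the tie-breaking by user name are all hypothetical.

```python
from collections import Counter


def rank_audience(friends_of_author, top_n):
    """Rank audience users by the number of stream authors connected to them."""
    counts = Counter()
    for friends in friends_of_author.values():
        # Use a set so that each author counts at most once per audience user.
        counts.update(set(friends))
    # Sort by descending count, then by name so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [user for user, _ in ranked[:top_n]]


# Hypothetical social structure: authors A, B, C and their friends.
friends_of_author = {
    "A": ["u1", "u2", "u3"],
    "B": ["u1", "u2"],
    "C": ["u1"],
}
top_audience = rank_audience(friends_of_author, top_n=2)
```

A random baseline audience, as used in the null hypothesis, would replace the ranking step with a uniform sample of users of the same size.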
  10. Background Knowledge: Content Selection
      Recent
        The most recent messages authored by the audience users
      Top Links (plain and enriched)
        The messages authored by the audience which contain one of the top links of that audience
      Top Tags
        The messages authored by the audience which contain one of the top hashtags of that audience
  11. Background Knowledge: Representation
      Preprocessing: remove stopwords and Twitter syntax, apply stemming
      Represent the background knowledge of the audience via the most likely topics or most important words of their messages
        Bag of Words: TF and TF-IDF
        Topic Models: LDA
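To make the bag-of-words representation concrete, here is a minimal TF-IDF sketch in plain Python. It makes several assumptions: "Twitter syntax" is taken to mean mentions, URLs, the retweet marker and the `#` sign; the stopword list is a tiny illustrative one; stemming is omitted; and IDF is the unsmoothed `log(N / df)`. The slides do not state which variants were actually used.

```python
import math
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "is", "for", "of", "to", "and"}


def tokenize(text):
    """Lowercase, drop mentions, URLs and 'rt', strip punctuation and stopwords."""
    tokens = []
    for raw in text.lower().split():
        if raw.startswith(("@", "http")) or raw == "rt":
            continue
        word = "".join(ch for ch in raw if ch.isalnum())
        if word and word not in STOPWORDS:
            tokens.append(word)
    return tokens


def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


# Hypothetical audience messages.
docs = [
    "RT @fan the match is on http://t.co/x",
    "great match and great goal",
    "new phone for the office",
]
vectors = tfidf(docs)
```

Terms that occur in every document receive a weight of zero under this IDF variant, which is why smoothed variants are common in practice.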
  12. Empirical Evaluation
      RQ 1: To what extent does the background knowledge of the audience support the semantic annotation of individual messages?
      Combine audience selection and background knowledge estimation approaches to generate semantic features of the messages authored by an audience
      Train the model on the audience's messages crawled at t0
      Test the model using messages of the hashtag streams crawled at t1
  13. Results
                                      F1 (TF-IDF)   F1 (LDA)
      Random Guessing                 1/78          1/78
      Baseline (random audience)      0.01          0.01
      Audience - recent               0.25          0.23
      Audience - top links enriched   0.13          0.10
      Audience - top links plain      0.12          0.10
      Audience - top tags             0.24          0.21

      The audience of a hashtag stream contains knowledge which is useful for predicting the hashtags of future messages
  14. Results
      Category     F1 (TF-IDF)   F1 (LDA)
      celebrity    0.17          0.15
      games        0.25          0.22
      idioms       0.09          0.05
      movies       0.22          0.18
      music        0.23          0.18
      political    0.36          0.33
      sports       0.45          0.42
      technology   0.22          0.22
  15. Empirical Evaluation
      RQ 2: What are the characteristics of an audience which possesses useful background knowledge for interpreting the meaning of a stream's messages, and which types of streams tend to have useful audiences?
      Correlation analysis between the ability of an audience to interpret the meaning of messages and structural properties of the stream
  16. Structural Stream Properties
      Static Measures
        Coverage: informational, hashtag, retweet and conversational extent of a stream
        Entropy: randomness of a stream's authors and their followers, followees and friends
        Overlap: overlap between authors and followers, authors and followees, and authors and friends
      Dynamic Measures
        KL divergence between the author, follower and friend distributions of a stream at different time points
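The slide lists entropy, overlap and KL divergence without giving formulas. The sketch below shows one way to compute them, under stated assumptions: entropy is Shannon entropy normalised to [0, 1], overlap is the Jaccard coefficient of two user sets, and KL divergence uses a small additive smoothing term `eps` so that users missing from one time point do not cause a division by zero. All function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def distribution(items):
    """Relative frequency of each item, e.g. tweets per author."""
    counts = Counter(items)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def normalized_entropy(items):
    """Shannon entropy divided by its maximum; 0 = focused, 1 = uniform."""
    probs = list(distribution(items).values())
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(len(probs))


def overlap(users_a, users_b):
    """Jaccard overlap between two user sets (assumed definition)."""
    a, b = set(users_a), set(users_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def kl_divergence(p_items, q_items, eps=1e-9):
    """KL(P || Q) in bits, with additive smoothing over the joint support."""
    p, q = distribution(p_items), distribution(q_items)
    support = set(p) | set(q)
    norm = 1.0 + eps * len(support)
    total = 0.0
    for key in support:
        p_k = (p.get(key, 0.0) + eps) / norm
        q_k = (q.get(key, 0.0) + eps) / norm
        total += p_k * math.log2(p_k / q_k)
    return total


# Hypothetical author lists (one entry per tweet) at two time points.
authors_t0 = ["ann", "ann", "bob"]
authors_t1 = ["ann", "bob", "bob"]
drift = kl_divergence(authors_t0, authors_t1)
```

Under these definitions, a low entropy indicates a stream dominated by a few users, and a low KL divergence between time points indicates a socially stable stream.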
  17. Statistically Significant Spearman Rank Correlations (p < 0.05)
                                F1 (TF-IDF)   F1 (LDA)
      Overlap Author-Follower   0.675         0.655
      Overlap Author-Followee   0.642         0.628
      Overlap Author-Friend     0.612         0.602

      Streams which are produced and consumed by a community of tightly interconnected users tend to have a useful audience.
      A useful audience possesses background knowledge which helps to interpret the meaning of messages.
  18. Statistically Significant Spearman Rank Correlations (p < 0.05)
                              F1 (TF-IDF)   F1 (LDA)
      Conversation Coverage   0.256         0.256

      Conversational streams tend to have a useful audience.
  19. Statistically Significant Spearman Rank Correlations (p < 0.05)
                                      F1 (TF-IDF)   F1 (LDA)
      Entropy Author Distribution     -0.270        -0.400
      Entropy Friend Distribution     -0.307        -
      Entropy Follower Distribution   -0.400        -0.319
      Entropy Followee Distribution   -0.401        -0.368

      Streams which are produced and consumed by a focused set of authors, followers, followees and friends tend to have a useful audience.
  20. Statistically Significant Spearman Rank Correlations (p < 0.05)
                                 F1 (TF-IDF)   F1 (LDA)
      KL Follower Distribution   -0.281        -
      KL Followee Distribution   -0.343        -0.302
      KL Author Distribution     -0.359        -0.307

      Socially stable streams tend to have an audience which is good at interpreting the meaning of a stream's messages.
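The correlation tables above report Spearman rank correlations between structural stream properties and F1 scores. As a minimal sketch of how such a coefficient is obtained, the code below ranks both variables (ties receive their average rank) and computes the Pearson correlation of the ranks. It does not compute the p-value used for the significance threshold, and the per-stream numbers are made up for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-stream values: author-follower overlap and F1 score.
stream_overlap = [0.1, 0.4, 0.2, 0.8]
stream_f1 = [0.05, 0.30, 0.10, 0.45]
rho = spearman(stream_overlap, stream_f1)
```

Because only ranks enter the computation, the coefficient captures any monotonic relationship, not just a linear one, which suits measures on very different scales such as entropy and F1.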
  21. Summary & Conclusions
      The audience of a social stream possesses knowledge which may indeed help to interpret the meaning of a stream's messages
      But not all streams have similarly useful audiences
      The audience of a social stream seems to be most useful if the stream is created and consumed by a stable, focused and communicative community, i.e., a group of users who are interconnected and have a few core users to whom almost everyone is connected
      We do not know if those relations are causal, but we got similar results when repeating our experiments on t1 and t2
  22. Current and Future Work
      Compare the utility of ontological knowledge with audience background knowledge for the hashtag prediction task
      Algorithmic exploitation of our results
        Hybrid hashtag recommendation algorithm
        Structural stream measures may inform weighting (how much can we count on the audience?)
        Differentiate between social and topical hashtags
      User-centric algorithms work only for active users who have used hashtags before
      An audience-integrated approach only requires an active audience
  23. References
      Grice, H. P. (1975). Logic and conversation. In Speech acts, 3, 41-58. New York: Academic Press.
      Laniado, D., & Mika, P. (2010). Making sense of twitter. In Proceedings of the 9th international semantic web conference (pp. 470-485). Shanghai, China.
      Romero, D. M., Meeder, B., & Kleinberg, J. (2011). Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter. In Proceedings of the 20th international conference on world wide web (pp. 695-704). Hyderabad, India.
  24. Experimental Setup
      [src: http://adobeairstream.com/green/a-natural-predicament-sustainability-in-the-21st-century/]
      THANK YOU
      claudia.wagner@joanneum.at
      http://claudiawagner.info
      [src: http://www.crowdscience.com/2008/06/tips_and_more/]
