Automatic generation of event summaries using microblog streams

1. “Twitsum”: Automatic generation of event summaries using microblog streams
   P.K.K. Madhawa, 2012MCS044

2. Motivation - The problem with Twitter search
   ● Twitter ranks tweets based on user interaction with them (number of retweets, favorites)
   ● Top results for the query ‘Ebola’ (25th November 2014)
   ● How to distinguish newsworthy tweets drowned in a sea of noise?

3. Goal
   ● Distinguish newsworthy tweets based on syntactic features, without depending on manual annotations
   ● Group tweets discussing similar content together

4. Contributions
   ● A heuristic-based scheme for annotating tweets as subjective/objective
   ● A classifier capable of detecting objective tweets using only the syntactic information of tweets
   ● An entity-centric tweet clustering algorithm

5. Twitter summarization - Earlier approaches
   Sub-event detection based methods
   ● Use of a Hidden Markov Model to detect sub-events during an American football match (D. Chakrabarti and K. Punera, 2011)
   ● Sub-event detection by identifying outlier peaks in the temporal distribution of tweets on a topic (Zubiaga et al., 2012)
   Clustering based approaches
   ● A support platform for event detection using social intelligence (T. Baldwin, P. Cook and B. Han, 2012)
     ○ Tweets are filtered using manually selected keywords

6. Design
   ● Tweet storage - stores the set of tweets downloaded using the Streaming API
   ● Classifier - selection of objective tweets
   ● Summarizer - removes duplicates and clusters the tweets based on their similarity

7. Design - Objectivity detection
   ● Tweets are periodically downloaded by querying the public timeline using the Streaming API
   ● Structure of a tweet object: tweet text, user name, created time, geo location, language code, favorite count, retweeted_status, retweet count

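The Streaming API delivers each tweet as a JSON object. The sketch below pulls out the fields listed above, assuming the standard Twitter v1.1 field names (text, user.screen_name, created_at, geo, lang, favorite_count, retweeted_status, retweet_count); the original extraction code is not shown on the slides.

    import json

    def extract_fields(raw_json):
        """Extract the tweet fields listed above from one Streaming API object.
        Field names follow the Twitter v1.1 tweet JSON (an assumption)."""
        t = json.loads(raw_json)
        return {
            "text": t.get("text"),
            "user_name": t.get("user", {}).get("screen_name"),
            "created_time": t.get("created_at"),
            "geo_location": t.get("geo"),
            "language_code": t.get("lang"),
            "favorite_count": t.get("favorite_count", 0),
            "retweeted_status": t.get("retweeted_status"),  # present only for retweets
            "retweet_count": t.get("retweet_count", 0),
        }
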
8. Data collection
   ● Training data annotated using a heuristic measure:
   ● Objective - the tweet was posted by a verified profile
   ● Subjective - the tweet contains at least one emoticon or emoji character

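A minimal sketch of this heuristic labelling rule. The emoticon list and emoji pattern are illustrative; the exact gazetteers used for annotation are not given on the slide.

    import re

    # Illustrative emoticon list and emoji character ranges (assumptions).
    EMOTICONS = {":)", ":-)", ":(", ":-(", ":D", ";)", ":P", ":'("}
    EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

    def heuristic_label(tweet):
        """Return 'objective', 'subjective', or None when neither rule fires."""
        if tweet.get("user", {}).get("verified"):
            return "objective"          # posted by a verified profile
        text = tweet.get("text", "")
        if EMOJI_RE.search(text) or any(e in text for e in EMOTICONS):
            return "subjective"         # contains an emoticon or emoji
        return None                     # not used as training data
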
9. Preprocessing
   ● All emoticons and emoji characters are removed from the corpus
   ● User mentions are replaced with the tag ‘MENTION’ (e.g. “@john said this” becomes “MENTION said this”)
   ● Punctuation symbols, including the pound (#) character, are removed
   ● URLs are replaced with the tag ‘URL’ (e.g. http://t.co/12d3 becomes URL)
   ● Numbers in a tweet are replaced by the tag ‘NUMERIC’
   ● Stop words are removed

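These steps map naturally to a chain of regular-expression substitutions; a sketch follows, with NLTK's English stop-word list assumed (the slide does not name a specific list) and emoticon/emoji removal omitted for brevity.

    import re
    from nltk.corpus import stopwords   # requires nltk.download('stopwords')

    STOP = set(stopwords.words("english"))

    def preprocess(text):
        text = re.sub(r"@\w+", "MENTION", text)       # "@john said this" -> "MENTION said this"
        text = re.sub(r"https?://\S+", "URL", text)   # "http://t.co/12d3" -> "URL"
        text = re.sub(r"\d+", "NUMERIC", text)        # numbers -> "NUMERIC"
        text = re.sub(r"[^\w\s]", " ", text)          # punctuation, including '#'
        tokens = [w for w in text.split() if w.lower() not in STOP]
        return " ".join(tokens)
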
10. Feature extraction
   ● Tweets are tokenized using the TweetNLP tokenizer (K. Gimpel, N. Schneider, and B. O’Connor, 2011)
   ● Words are stemmed using the Porter stemmer
   ● Stemmed unigrams and bigrams converted to binary Tf-Idf values (with Laplace smoothing)
   ● Binary feature - presence of slang words (using an external gazetteer)
   ● Binary feature - presence of bad words
   ● Unigrams, bigrams and trigrams of POS tags as binary Tf-Idf values
   ● Average number of misspelled words
   ● Average number of all-capital words
   ● Average number of hashtags

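A sketch of how the word n-gram part of this feature vector could be assembled with scikit-learn: binary Tf-Idf over unigrams/bigrams (smooth_idf providing the smoothing), with the extra lexical counts appended as additional columns. Variable names and settings are illustrative, not taken from the original code.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from scipy.sparse import hstack, csr_matrix

    # Binary Tf-Idf over (already stemmed) unigrams and bigrams.
    word_vec = TfidfVectorizer(ngram_range=(1, 2), binary=True, smooth_idf=True)

    def build_features(texts, lexical_counts):
        """texts: preprocessed, stemmed tweets; lexical_counts: per-tweet rows of
        numeric features (slang/bad-word flags, misspelling, all-caps and hashtag
        counts) computed elsewhere. The POS n-gram features would be a second
        vectorizer over the tag sequences, handled the same way."""
        X_words = word_vec.fit_transform(texts)
        X_lex = csr_matrix(lexical_counts)
        return hstack([X_words, X_lex])
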
11. Classifier selection
   ● A dataset of 6,000 tweets on Ebola (3,000 tweets from each class) is used to benchmark three classifiers:
     ○ Support Vector Machines
     ○ Logistic Regression
     ○ Naive Bayes
   ● Classifiers are trained on a random sample of 4,800 tweets; the remainder is used as the test set
   ● Classifier parameters are found using 10-fold cross-validation

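The benchmarking setup could look like the following sketch: a fixed 4,800-tweet training split and a 10-fold grid search per classifier. The parameter grids are illustrative; the slides do not list the ranges searched.

    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB

    def benchmark(X, y):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=4800, random_state=0)
        candidates = {
            "SVM": (LinearSVC(), {"C": [0.1, 1, 10]}),
            "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
            "Naive Bayes": (MultinomialNB(), {"alpha": [0.5, 1.0]}),
        }
        for name, (clf, grid) in candidates.items():
            search = GridSearchCV(clf, grid, cv=10)      # 10-fold cross-validation
            search.fit(X_tr, y_tr)
            print(name, search.best_params_, search.score(X_te, y_te))
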
12. Classifier performance
   ● SVM was selected because it had higher recall than Logistic Regression
   ● A higher recall results in a larger fraction of newsworthy tweets being detected

13. Contribution from features
   ● Measured using an ablation test
   ● Features divided into three sets:
     WRD - unigrams and bigrams
     LEX - all other lexical features

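An ablation test retrains the model with one feature group removed at a time; the score drop indicates that group's contribution. A minimal sketch, assuming each feature group is kept as a separate sparse block:

    from scipy.sparse import hstack
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    def ablation_scores(feature_blocks, y):
        """feature_blocks maps a group name (e.g. 'WRD', 'LEX') to its sparse
        feature matrix; each group is dropped in turn and the SVM re-evaluated."""
        scores = {}
        for held_out in feature_blocks:
            kept = [m for name, m in feature_blocks.items() if name != held_out]
            X = hstack(kept)
            scores[held_out] = cross_val_score(LinearSVC(), X, y, cv=10).mean()
        return scores
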
14. Selection of the POS tagger
   ● NLTK POS tagger
   ● Stanford tagger with GATE Twitter model (L. Derczynski et al., 2013)
   ● SENNA tagger (Ronan Collobert, 2011) - a “deep” convolutional neural network based tagger
   Example tweet: "Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… http://t.co/92JfMm2LaN | http://t.co/NoFij4iACl #news"
   NLTK tagger:
   [('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NNP'), ('Is', 'NNP'), ('Cured', 'NNP'), ('Dr', 'NNP'), ('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'NNP'), ('Be', 'NNP'), ('Released', 'NNP'), ('u2026', 'NNP'), ('|', 'NNP'), ('news', 'NN')]

15. Selection of the POS tagger (continued)
   "Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… http://t.co/92JfMm2LaN | http://t.co/NoFij4iACl #news"
   SENNA tagger:
   [('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NNP'), ('Is', 'VBZ'), ('Cured', 'VBN'), ('Dr', 'NNP'), ('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'TO'), ('Be', 'VB'), ('Released', 'VBN'), ('u2026', 'JJ'), ('|', 'NN'), ('news', 'NN')]
   Stanford tagger with GATE Twitter model:
   [('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NN'), ('Is', 'VBZ'), ('Cured', 'VBN'), ('Dr', 'NNP'), ('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'TO'), ('Be', 'VB'), ('Released', 'VBN'), ('u2026', '.'), ('|', ':'), ('news', 'NN')]

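For reference, the NLTK output above can be reproduced roughly as follows (a plain whitespace split stands in for the TweetNLP tokenizer used in the project):

    import nltk
    # nltk.download('averaged_perceptron_tagger')  # one-time model download

    tweet = ("Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… "
             "http://t.co/92JfMm2LaN | http://t.co/NoFij4iACl #news")
    print(nltk.pos_tag(tweet.split()))
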
16. Results - Data sets
   ● 1 million tweets containing the term ‘Ebola’
   ● 22,250 tweets related to the fifth Sri Lanka vs India ODI cricket match held on 16th November (objective - 465, subjective - 878)
     ○ Filtered using the terms “SLvIND”, “SLvsIND”, “INDvSL” and “INDvsSL”
   ● 6,800 tweets related to the fourth Sri Lanka vs England ODI cricket match held on 7th December (objective - 215, subjective - 242)
     ○ Filtered using the terms “SLvENG”, “SLvsENG”, “ENGvSL” and “ENGvsSL”

17. Gold standard data set
   ● A sample of 500 tweets on the topic ‘Ebola’ is annotated manually as objective or subjective (objective - 206, subjective - 294)
   ● Classifier scores on this data
   ● Errors: “RT @TheDailyEdge: UPDATE: Obama has reduced the US deficit by 70% and Ebola cases in the US by 100%.”
     It is hard to judge the objectivity of such sentences based only on syntactic information.

18. Comparison with prior research
   ● Event related tweets detection with user type recognition (L. Silva and E. Riloff, 2013)
     ○ A set of 6,000 tweets on disease outbreaks manually labeled using Amazon Mechanical Turk

       Classifier                      Precision   Recall   F1-score
       User type agnostic classifier   83.15       55.99    66.92
       User type specific classifier   80.35       66.07    72.15

   ● Twitter Sentiment Classification using Distant Supervision (A. Go, R. Bhayani and L. Huang, 2013)
     ○ An SVM model trained on syntactic features used for sentiment classification

       Features            Accuracy
       Unigram + Bigram    81.6
       Unigram + POS       81.9

19. Cross-domain applicability
   ● The classifier trained on Ebola tweets is applied to cricket-related tweets
   ● The classifier trained on the SL vs India match tweets also performed well on the SL vs England tweets

20. Summarizer
   ● Duplicate and near-duplicate tweets are abundant due to retweets and tweets generated by ‘Tweet’ buttons on news sites
   ● Removes duplicates among the objective tweets detected by the classifier
   ● Tweets discussing the same entities are clustered together

21. Near-duplicate removal
   ● Objective tweets are stripped of the following symbols: ‘RT’, ‘@-mentions’ and punctuation
   ● Jaccard similarity of tokens is used to detect duplicate tweets
   ● Two tweets are considered similar if their Jaccard similarity is greater than a threshold d

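A minimal sketch of the near-duplicate filter: each tweet's token set is compared against the tweets kept so far and the tweet is dropped when the Jaccard similarity exceeds the threshold. The value of d is illustrative; the slides leave it unspecified.

    def jaccard(a, b):
        """Jaccard similarity of two token sets."""
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def remove_near_duplicates(tweets, d=0.7):
        """tweets: objective tweets already stripped of 'RT', @-mentions and
        punctuation. d=0.7 is an illustrative threshold."""
        kept, kept_tokens = [], []
        for text in tweets:
            tokens = set(text.lower().split())
            if all(jaccard(tokens, prev) <= d for prev in kept_tokens):
                kept.append(text)
                kept_tokens.append(tokens)
        return kept
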
22. Clustering
   ● The goal is to cluster tweets mentioning the same entities together
     Eg: “#Miami #News NYC Doc Free of Ebola: Sources: Dr. Craig Spencer, the physician being treated for Ebola at Belle... http://t.co/iXSUk4axVV”
         “#Ebola so the good doctor Craig Spencer will go home - well - the nurse too free to roam but lest we forget 3 countries still suffer deeply”
   ● Vectors of NER tags are converted to Tf-Idf scores and the cosine value is used as the distance measure between two NER tag vectors
   ● DBSCAN is selected because the number of clusters is not required and it is capable of identifying arbitrarily shaped clusters

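A sketch of this entity-centric clustering step: each tweet is represented by the Tf-Idf vector of its named entities, and DBSCAN with a cosine metric groups tweets that share entities. The eps and min_samples values are illustrative; the slides do not give the parameters used.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import DBSCAN

    def cluster_by_entities(entity_lists, eps=0.5, min_samples=2):
        """entity_lists: named entities per tweet, e.g.
        [['Craig', 'Ebola', 'Patient', 'Spencer', 'US'], ...]."""
        docs = [" ".join(entities) for entities in entity_lists]
        X = TfidfVectorizer().fit_transform(docs)
        # No cluster count needed; arbitrarily shaped clusters; label -1 = noise.
        return DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(X)
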
23. Clustering - results
   ● The SVM classifier trained on the ebola-3000 data set is applied to a corpus of 24,038 unseen tweets retrieved on a single day (11-11-2014)
   ● 13,380 tweets are detected as objective, with 8,138 duplicates among them; clustering resulted in 332 clusters, while 2,751 tweets were labeled as noise
   ● Clusters depend on the quality of the Named Entity Recognizer
     Entities: ['Craig', 'Ebola', 'Patient', 'Spencer', 'US']

24. Clustering - discussion
   ● In contrast, this tweet was labeled as noise:
     “#Ebola Ebola Outbreak: US Free of Virus After New York Doctor Craig Spencer Cleared - International Business Times UK”
     Entities: ['Business', 'Craig', 'Ebola', 'Free', 'International', 'New', 'Outbreak', 'Spencer', 'Times', 'US', 'Virus', 'York']

25. Future work
   ● Improve cross-domain applicability
     ○ Find better features with less dependence on the domain
   ● A better methodology to evaluate summaries
   ● Improve clustering to also consider verbs
   ● Generate an abstractive summary
     ○ Generate novel sentences from the information contained in tweets
   ● Generate summaries in real time
