Identification and Characterization of Events in Social Media. Hila Becker, Thesis Defense
Social Media is Changing the World 2 Lady Gaga, Justin Bieber, and Britney Spears have more Twitter followers than the entire populations of some countries (e.g., Israel, Greece) YouTube is the second largest search engine in the world Every minute, 24 hours of video are uploaded to YouTube Over the past five years people uploaded 6,000,000,000 images to Flickr
3 Source: http://www.searchenginejournal.com/the-growth-of-social-media-an-infographic/
Event Content in Social Media 4
5 MIKE CLARKE/AFP/Getty Images
6 Source: Tweets from Tahrir, edited by Nadia Idle and Alex Nunns
Event Identification, Characterization, and Content Selection Identify events and their associated social media documents In a timely manner Across different social media sites Characterize events along different dimensions  Select high-quality, relevant, useful event documents 8
Event Content in Social Media Challenges: Wide variety of topics, not all related to events (e.g., personal status updates, every-day mundane conversations) Unconventional text: abbreviations, typos Large-scale, rapidly produced content Opportunities: Content generated in real-time, as events happen Rich context features (e.g., time, location) Users’ perspective 9
Event Content in Social Media 10 Prior work, organized by timeliness (real-time vs. retrospective) and content discovery (unknown vs. known). Real-time, unknown: Twitter new event detection [Petrović et al. NAACL’10]. Retrospective, unknown: Event detection on Flickr [Chen and Roy CIKM’09]. Retrospective, known: Organization of YouTube concert videos [Kennedy and Naaman WWW’09]. Real-time, known: Earthquake prediction using Twitter [Sakaki et al. WWW’10].
Event Content in Social Media 11 Trending Event (unknown content discovery) is a real-world occurrence described by: one or more terms and a time period, where the volume of messages posted for the terms in the time period exceeds some expected level of activity. Planned Event (known content discovery) is a real-world occurrence with a corresponding published event record consisting of: a title, describing the subject of the event, and the time at which the event is planned to occur.
Contributions Trend (and trending event) study, for characterizing and differentiating between different types of trends Online clustering framework with an event classification step for identifying trending  events and their associated documents in social media Social media document similarity metric learning approaches Query formulation strategies for identifying social media documents for planned events Selection techniques for identifying high quality, relevant, and useful event content 12 Unknown Known Unknown/Known
Identification and Characterization of Events in Social Media 16 Characterization of trending events; Identification of trending events; Similarity metric learning for trending events; Identification of content for planned events; Selection of event content
What Types of Trends Exist in Social Media? Taxonomy of trends Characterization of each trend Manually assigned categories Automatically computed features Analysis of differences between trend types according to each characteristic 17 Trending Events Non-Event Trends
Trends Trend: One or more terms and a time period  Volume of messages posted for the terms in the time period exceeds some expected level of activity May or may not reflect a real-world occurrence A trending event is a type of trend 18
Twitter Content Streams of textual messages Brief content (140 characters) Communicated to network of followers Provide timely reflection of thoughts and interests 19
Characterizing Trends on Twitter Collect a set of Twitter trends Burst detection Twitter’s “trending topics” Qualitative analysis: trend taxonomy Quantitative analysis Automatically compute features of each trend and corresponding messages Manually label each trend according to categories introduced by the taxonomy Identify differences between trend categories according to automatically computed features 20
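The slides define a trend as terms whose message volume in a time period exceeds some expected level of activity, and mention burst detection without naming a method. The sketch below is one simple way to read that definition, not the method used in the study: it assumes the expected level is the mean of past per-window counts plus `k` standard deviations. The function name, `k`, and the sample counts are all hypothetical.

```python
from statistics import mean, pstdev


def is_bursty(history, current, k=3.0):
    """Return True if `current` exceeds the expected level of activity.

    Assumption: "expected level" is modeled as mean + k * std of the past
    per-window message counts. The slides do not specify the model.
    """
    if not history:
        return False  # no baseline to compare against
    expected = mean(history)
    spread = pstdev(history)
    return current > expected + k * spread


# Hypothetical hourly message counts for one term.
hourly_counts = [12, 15, 11, 14, 13, 12]
print(is_bursty(hourly_counts, 60))  # a large spike over the baseline
print(is_bursty(hourly_counts, 15))  # within normal variation
```

A larger `k` makes the detector more conservative, which trades missed trends for fewer false alarms.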
Affinity Diagram Method 21
Endogenous vs. Exogenous Trends Endogenous Trends: Twitter-centric activities that do not correspond to external events (e.g., a popular post by a celebrity) Exogenous Trends: trending events that originated outside of the Twitter system (e.g., an earthquake) Do exogenous and endogenous trends exhibit different characteristics? 22
Characterization of Trends and Trending Events	 Automatically computed features Content Features Interaction Features Time-based Features Participation Features Social Network Features Compared differences between categories Hypotheses guided by differences in categories according to feature types Performed t-tests for significance analysis 23
Contributions of the Study Trends fall into two main categories: exogenous (i.e., trending event) and endogenous (i.e., platform-centric trend) There are significant differences between exogenous and endogenous trends Proportion of messages with URLs Unique hashtag in top 10% of messages Proportion of retweets Reciprocity 24
Identification and Characterization of Events in Social Media 25 Characterization of trending events; Identification of trending events; Similarity metric learning for trending events; Identification of content for planned events; Selection of event content
Identifying Trending Events Event Clusters Documents Document Clusters 26
Identifying Trending Events in Real-Time Order documents by post time Use tf-idf vector representation of textual content Stop word elimination Stemming idf computed over past data Separate tweets by location Focus on tweets from NYC Different locations can be processed in parallel 27
Clustering Algorithm Many alternatives possible! [Berkhin 2002] Single-pass incremental clustering algorithm Scalable, online solution Using centroid representation Used effectively for: Event identification in textual news [Allan et al. 1998]; News event detection on Twitter [Sankaranarayanan et al. 2009] Does not require a priori knowledge of number of clusters Parameters: Similarity Function σ, Threshold μ 28
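A single-pass incremental clustering loop with centroid representation, similarity function σ, and threshold μ can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: documents are sparse term-weight dictionaries (standing in for tf-idf vectors), they arrive already ordered by post time, σ defaults to cosine similarity, and the value of `mu` is arbitrary.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


class Cluster:
    """Keeps a running sum so the centroid is the average member vector."""

    def __init__(self):
        self.total = defaultdict(float)
        self.size = 0

    def add(self, vec):
        for t, w in vec.items():
            self.total[t] += w
        self.size += 1

    def centroid(self):
        return {t: w / self.size for t, w in self.total.items()}


def single_pass_cluster(docs, sigma=cosine, mu=0.4):
    """Assign each document to its most similar cluster, or open a new one.

    Each document is seen exactly once. A new cluster is created when the
    best similarity to any existing centroid falls below the threshold mu.
    """
    clusters, labels = [], []
    for vec in docs:
        best, best_sim = None, -1.0
        for i, c in enumerate(clusters):
            s = sigma(vec, c.centroid())
            if s > best_sim:
                best, best_sim = i, s
        if best is None or best_sim < mu:
            clusters.append(Cluster())
            best = len(clusters) - 1
        clusters[best].add(vec)
        labels.append(best)
    return labels


# Hypothetical documents, already ordered by post time.
docs = [
    {"earthquake": 1.0, "japan": 1.0},
    {"earthquake": 1.0, "tokyo": 1.0},
    {"concert": 1.0, "nyc": 1.0},
    {"concert": 1.0},
]
print(single_pass_cluster(docs))  # [0, 0, 1, 1]
```

The number of clusters is never fixed in advance; it grows whenever a document is not similar enough to any existing centroid, which matches the "no a priori knowledge of number of clusters" property on the slide.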
Overview of Cluster-based Approach 29 Group similar documents via online clustering. Compute statistics of cluster content: top terms (e.g., [earthquake, japan]), number of documents per hour, ... Use cluster-level features to identify trending event clusters: a single feature with a threshold (e.g., increase in volume over time-window [Petrović et al. 2010]) or a trained classification model.
Event Classification on Twitter Cluster-level features Social interaction  Topic coherence Trending behavior Platform-centric  Event classifier Human-annotated training data SVM model (selected during training phase) 30
Experimental Setup Classification accuracy Baseline: Naïve Bayes text classification (NB-Text) [Sankaranarayanan et al. 2009] 10-fold cross validation Blind test set of randomly chosen tweets Event surfacing: select top event clusters per hour Baselines Fastest-growing clusters per hour (Fastest) [Petrović et al. 2010] Randomly selected clusters per hour (Random) 5 hours, top-20 clusters per hour 31
Identified Events 32 A sample of events identified by our classifiers on the test set
Classification Performance (F-measure) RW-Event event classifier is more effective at discriminating between real-world events and rest of Twitter data 33
NDCG@K Evaluation 34 Performance of event classifier and baselines for event surfacing task.
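For readers unfamiliar with the metric, NDCG@K can be computed as in the sketch below. It uses one common formulation (linear gain, log2 rank discount) and derives the ideal ordering from the same list of judged results; the slides do not say which variant was used, so treat these choices as assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical binary judgments for the top clusters surfaced in one hour:
# 1 = the cluster is an event, 0 = it is not.
print(ndcg_at_k([1, 1, 0], 3))  # 1.0, already in the ideal order
print(ndcg_at_k([0, 1], 2))     # below 1.0, the event is ranked second
```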
Identification and Characterization of Events in Social Media 35 Characterization of trending events; Identification of trending events; Similarity metric learning for trending events; Identification of content for planned events; Selection of event content
Social Media Document Representation Title Description Tags Date/Time Location All-Text 36
Social Media Document Similarity 37 Each feature of documents A and B is compared with its own metric. Text (Title, Description, Tags, All-Text): cosine similarity of tf-idf vectors (tf-idf version?; stemming?; stop word elimination?). Date/Time: proximity in minutes. Location: geo-coordinate proximity.
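The time and location metrics are only named on the slide ("proximity in minutes", "geo-coordinate proximity"). One plausible way to turn them into similarity scores in [0, 1] is sketched below. The normalization choices are assumptions made for illustration: time gaps are scaled by the number of minutes in a year, and distances by half the Earth's circumference using the Haversine formula.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60   # assumed normalization window
EARTH_RADIUS_KM = 6371.0
MAX_DISTANCE_KM = math.pi * EARTH_RADIUS_KM  # farthest two points can be


def time_similarity(t1_min, t2_min, window=MINUTES_PER_YEAR):
    """1.0 for identical timestamps, falling linearly to 0.0 at `window`."""
    gap = abs(t1_min - t2_min)
    return max(0.0, 1.0 - gap / window)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def location_similarity(a, b):
    """1.0 for the same point, approaching 0.0 for antipodal points."""
    return max(0.0, 1.0 - haversine_km(*a, *b) / MAX_DISTANCE_KM)


nyc = (40.7128, -74.0060)
la = (34.0522, -118.2437)
print(time_similarity(0, 90))        # photos taken 90 minutes apart
print(location_similarity(nyc, la))  # coast-to-coast distance
```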
Clustering Algorithm Many alternatives possible! [Berkhin 2002] Single-pass incremental clustering algorithm Scalable, online solution Using centroid representation Used effectively for: Event identification in textual news [Allan et al. 1998]; News event detection on Twitter [Sankaranarayanan et al. 2009] Does not require a priori knowledge of number of clusters Parameters: Similarity Function σ, Threshold μ 38
Cluster Representation and Parameter Tuning Centroid cluster representation: Average tf-idf scores, Average time, Geographic mid-point Parameter tuning in supervised training phase Clustering quality metrics to optimize: Normalized Mutual Information (NMI) and B-Cubed [Amigó et al. 2008; Strehl et al. 2002] 39
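As an illustration of one of the two quality metrics, B-Cubed precision and recall can be computed per document and then averaged, as in this sketch. It follows the standard definition of the metric; the label lists are hypothetical, and the quadratic loop is written for clarity rather than speed.

```python
def b_cubed(predicted, truth):
    """Return (precision, recall, F1) under the B-Cubed definition.

    For each document, precision is the fraction of documents in its
    predicted cluster that share its true event, and recall is the fraction
    of documents from its true event that landed in its predicted cluster.
    """
    n = len(predicted)
    if n == 0:
        return 0.0, 0.0, 0.0
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if predicted[j] == predicted[i]]
        same_event = [j for j in range(n) if truth[j] == truth[i]]
        correct = sum(1 for j in same_cluster if truth[j] == truth[i])
        precision += correct / len(same_cluster)
        recall += correct / len(same_event)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: one big cluster swallowing two true events.
print(b_cubed([0, 0, 0, 0], ["a", "a", "b", "b"]))  # precision 0.5, recall 1.0
```

Merging everything into one cluster maximizes recall at the cost of precision, which is why a threshold tuned on F1 (or on NMI) is preferable to tuning on either component alone.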
Learning a Similarity Metric for Clustering Ensemble-based similarity Training a cluster ensemble Computing a similarity score by: Combining individual partitions Combining individual similarities Classification-based similarity Training data sampling strategies Modeling strategies 40
Overview of a Cluster Ensemble Algorithm 41 Clusterers Ctitle, Ctags, Ctime, each with a weight (Wtitle, Wtags, Wtime) learned in a training step. Consensus function f(C,W): combines the ensemble similarities into an ensemble clustering solution.
Overview of a Cluster Ensemble Algorithm: Combining Partitions 42 The partitions produced by Ctitle, Ctags, Ctime are weighted by Wtitle, Wtags, Wtime and combined by f(C,W).
Overview of a Cluster Ensemble Algorithm: Combining Similarities 43 For each document di and cluster cj, the tests σCtitle(di,cj) > μCtitle, σCtags(di,cj) > μCtags, σCtime(di,cj) > μCtime are weighted by Wtitle, Wtags, Wtime and combined by f(C,W).
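One way to read the "combining similarities" diagram is as a weighted vote: each clusterer votes "same cluster" when its similarity σ exceeds its own threshold μ, and votes are weighted by W. The sketch below implements that reading. The exact form of f(C,W) is not given on the slides, so the normalized weighted vote, along with all numeric values, is an assumption.

```python
def consensus_score(sims, thresholds, weights):
    """Weighted fraction of clusterers that vote for assigning d_i to c_j.

    sims[f]       : similarity of document d_i to cluster c_j on feature f
    thresholds[f] : per-clusterer threshold (mu) learned in training
    weights[f]    : per-clusterer weight (W) learned in training
    """
    total = sum(weights.values())
    vote = sum(weights[f] for f in sims if sims[f] > thresholds[f])
    return vote / total if total else 0.0


# Hypothetical values for one (document, cluster) pair.
sims = {"title": 0.8, "tags": 0.2, "time": 0.9}
thresholds = {"title": 0.5, "tags": 0.5, "time": 0.7}
weights = {"title": 0.5, "tags": 0.3, "time": 0.2}
print(consensus_score(sims, thresholds, weights))  # title and time vote yes
```

The resulting score can play the role of σ in the single-pass clustering step, with its own threshold deciding whether to join an existing cluster or start a new one.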
Classification-based Similarity Metrics Classify pairs of documents as similar/dissimilar Feature vector: pairwise similarity scores, one feature per similarity metric (e.g., time-proximity, location-proximity, ...) Modeling strategies: Document pairs; Document-centroid pairs 45
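The document-pair modeling strategy can be sketched as follows: every pair of labeled documents becomes one training example whose features are the individual similarity scores and whose label says whether the two documents belong to the same event. The two toy metrics, the field names, and the weights passed to `logistic_score` are hypothetical; a real model would learn the weights from the training pairs.

```python
import math
from itertools import combinations


def tag_jaccard(a, b):
    """Toy text metric: overlap between the two documents' tag sets."""
    sa, sb = set(a["tags"]), set(b["tags"])
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def time_proximity(a, b, window=1440.0):
    """Toy time metric: 1.0 at the same minute, 0.0 a day or more apart."""
    return max(0.0, 1.0 - abs(a["minute"] - b["minute"]) / window)


def pair_features(a, b, metrics):
    """One feature per similarity metric, as described on the slide."""
    return [metric(a, b) for metric in metrics]


def training_pairs(docs, metrics):
    """Yield (features, label) with label 1 if both docs share an event."""
    for a, b in combinations(docs, 2):
        yield pair_features(a, b, metrics), int(a["event"] == b["event"])


def logistic_score(features, weights, bias):
    """Similarity as the probability output of a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


docs = [
    {"tags": ["jazz", "nyc"], "minute": 0, "event": 1},
    {"tags": ["jazz", "live"], "minute": 60, "event": 1},
    {"tags": ["marathon"], "minute": 5000, "event": 2},
]
metrics = [tag_jaccard, time_proximity]
for features, label in training_pairs(docs, metrics):
    print(features, label)
```

For the document-centroid strategy, the second element of each pair would be a cluster centroid rather than a document, with the same feature construction.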
Experiments: Alternative Similarity Metrics Ensemble-based techniques: Combining individual partitions (ENS-PART), Combining individual similarities (ENS-SIM) Classification-based techniques: Modeling: document-document vs. document-centroid pairs; Logistic Regression (CLASS-LR), Support Vector Machines (CLASS-SVM) Baselines: Title, Description, Tags, All-Text, Time-Proximity, Location-Proximity 46
Experimental Setup Datasets: Upcoming >270K Flickr photos Event labels from the “upcoming” event database (upcoming:event=12345) Split into 3 parts for training/validation/testing LastFM >594K Flickr photos Event labels from last.fm music catalog (lastfm:event=6789) Used as an additional test set 47
Clustering Accuracy over Upcoming Test Set All similarity learning techniques outperform the baselines Classification-based techniques perform better than ensemble-based techniques 48
NMI: Clustering Accuracy over Both Test Sets (Upcoming and LastFM) 49 Similarity learning models trained on Upcoming data show similar trends when tested on LastFM data
Identification and Characterization of Events in Social Media 50 Characterization of trending events; Identification of trending events; Similarity metric learning for trending events; Identification of content for planned events; Selection of event content
Identifying Content for Planned Events Identify planned event documents given known event information User-contributed planned event records LastFM Events EventBrite Facebook Events Structured features (e.g., title, time, location) Challenging identification scenario Known event information is often inaccurate or incomplete Social media documents are brief and noisy 51
Planned Event Record 52 Title Description Date/Time Venue City
Approach for Known Identification Scenario Two-step query formulation strategy Precision-oriented queries using known event features Recall-oriented queries using retrieved content from precision-oriented queries Leverage cross-site content Identify event documents on each site individually Use event documents on one site to retrieve additional event documents on a different site 53
Query Formulation Strategies Precision-oriented Queries:  Combined event record features Phrase, bag-of-words, stop word elimination Examples: [“title”+”venue”], [title-no-stopwords+”city”] Recall-oriented Queries Frequency Analysis Frequent terms in the event’s retrieved content Infrequently found in Web documents Term Extraction 54
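The two example precision-oriented queries on the slide, ["title"+"venue"] and [title-no-stopwords+"city"], can be generated from an event record along these lines. The stop word list, the record, and the query syntax (double quotes for phrases, a space for conjunction) are assumptions for illustration; each site's search interface has its own syntax.

```python
# Illustrative stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "on", "the", "to", "with"}


def phrase(text):
    """Wrap text in double quotes so it is matched as an exact phrase."""
    return f'"{text}"'


def without_stop_words(text):
    """Bag-of-words form of the text with stop words removed."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def precision_queries(record):
    """Combine event record features into precision-oriented queries."""
    title, venue, city = record["title"], record["venue"], record["city"]
    return [
        f"{phrase(title)} {phrase(venue)}",             # ["title"+"venue"]
        f"{without_stop_words(title)} {phrase(city)}",  # [title-no-stopwords+"city"]
    ]


# Hypothetical planned event record.
record = {"title": "Night of the Jazz Guitar", "venue": "Blue Hall", "city": "New York"}
for query in precision_queries(record):
    print(query)
```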
Leveraging Cross-Site Content 55 Build precision-oriented queries using planned event features (e.g., [title+city], [title+venue], ...). Use precision-oriented queries to retrieve data from: Twitter (tweet1, tweet2, ..., tweetn), Flickr (photo1, photo2, ..., photon), YouTube (video1, video2, ..., videon). Build recall-oriented queries using data from: each site individually; all sites collectively.
Experimental Settings 60 planned events from EventBrite, LastFM, LinkedIn, and Facebook Corresponding social media documents Retrieved from Twitter, Flickr, and YouTube Ranked according to similarity to event record Techniques Precision: only precision-oriented queries MS: precision- and recall-oriented queries selected using Microsoft n-gram probability score RTR: precision- and recall-oriented queries selected using ratio of document frequency around the time of the event to document frequency in larger time window 56
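The RTR criterion described above is a ratio of two document frequencies, so candidate recall-oriented terms can be scored as in this sketch. The selection threshold `min_ratio` and the sample counts are hypothetical; the slides state only how the ratio is formed, not how terms are cut off.

```python
def rtr_score(df_event_window, df_larger_window):
    """Document frequency near the event divided by that in a larger window."""
    if df_larger_window == 0:
        return 0.0
    return df_event_window / df_larger_window


def select_terms(stats, min_ratio=0.5):
    """Keep terms concentrated around the event time, best first.

    stats maps term -> (df around the event, df in the larger window).
    """
    scored = {t: rtr_score(near, wide) for t, (near, wide) in stats.items()}
    kept = [t for t, s in scored.items() if s >= min_ratio]
    return sorted(kept, key=lambda t: -scored[t])


# Hypothetical frequencies for terms taken from precision-query results.
stats = {"bandshell": (40, 50), "music": (300, 9000), "encore": (12, 20)}
print(select_terms(stats))  # ['bandshell', 'encore']
```

A generic term such as "music" appears often around the event but just as often at other times, so its ratio is low and it is dropped.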
NDCG Performance on Twitter 57 NDCG scores for top-k Twitter documents retrieved by  Precision-oriented queries (Precision), and query strategies  using Twitter data (Twitter-RTR, Twitter-MS).
Cross-Site NDCG Performance 58 NDCG scores for top-k YouTube documents retrieved by  Precision-oriented queries (Precision), and query strategies  using data from Twitter (Twitter-MS) and YouTube (YouTube MS).
Identification and Characterization of Events in Social Media 59 Characterization of trending events; Identification of trending events; Similarity metric learning for trending events; Identification of content for planned events; Selection of event content
Event Content Selection 60  Tiger Woods to make a public apology Friday and talk about his future in golf. Tiger woods y'all,tiger woods y'all,ah tiger woods y'all Tiger Woods Apology Tiger Woods Hugs: http://tinyurl.com/yhf4uzw Tiger Woods Returns To Golf - Public Apology http://bit.ly/9Ui5jx Wedge wars upstage Watson v Woods: BBC Sport (blog)
Event Content Selection Challenges: Document clusters contain noise Relevant documents might have poor quality text Relevant, high quality documents might not be interesting For each document and a given event evaluate Quality Relevance Usefulness 61
Centrality Based Document Selection Centroid: Cosine similarity of each document to cluster centroid Degree: Documents are nodes; Documents are connected if their similarity is above a threshold; Compute degree centrality of each node LexRank [Erkan and Radev 2004]: Same graph structure as Degree method; Central documents are similar to other central documents 62
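The Centroid and Degree methods can be sketched directly from their descriptions. The example below assumes documents are sparse term-weight dictionaries and picks an arbitrary edge threshold; LexRank is omitted because it additionally requires an iterative computation over the same graph.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid_scores(docs):
    """Centroid method: similarity of each document to the cluster centroid."""
    total = defaultdict(float)
    for d in docs:
        for t, w in d.items():
            total[t] += w
    centroid = {t: w / len(docs) for t, w in total.items()}
    return [cosine(d, centroid) for d in docs]


def degree_scores(docs, threshold=0.3):
    """Degree method: number of other documents with similarity > threshold."""
    return [
        sum(1 for j, other in enumerate(docs)
            if j != i and cosine(d, other) > threshold)
        for i, d in enumerate(docs)
    ]


def top_k(scores, k):
    """Indices of the k highest-scoring documents."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


# Hypothetical cluster about one event, with one off-topic document.
docs = [
    {"tiger": 1.0, "woods": 1.0, "apology": 1.0},
    {"tiger": 1.0, "woods": 1.0, "golf": 1.0},
    {"tiger": 1.0, "woods": 1.0, "apology": 1.0, "golf": 1.0},
    {"wedge": 1.0, "wars": 1.0},
]
print(top_k(centroid_scores(docs), 1))  # [2]
print(degree_scores(docs))              # [2, 2, 2, 0]
```

Both methods push the off-topic document to the bottom, but only the centroid score separates the three on-topic documents from one another in this small example.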
Experimental Methodology: Content Selection 50 event clusters Randomly selected 5 top tweets per event for each: Centroid, Degree, LexRank Labeled on a 1-4 scale Quality: excellent (4) to poor (1) Relevance: clearly relevant (4) to not relevant (1) Usefulness: clearly useful (4) to not useful (1) 63
Content Selection Results Average scores over all events (out of 4) High quality and relevance (>3) for both Degree and Centroid Centroid is the only method with high usefulness 64
Conclusions Techniques for identifying, characterizing, and selecting social media content for events There are significant differences between types of trends in social media, specifically trending events and non-event trends Trending events and their associated social media documents can be effectively identified using online clustering with: A classification step to separate event and non-event content Social media document similarity metrics for documents with rich context features A two-step query formulation technique is useful for identifying planned events across different social media sites  Centrality-based techniques can be used to select high quality, relevant, and useful social media event content 65
Future Work Clustering framework optimization Blocking techniques Topic models Identify unknown events with learned similarity metrics across sites Improve breadth of event content Rank events for search and presentation Extension of content selection techniques Learned ranking models 66
Publications Hila Becker, Dan Iter, Mor Naaman, Luis Gravano, “Identifying Content for Planned Events Across Social Media Sites,” under submission. Hila Becker, Mor Naaman, Luis Gravano, “Beyond Trending Topics: Real-World Event Identification on Twitter,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), short paper. Hila Becker, Mor Naaman, Luis Gravano, “Selecting Quality Twitter Content for Events,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), short paper. Hila Becker, Feiyang Chen, Dan Iter, Mor Naaman, Luis Gravano, “Automatic Identification and Presentation of Twitter Content for Planned Events,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), demo paper. Mor Naaman, Hila Becker, Luis Gravano, “Hip and Trendy: Characterizing Emerging Trends on Twitter,” in Journal of the American Society for Information Science and Technology. Hila Becker, Mor Naaman, Luis Gravano, “Learning Similarity Metrics for Event Identification in Social Media,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10), 291-300. Hila Becker, Bai Xiao, Mor Naaman, and Luis Gravano, “Exploiting Social Links for Event Identification in Social Media,” in Proceedings of the 3rd Annual Workshop on Search in Social Media (SSM '10), poster paper. Hila Becker, Mor Naaman, Luis Gravano, “Event Identification in Social Media,” in Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB '09), 2009. 67
Thank You! 68

 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

KĂŒrzlich hochgeladen (20)

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Identification and Characterization of Events in Social Media

  • 1. Identification and Characterization of Events in Social Media Hila Becker, Thesis Defense
  • 2. Social Media is Changing the World 2 Lady Gaga, Justin Bieber, and Britney Spears have more Twitter followers than the entire populations of some countries (e.g., Israel, Greece) YouTube is the second largest search engine in the world Every minute, 24 hours of video are uploaded to YouTube Over the past five years people uploaded 6,000,000,000 images to Flickr
  • 4. Event Content in Social Media 4
  • 6. 6 Source: Tweets from Tahrir, edited by Nadia Idle and Alex Nunns
  • 8. Event Identification, Characterization, and Content Selection Identify events and their associated social media documents In a timely manner Across different social media sites Characterize events along different dimensions Select high-quality, relevant, useful event documents 8
  • 9. Event Content in Social Media Challenges: Wide variety of topics, not all related to events (e.g., personal status updates, every-day mundane conversations) Unconventional text: abbreviations, typos Large-scale, rapidly produced content Opportunities: Content generated in real-time, as events happen Rich context features (e.g., time, location) Users’ perspective 9
  • 10. Event Content in Social Media 10 Timeliness Real-time Retrospective Twitter new event detection [Petrović et al. NAACL’10] Event detection on Flickr [Chen and Roy CIKM’09] Unknown Content Discovery Organization of YouTube concert videos [Kennedy and Naaman WWW’09] Earthquake prediction using Twitter [Sakaki et al. WWW’10] Known
  • 11. Event Content in Social Media 11 Trending Event is a real-world occurrence described by: One or more terms and a time period Volume of messages posted for the terms in the time period exceeds some expected level of activity Unknown Content Discovery Planned Event is a real-world occurrence with corresponding published event record consisting of: Title, describing the subject of the event The time at which the event is planned to occur Known
  • 12. Contributions Trend (and trending event) study, for characterizing and differentiating between different types of trends Online clustering framework with an event classification step for identifying trending events and their associated documents in social media Social media document similarity metric learning approaches Query formulation strategies for identifying social media documents for planned events Selection techniques for identifying high quality, relevant, and useful event content 12 Unknown Known Unknown/Known
  • 13. Contributions Trend (and trending event) study, for characterizing and differentiating between different types of trends Online clustering framework with an event classification step for identifying trending events and their associated documents in social media Social media document similarity metric learning approaches Query formulation strategies for identifying social media documents for planned events Selection techniques for identifying high quality, relevant, and useful event content 13 Known Unknown/Known
  • 14. Contributions Trend (and trending event) study, for characterizing and differentiating between different types of trends Online clustering framework with an event classification step for identifying trending events and their associated documents in social media Social media document similarity metric learning approaches Query formulation strategies for identifying social media documents for planned events Selection techniques for identifying high quality, relevant, and useful event content 14 Unknown/Known
  • 15. Contributions Trend (and trending event) study, for characterizing and differentiating between different types of trends Online clustering framework with an event classification step for identifying trending events and their associated documents in social media Social media document similarity metric learning approaches Query formulation strategies for identifying social media documents for planned events Selection techniques for identifying high quality, relevant, and useful event content 15
  • 16. Identification and Characterization of Events in Social Media 16 Characterization of trending events Identification of trending events Similarity metric learning for trending events Identification of content for planned events Selection of event content
  • 17. What Types of Trends Exist in Social Media? Taxonomy of trends Characterization of each trend Manually assigned categories Automatically computed features Analysis of differences between trend types according to each characteristic 17 Trending Events Non-Event Trends
  • 18. Trends Trend: One or more terms and a time period Volume of messages posted for the terms in the time period exceeds some expected level of activity May or may not reflect a real-world occurrence A trending event is a type of trend 18
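The trend definition above (message volume exceeding an expected level of activity) can be illustrated with a minimal burst check. This is only a sketch: the slides do not state how the expected level is computed, so the mean-plus-`k`-standard-deviations rule, the function name `is_trending`, and the sample counts are all assumptions made for illustration.

```python
from statistics import mean, stdev

def is_trending(history, current, k=3.0):
    """Flag a burst when the current count exceeds mean + k * stdev of past counts.

    `history` holds message counts for the same terms in earlier time periods.
    The threshold rule is an illustrative assumption, not the thesis's method.
    """
    if len(history) < 2:
        return False  # not enough data to estimate an expected level
    expected = mean(history)
    spread = stdev(history)
    return current > expected + k * spread

# Hypothetical hourly counts for one set of terms.
hourly_counts = [12, 15, 11, 14, 13, 12]
print(is_trending(hourly_counts, 90))
print(is_trending(hourly_counts, 16))
```

A larger `k` makes the check more conservative, which trades recall of small trends for fewer false alarms.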
  • 19. Twitter Content Streams of textual messages Brief content (140 characters) Communicated to network of followers Provide timely reflection of thoughts and interests 19
  • 20. Characterizing Trends on Twitter Collect a set of Twitter trends Burst detection Twitter’s “trending topics” Qualitative analysis: trend taxonomy Quantitative analysis Automatically compute features of each trend and corresponding messages Manually label each trend according to categories introduced by the taxonomy Identify differences between trend categories according to automatically computed features 20
  • 22. Endogenous vs. Exogenous Trends Endogenous Trends: Twitter-centric activities that do not correspond to external events (e.g., a popular post by a celebrity) Exogenous Trends: trending events that originated outside of the Twitter system (e.g., an earthquake) Do exogenous and endogenous trends exhibit different characteristics? 22
  • 23. Characterization of Trends and Trending Events Automatically computed features Content Features Interaction Features Time-based Features Participation Features Social Network Features Compared differences between categories Hypotheses guided by differences in categories according to feature types Performed t-tests for significance analysis 23
  • 24. Contributions of the Study Trends fall into two main categories: exogenous (i.e., trending event) and endogenous (i.e., platform-centric trend) There are significant differences between exogenous and endogenous trends Proportion of messages with URLs Unique hashtag in top 10% of messages Proportion of retweets Reciprocity 24
  • 25. Identification and Characterization of Events in Social Media 25 Characterization of trending events Identification of trending events Similarity metric learning for trending events Identification of content for planned events Selection of event content
  • 26. Identifying Trending Events Event Clusters Documents Document Clusters 26
  • 27. Identifying Trending Events in Real-Time Order documents by post time Use tf-idf vector representation of textual content Stop word elimination Stemming idf computed over past data Separate tweets by location Focus on tweets from NYC Different locations can be processed in parallel 27
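A minimal sketch of the tf-idf representation described on this slide, with idf computed over past data and stop words removed. Stemming is omitted to keep the example dependency-free, and the stop-word list, the smoothed idf formula, and all function names are assumptions for illustration rather than the configuration used in the thesis.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "in", "of", "and", "to", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming here)."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def idf_table(past_docs):
    """Compute smoothed idf over past documents; also return idf for unseen terms."""
    n = len(past_docs)
    df = Counter()
    for doc in past_docs:
        df.update(set(tokenize(doc)))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    unseen_idf = math.log(1 + n) + 1
    return idf, unseen_idf

def tfidf(text, idf, unseen_idf):
    """Represent a new document as a sparse term -> weight dictionary."""
    tf = Counter(tokenize(text))
    return {t: c * idf.get(t, unseen_idf) for t, c in tf.items()}

past = ["snow in nyc", "japan travel tips", "japan food"]
idf, unseen = idf_table(past)
print(tfidf("earthquake in japan", idf, unseen))
```

Because idf comes from past data only, each incoming message can be vectorized as it arrives without revisiting earlier documents.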
  • 28. Clustering Algorithm Many alternatives possible! [Berkhin 2002] Single-pass incremental clustering algorithm Scalable, online solution Using centroid representation Used effectively for Event identification in textual news [Allan et al. 1998] News event detection on Twitter [Sankaranarayanan et al. 2009] Does not require a priori knowledge of number of clusters Parameters: Similarity Function σ Threshold μ 28
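The single-pass incremental algorithm named on this slide can be sketched as follows: each document is compared against existing cluster centroids, joins the most similar one if the similarity reaches the threshold, and otherwise starts a new cluster. The sparse-dictionary vectors, the running-average centroid update, the tie-breaking, and the names `single_pass_cluster` and `mu` are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass_cluster(docs, sim=cosine, mu=0.5):
    """Assign each document, in arrival order, to the best cluster or a new one."""
    clusters = []  # each cluster: {"centroid": dict, "members": [doc indices]}
    for i, vec in enumerate(docs):
        best, best_sim = None, 0.0
        for c in clusters:
            s = sim(vec, c["centroid"])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= mu:
            n = len(best["members"])
            terms = set(best["centroid"]) | set(vec)
            # Update the centroid as the running average of member vectors.
            best["centroid"] = {
                t: (best["centroid"].get(t, 0.0) * n + vec.get(t, 0.0)) / (n + 1)
                for t in terms
            }
            best["members"].append(i)
        else:
            clusters.append({"centroid": dict(vec), "members": [i]})
    return clusters

docs = [
    {"earthquake": 1.0, "japan": 1.0},
    {"earthquake": 1.0, "japan": 1.0, "tsunami": 1.0},
    {"concert": 1.0, "nyc": 1.0},
]
print([c["members"] for c in single_pass_cluster(docs, mu=0.5)])
```

The number of clusters is never specified in advance; it emerges from the threshold, which is why the threshold is tuned in a training phase.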
  • 29. Overview of Cluster-based Approach Group similar documents via online clustering Compute statistics of cluster content Top terms (e.g., [earthquake, japan]) Number of documents per hour … Use cluster-level features to identify trending event clusters Single feature with threshold (e.g., increase in volume over time-window [Petrović et al. 2010]) Trained classification model 29
  • 30. Event Classification on Twitter Cluster-level features Social interaction Topic coherence Trending behavior Platform-centric Event classifier Human-annotated training data SVM model (selected during training phase) 30
  • 31. Experimental Setup Classification accuracy Baseline: Naïve Bayes text classification (NB-Text) [Sankaranarayanan et al. 2009] 10-fold cross validation Blind test set of randomly chosen tweets Event surfacing: select top event clusters per hour Baselines Fastest-growing clusters per hour (Fastest) [Petrović et al. 2010] Randomly selected clusters per hour (Random) 5 hours, top-20 clusters per hour 31
  • 32. Identified Events 32 A sample of events identified by our classifiers on the test set
  • 33. Classification Performance (F-measure) RW-Event event classifier is more effective at discriminating between real-world events and rest of Twitter data 33
  • 34. NDCG@K Evaluation 34 Performance of event classifier and baselines for event surfacing task.
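For readers unfamiliar with the metric, a minimal NDCG@K sketch is shown here. It uses the common exponential-gain, log2-discount formulation, which is an assumption: the slides do not say which NDCG variant was used, and the relevance grades in the example are hypothetical.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances[:k]))

def ndcg(relevances, k):
    """Normalize DCG by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0

# Hypothetical relevance grades for the clusters a method surfaced, in rank order.
print(ndcg([3, 2, 1], k=3))
print(ndcg([1, 2, 3], k=3))
```

A ranking that places the most relevant clusters first scores 1.0, and the same items in a worse order score lower, which is what makes the metric rank-aware.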
  • 35. Identification and Characterization of Events in Social Media 35 Characterization of trending events Identification of trending events Similarity metric learning for trending events Identification of content for planned events Selection of event content
  • 36. Social Media Document Representation Title Description Tags Date/Time Location All-Text 36
  • 37. Social Media Document Similarity Document features: Title, Description, Tags, Date/Time, Location, All-Text Text: cosine similarity of tf-idf vectors (tf-idf version?; stemming?; stop word elimination?) Time: proximity in minutes Location: geo-coordinate proximity 37
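The time and location similarities named on this slide can be sketched as simple proximity scores in [0, 1]. The normalization constants (one year of minutes, half the Earth's circumference), the haversine distance, and the function names are assumptions chosen for illustration; the slide only states that time is compared in minutes and location by geo-coordinate proximity.

```python
import math

EARTH_RADIUS_KM = 6371.0

def time_similarity(t1_min, t2_min, window_min=60 * 24 * 365):
    """1.0 for identical timestamps, falling linearly to 0.0 at `window_min` apart."""
    gap = abs(t1_min - t2_min)
    return max(0.0, 1.0 - gap / window_min)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def location_similarity(a, b, max_km=math.pi * EARTH_RADIUS_KM):
    """1.0 for the same point, falling linearly to 0.0 at the antipode."""
    return max(0.0, 1.0 - haversine_km(*a, *b) / max_km)

nyc, boston, tokyo = (40.71, -74.01), (42.36, -71.06), (35.68, 139.69)
print(location_similarity(nyc, boston), location_similarity(nyc, tokyo))
print(time_similarity(0, 90))
```

Keeping every feature-specific similarity on the same [0, 1] scale makes the scores easier to combine later in an ensemble or a classifier.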
  • 38. Clustering Algorithm Many alternatives possible! [Berkhin 2002] Single-pass incremental clustering algorithm Scalable, online solution Using centroid representation Used effectively for Event identification in textual news [Allan et al. 1998] News event detection on Twitter [Sankaranarayanan et al. 2009] Does not require a priori knowledge of number of clusters Parameters: Similarity Function σ Threshold μ 38
  • 39. Cluster Representation and Parameter Tuning Centroid cluster representation Average tf-idf scores Average time Geographic mid-point Parameter tuning in supervised training phase Clustering quality metrics to optimize: Normalized Mutual Information (NMI) [Amigó et al. 2008] B-Cubed [Strehl et al. 2002] 39
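As an illustration of one of the clustering quality metrics listed here, the following sketch computes NMI between a clustering and ground-truth event labels. Normalizing by the arithmetic mean of the two entropies is an assumption (other normalizations exist), and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(clusters, truth):
    """Mutual information between two labelings, normalized by their mean entropy."""
    n = len(truth)
    joint = Counter(zip(clusters, truth))
    pc, pt = Counter(clusters), Counter(truth)
    mi = sum(
        (nij / n) * math.log(nij * n / (pc[c] * pt[t]))
        for (c, t), nij in joint.items()
    )
    denom = (entropy(clusters) + entropy(truth)) / 2
    return mi / denom if denom else 1.0

# A clustering that matches the event labels exactly, then one that is unrelated.
print(nmi([0, 0, 1, 1], ["a", "a", "b", "b"]))
print(nmi([0, 1, 0, 1], ["a", "a", "b", "b"]))
```

During parameter tuning, a score like this would be computed on held-out labeled data for each candidate threshold, keeping the setting that scores best.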
  • 40. Learning a Similarity Metric for Clustering Ensemble-based similarity Training a cluster ensemble Computing a similarity score by: Combining individual partitions Combining individual similarities Classification-based similarity Training data sampling strategies Modeling strategies 40
  • 41. Overview of a Cluster Ensemble Algorithm Clusterers: C_title, C_tags, C_time Weights: W_title, W_tags, W_time Learned in a training step Consensus function f(C,W): combine ensemble similarities Ensemble clustering solution 41
  • 42. Overview of a Cluster Ensemble Algorithm: Combining Partitions Clusterers: C_title, C_tags, C_time Weights: W_title, W_tags, W_time Consensus function f(C,W) 42
  • 43. Overview of a Cluster Ensemble Algorithm: Combining Similarities For each document d_i and cluster c_j: σ_Ctitle(d_i,c_j) > μ_Ctitle with weight W_title; σ_Ctags(d_i,c_j) > μ_Ctags with weight W_tags; σ_Ctime(d_i,c_j) > μ_Ctime with weight W_time Consensus function f(C,W) 43
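One way to read the "combining similarities" diagram is as a weighted vote: each feature-specific clusterer votes on whether a document belongs to a cluster by comparing its similarity to its own threshold, and the votes are combined using learned weights. The sketch below assumes that reading; the exact consensus function is not spelled out on the slide, and the feature names, weights, and thresholds are hypothetical values.

```python
def ensemble_score(sims, thresholds, weights):
    """Weighted fraction of clusterers whose similarity exceeds their threshold.

    sims, thresholds, and weights are dicts keyed by feature name
    (e.g., "title", "tags", "time"). Returns a value in [0, 1].
    """
    total = sum(weights.values())
    votes = sum(weights[m] for m in weights if sims[m] > thresholds[m])
    return votes / total

# Hypothetical similarities of one document to one cluster, per feature.
sims = {"title": 0.8, "tags": 0.3, "time": 0.9}
thresholds = {"title": 0.5, "tags": 0.5, "time": 0.7}
weights = {"title": 0.5, "tags": 0.3, "time": 0.2}
print(ensemble_score(sims, thresholds, weights))
```

The combined score can then play the role of the similarity function in the single-pass clustering algorithm, with its own threshold tuned on training data.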
  • 44. Learning a Similarity Metric for Clustering Ensemble-based similarity Training a cluster ensemble Computing a similarity score by: Combining individual partitions Combining individual similarities Classification-based similarity Training data sampling strategies Modeling strategies 44
  • 45. Classification-based Similarity Metrics Classify pairs of documents as similar/dissimilar Feature vector Pairwise similarity scores One feature per similarity metric (e.g., time-proximity, location-proximity, …) Modeling strategies Document pairs Document-centroid pairs 45
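The classification-based approach can be sketched with a small logistic regression trained on pairwise similarity features, where the predicted probability of the "similar" class serves as the learned similarity score. This hand-written gradient-descent version is only an illustration: the training pairs are made up, the learning rate and epoch count are arbitrary, and the thesis's actual models and toolkits are not shown in the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        gw = [0.0] * len(w)
        gb = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def pair_similarity(features, w, b):
    """Probability that a pair is 'similar', used as the similarity score."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features)) + b)

# Hypothetical pairs: [time-proximity, location-proximity]; label 1 = same event.
X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
print(pair_similarity([0.85, 0.85], w, b), pair_similarity([0.15, 0.15], w, b))
```

For the document-centroid modeling strategy, the same feature vector would be built from a document and a cluster centroid instead of two documents.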
  • 46. Experiments: Alternative Similarity Metrics Ensemble-based techniques Combining individual partitions (ENS-PART) Combining individual similarities (ENS-SIM) Classification-based techniques Modeling: document-document vs. document-centroid pairs Logistic Regression (CLASS-LR), Support Vector Machines (CLASS-SVM) Baselines Title, Description, Tags, All-Text, Time-Proximity, Location-Proximity 46
  • 47. Experimental Setup Datasets: Upcoming >270K Flickr photos Event labels from the “upcoming” event database (upcoming:event=12345) Split into 3 parts for training/validation/testing LastFM >594K Flickr photos Event labels from last.fm music catalog (lastfm:event=6789) Used as an additional test set 47
  • 48. Clustering Accuracy over Upcoming Test Set All similarity learning techniques outperform the baselines Classification-based techniques perform better than ensemble-based techniques 48
  • 49. NMI: Clustering Accuracy over Both Test Sets Upcoming LastFM 49 NMI Similarity learning models trained on Upcoming data show similar trends when tested on LastFM data
  • 50. Identification and Characterization of Events in Social Media 50 Characterization of trending events Identification of trending events Similarity metric learning for trending events Identification of content for planned events Selection of event content
  • 51. Identifying Content for Planned Events Identify planned event documents given known event information User-contributed planned event records LastFM Events EventBrite Facebook Events Structured features (e.g., title, time, location) Challenging identification scenario Known event information is often inaccurate or incomplete Social media documents are brief and noisy 51
  • 52. Planned Event Record 52 Title Description Date/Time Venue City
  • 53. Approach for Known Identification Scenario Two-step query formulation strategy Precision-oriented queries using known event features Recall-oriented queries using retrieved content from precision-oriented queries Leverage cross-site content Identify event documents on each site individually Use event documents on one site to retrieve additional event documents on a different site 53
  • 54. Query Formulation Strategies Precision-oriented Queries: Combined event record features Phrase, bag-of-words, stop word elimination Examples: [“title”+”venue”], [title-no-stopwords+”city”] Recall-oriented Queries Frequency Analysis Frequent terms in the event’s retrieved content Infrequently found in Web documents Term Extraction 54
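The two precision-oriented query examples on this slide can be sketched as simple string builders over an event record. The record fields, the stop-word list, and the function names are hypothetical, and the event used below is an invented example rather than one from the evaluation set.

```python
# Tiny illustrative stop-word list (an assumption).
STOP_WORDS = {"the", "a", "an", "of", "and", "at", "in"}

def phrase(text):
    """Wrap text in double quotes so a search engine treats it as a phrase."""
    return f'"{text}"'

def strip_stop_words(text):
    """Bag-of-words form of the text with stop words removed."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

def precision_queries(event):
    """Build the two example queries: ["title"+"venue"] and [title-no-stopwords+"city"]."""
    return [
        f'{phrase(event["title"])} {phrase(event["venue"])}',
        f'{strip_stop_words(event["title"])} {phrase(event["city"])}',
    ]

# Hypothetical planned event record.
event = {"title": "The Night of Jazz", "venue": "Town Hall", "city": "New York"}
for q in precision_queries(event):
    print(q)
```

Documents returned by these restrictive queries would then feed the recall-oriented step, which mines frequent terms from the retrieved content.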
  • 55. Leveraging Cross-Site Content Build precision-oriented queries using planned event features Use precision-oriented queries to retrieve data from: Twitter Flickr YouTube Build recall-oriented queries using data from: Each site individually All sites collectively 55 Queries: [title+city] [title+venue] … Retrieved documents: tweet_1, tweet_2, …, tweet_n; photo_1, photo_2, …, photo_n; video_1, video_2, …, video_n
  • 56. Experimental Settings 60 planned events from EventBrite, LastFM, LinkedIn, and Facebook Corresponding social media documents Retrieved from Twitter, Flickr, and YouTube Ranked according to similarity to event record Techniques Precision: only precision-oriented queries MS: precision- and recall-oriented queries selected using Microsoft n-gram probability score RTR: precision- and recall-oriented queries selected using ratio of document frequency around the time of the event to document frequency in larger time window 56
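The RTR criterion described above (document frequency around the time of the event relative to document frequency in a larger time window) can be sketched as a ratio of two fractions. Using fractions of documents rather than raw counts, the whitespace tokenization, and the handling of terms absent from the wider window are all assumptions; the sample messages are invented.

```python
def doc_fraction(term, docs):
    """Fraction of documents that contain the term as a whitespace-separated token."""
    if not docs:
        return 0.0
    hits = sum(1 for d in docs if term in d.lower().split())
    return hits / len(docs)

def rtr_score(term, near_event_docs, wider_window_docs):
    """Ratio of the term's document frequency near the event to a wider window."""
    near = doc_fraction(term, near_event_docs)
    wide = doc_fraction(term, wider_window_docs)
    if wide == 0.0:
        return float("inf") if near > 0.0 else 0.0
    return near / wide

# Hypothetical messages posted near the event time and over a longer period.
near = ["gaga concert tonight", "gaga on stage", "traffic jam"]
wide = near + ["lunch time", "traffic again", "rainy day"]
print(rtr_score("gaga", near, wide), rtr_score("traffic", near, wide))
```

Terms with a high ratio are concentrated around the event and are therefore better candidates for recall-oriented queries than terms that are common all the time.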
  • 57. NDCG Performance on Twitter 57 NDCG scores for top-k Twitter documents retrieved by Precision-oriented queries (Precision), and query strategies using Twitter data (Twitter-RTR, Twitter-MS).
  • 58. Cross-Site NDCG Performance 58 NDCG scores for top-k YouTube documents retrieved by Precision-oriented queries (Precision), and query strategies using data from Twitter (Twitter-MS) and YouTube (YouTube MS).
  • 59. Identification and Characterization of Events in Social Media 59 Characterization of trending events Identification of trending events Similarity metric learning for trending events Identification of content for planned events Selection of event content
  • 60. Event Content Selection 60 Tiger Woods to make a public apology Friday and talk about his future in golf. Tiger woods y'all,tiger woods y'all,ah tiger woods y'all Tiger Woods Apology Tiger Woods Hugs: http://tinyurl.com/yhf4uzw Tiger Woods Returns To Golf - Public Apology http://bit.ly/9Ui5jx Wedge wars upstage Watson v Woods: BBC Sport (blog)
  • 61. Event Content Selection Challenges: Document clusters contain noise Relevant documents might have poor quality text Relevant, high quality documents might not be interesting For each document and a given event evaluate Quality Relevance Usefulness 61
  • 62. Centrality Based Document Selection Centroid Cosine similarity of each document to cluster centroid Degree Documents are nodes Documents are connected if their similarity is above a threshold Compute degree centrality of each node LexRank [Erkan and Radev 2004] Same graph structure as Degree method Central documents are similar to other central documents 62
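The Degree and LexRank methods can be sketched on a precomputed document similarity matrix. This is a simplified illustration: the PageRank-style power iteration with a damping factor of 0.85, the self-links, the fixed iteration count, and the similarity values are assumptions, not the settings used in the thesis.

```python
def adjacency(sim, threshold):
    """Neighbors of each document: those with similarity >= threshold, plus itself."""
    n = len(sim)
    return [[j for j in range(n) if i == j or sim[i][j] >= threshold] for i in range(n)]

def degree_centrality(sim, threshold):
    """Number of other documents each document is connected to."""
    return [len(nbrs) - 1 for nbrs in adjacency(sim, threshold)]

def lexrank(sim, threshold, damping=0.85, iters=50):
    """Power iteration: a node's score is shared equally among its neighbors."""
    adj = adjacency(sim, threshold)
    n = len(adj)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [
            (1 - damping) / n + damping * sum(p[v] / len(adj[v]) for v in adj[u])
            for u in range(n)
        ]
    return p

# Hypothetical symmetric similarities: document 0 is close to all others.
sim = [
    [1.0, 0.6, 0.6, 0.6],
    [0.6, 1.0, 0.1, 0.1],
    [0.6, 0.1, 1.0, 0.1],
    [0.6, 0.1, 0.1, 1.0],
]
print(degree_centrality(sim, 0.5))
print(lexrank(sim, 0.5))
```

The top-scoring documents under either method would be the ones selected to represent the event cluster.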
  • 63. Experimental Methodology: Content Selection 50 event clusters Randomly selected 5 top tweets per event for each: Centroid, Degree, LexRank Labeled on a 1-4 scale Quality: excellent (4) to poor (1) Relevance: clearly relevant (4) to not relevant (1) Usefulness: clearly useful (4) to not useful (1) 63
  • 64. Content Selection Results Average scores over all events (out of 4) High quality and relevance (>3) for both Degree and Centroid Centroid only method with high usefulness 64
  • 65. Conclusions Techniques for identifying, characterizing, and selecting social media content for events There are significant differences between types of trends in social media, specifically trending events and non-event trends Trending events and their associated social media documents can be effectively identified using online clustering with: A classification step to separate event and non-event content Social media document similarity metrics for documents with rich context features A two-step query formulation technique is useful for identifying planned events across different social media sites Centrality-based techniques can be used to select high quality, relevant, and useful social media event content 65
  • 66. Future Work Clustering framework optimization Blocking techniques Topic models Identify unknown events with learned similarity metrics across sites Improve breadth of event content Rank events for search and presentation Extension of content selection techniques Learned ranking models 66
  • 67. Publications Hila Becker, Dan Iter, Mor Naaman, Luis Gravano, “Identifying Content for Planned Events Across Social Media Sites,” under submission. Hila Becker, Mor Naaman, Luis Gravano, “Beyond Trending Topics: Real-World Event Identification on Twitter,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), short paper. Hila Becker, Mor Naaman, Luis Gravano, “Selecting Quality Twitter Content for Events,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), short paper. Hila Becker, Feiyang Chen, Dan Iter, Mor Naaman, Luis Gravano, “Automatic Identification and Presentation of Twitter Content for Planned Events,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), demo paper. Mor Naaman, Hila Becker, Luis Gravano, “Hip and Trendy: Characterizing Emerging Trends on Twitter,” in Journal of the American Society for Information Science and Technology. Hila Becker, Mor Naaman, Luis Gravano, “Learning Similarity Metrics for Event Identification in Social Media,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10), 291-300. Hila Becker, Bai Xiao, Mor Naaman and Luis Gravano, “Exploiting Social Links for Event Identification in Social Media,” in Proceedings of the 3rd Annual Workshop on Search in Social Media (SSM '10), poster paper. Hila Becker, Mor Naaman, Luis Gravano, “Event Identification in Social Media,” in Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB '09), 2009. 67

Editor's Notes

  1. The problems we are trying to solve in this thesis
  2. Challenges and opportunities
  3. What’s been done in the space, very very very briefly, to introduce our known vs. unknown division
  4. Explain that we work in real-time (for the most part) and say we divide the space into unknown and known identification scenarios, then mention the type of event we focus on for each. Also briefly mention that, as we discuss in the thesis, these are not disjoint
  5. Contributions in order, broken down into identification scenarios (more or less).
  6. Contributions in order, broken down into identification scenarios (more or less).
  7. Contributions in order, broken down into identification scenarios (more or less).
  8. Contributions in order, broken down into identification scenarios (more or less).
  9. Before we identify trending events, we asked ourselves what types of trending events exist in social media and how are they different from non-event trends that exhibit similar temporal behavior
  10. Over 200 million users
  11. Give a brief example of each feature type.
  12. These will help guide our features for event classification next

  13. Leader-follower? Dropping old clusters, merging clusters, etc. for future work
  14. NDCG is a precision-based metric that takes rank into account
  15. Bringing it back to point out the parameters
  16. This is an outline for the similarity metric learning discussion
  17. This is an outline for the similarity metric learning discussion
  18. Get the upcoming page for the event with the photo thumbnails , show the machine tag
  19. TAGS – ALMOST AS GOOD AS ALL-TEXT
  20. LexRank: method for extractive summarization. Central nodes are connected to other central nodes; each node has a centrality value that it distributes to connected nodes
  21. Play with thresholds!
  22. …just the ones that went into this thesis