Twitter Sub-event Detection Project Presentation

  1. Project: Sub-event Detection on Social Media • Codebase: https://github.com/pallavshah/TwitterSubeventDetector • Pallav Shah, Akshay Joshi, Rajat Bhardwaj, Ravneet Singh Kathuria
  2. The Project • Build a timeline (summary) of an event from a corpus of tweets commenting on it. • The corpus consists of tweets from a specific domain that discuss a single major event. • The objective is to extract the sub-events within that event. • The summary of each sub-event is a short description of it.
  3. Our Approach We followed a two-step approach: • Sub-event Detection: The first step is to identify if and when a sub-event has occurred and, if it has, which tweets make up that sub-event. • Tweet Selection: The second step is to choose a representative tweet that describes the sub-event appropriately. Together, these two steps yield a set of tweets that serves as a summary of the event.
  4. Part 1: Detecting the Sub-event Sub-events are detected by measuring the distance between tweets about the same event. • Dictionary of words: The parsed data is used to create a dictionary that stores the relevant words and their counts in the corpus. • Vector for each tweet: The generated dictionary and a second pass over the parsed data are used to build a single sparse vector for each tweet. This vector contains the id and count of each word present in the tweet.
  5. Part 1: Detecting the Sub-event (continued) • The sub-event detector module: The module uses the Python LSHash library to compute a similarity distance between tweets. Each tweet is compared with the existing groups of similar tweets. If its similarity to any group exceeds a high threshold, the tweet is assumed to belong to that group and is added to it. Otherwise, a new group is created with that tweet as its representative tweet. In the end, all the tweets are thus partitioned into groups (or clusters) representing different sub-events.
  6. Part 2: Summarization of the Sub-event • Term Frequency Inverse Document Frequency (TF-IDF): A statistical weighting technique that assigns each term within a document a weight that reflects the term’s saliency within the document. The TF-IDF value is composed of two primary parts. The term frequency component (TF) assigns more weight to words that occur frequently within a document, because important words are often repeated. The inverse document frequency component (IDF) down-weights words, such as common stop words, that are frequent across many documents. • Normalization of tweets: The tweets are normalized to prevent bias towards longer tweets.
  7. System Block Diagram
  8. Technologies Used We used the following Python libraries: • LSHash: https://pypi.python.org/pypi/lshash/0.0.3dev • Gensim: http://radimrehurek.com/gensim/ Dataset: We used the SNOW dataset, which contains tweets about the 2012 US general elections.
  9. Experiments and Results • Tested on the 2012 US general elections tweet data set from SNOW 2014. • The results were around 60% accurate when compared against a manual evaluation of the tweet data.
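
The dictionary and per-tweet vector steps from slide 4 can be sketched roughly as follows. This is a minimal, standard-library illustration and not the project's code: the stop-word list, the whitespace tokenizer, and the function names are all assumptions made for the example. Gensim's `corpora.Dictionary` with its `doc2bow` method produces the same kind of (id, count) pairs and is presumably closer to what the project used.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "to", "of", "in", "and"}

def tokenize(tweet):
    """Lowercase, split on whitespace, and drop stop words (no punctuation handling)."""
    return [w for w in tweet.lower().split() if w not in STOP_WORDS]

def build_dictionary(tweets):
    """First pass: give each word an integer id and count it over the corpus."""
    word_ids = {}
    counts = Counter()
    for tweet in tweets:
        for word in tokenize(tweet):
            if word not in word_ids:
                word_ids[word] = len(word_ids)
            counts[word] += 1
    return word_ids, counts

def to_sparse_vector(tweet, word_ids):
    """Second pass: sorted (word_id, count) pairs for the words in one tweet."""
    counts = Counter(word_ids[w] for w in tokenize(tweet) if w in word_ids)
    return sorted(counts.items())

tweets = [
    "Obama wins Ohio",
    "Romney concedes the election",
    "Obama wins the election",
]
word_ids, corpus_counts = build_dictionary(tweets)
vectors = [to_sparse_vector(t, word_ids) for t in tweets]
```

Only the words that actually occur in a tweet are stored, which keeps the vectors small even when the dictionary is large.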
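
The grouping procedure described in slide 5 amounts to a greedy, single-pass clustering. The sketch below illustrates that idea under stated assumptions: it substitutes an exact cosine similarity for the LSHash lookup the project used, compares each tweet only against a group's representative (its first member), and uses an arbitrary threshold of 0.5. The function names and the vector layout (`{word_id: count}` dicts) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {word_id: count} dicts."""
    dot = sum(c * v.get(i, 0) for i, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def cluster_tweets(vectors, threshold=0.5):
    """Greedy single-pass clustering.

    Each cluster is a list of tweet indices whose first entry is the
    representative. A tweet joins the most similar cluster whose
    representative meets the threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for idx, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, vectors[cluster[0]])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([idx])
        else:
            best.append(idx)
    return clusters

vectors = [
    {0: 1, 1: 1, 2: 1},  # "obama wins ohio"
    {3: 1, 4: 1, 5: 1},  # "romney concedes election"
    {0: 1, 1: 1, 5: 1},  # "obama wins election"
]
clusters = cluster_tweets(vectors, threshold=0.5)
```

Comparing every tweet against every representative is linear in the number of clusters per tweet; a locality-sensitive hash index such as LSHash avoids most of those comparisons by only retrieving candidates that hash to nearby buckets.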
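
For the tweet selection step in slide 6, the slides name TF-IDF weighting and length normalization but do not give the exact scoring rule. One plausible reading, shown here purely as an assumption, is to count term frequency over the tweets of a cluster, compute IDF over all tweets in the corpus, and divide each tweet's summed weight by its word count. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def idf_weights(all_tweets):
    """IDF over the whole corpus, treating each tokenized tweet as one document."""
    n_docs = len(all_tweets)
    doc_freq = Counter(word for tweet in all_tweets for word in set(tweet))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def pick_representative(cluster, idf):
    """Return the cluster tweet with the highest length-normalized TF-IDF score."""
    # Term frequency counted across all tweets in this cluster (an assumption).
    tf = Counter(word for tweet in cluster for word in tweet)

    def score(tweet):
        if not tweet:
            return 0.0
        total = sum(tf[w] * idf.get(w, 0.0) for w in tweet)
        # Dividing by the tweet length removes the bias towards longer tweets.
        return total / len(tweet)

    return max(cluster, key=score)

t0 = ["obama", "wins", "ohio"]
t1 = ["obama", "wins", "ohio", "wow", "so", "happy", "tonight"]
corpus = [
    t0,
    t1,
    ["romney", "concedes", "election"],
    ["romney", "speech", "starts"],
    ["polls", "close", "soon"],
]
idf = idf_weights(corpus)
best = pick_representative([t0, t1], idf)
```

In this toy corpus the longer tweet `t1` has the larger unnormalized sum, yet the shorter tweet `t0` is selected once scores are divided by length, which is the effect the normalization step is meant to have.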
