Group-13 Project 15 Sub event detection on social media

SUB EVENT DETECTION
ON SOCIAL MEDIA
Kshitij Kansal

Maaz Anwar Nomani

Ahmed Ali Durga
Information Retrieval and Extraction

INTRODUCTION
1. Motivation

• Social Media is filled with a lot of information.

• Information is shared much before the news gets displayed on the
news websites.

• The information shared captures even the minute details which
news websites might overlook.

• This gives us a lot of scope for early news detection with
finer details.

2. Objective

• We aim to propose an automatic method for extracting Sub-Events
from given Social Media feeds.

SUB EVENT
• What is a Sub Event?

• Any piece of information which is small enough to be conveyed as a part of
the whole event,

• yet large enough to affect an appreciably large community of readers.

• Includes aftermath of an event, real time notifications, responses,
public sentiments and reports.

• Why Sub Event?

• Closely related to a particular community.

• Can be used to enhance the knowledge of an event.

• Can measure the public sentiments along the whole course of
occurrence of the event.

OUR EXPERIMENT
• Detecting the "sub events" in the Twitter Stream related to the
US Presidential Elections.
• Main Event: US Presidential Elections and the Victory of
Barack Obama.
• Sub Events: Victories or defeats of some famous candidates,
public sentiments across the course of the elections, changes in the
stock market as the trends start to pour out, etc.
• The approach decided on is not specific to this dataset only. It
can be applied to any dataset in the form of a Twitter stream.

APPROACH
We followed an organised approach where we divided the
whole process into the following three sub parts, which
were dealt with separately and later integrated.

• Tweet filtering and Noise Reduction

• Sub Event Detection

• Sub Event Summarization

TWEET FILTERING AND NOISE REDUCTION
Aim: To eliminate the useless tweets which do not convey much
information regarding the event.
• The Tweet Stream provided is cleaned using a self-defined filter.
• The filter takes into account the linguistic aspects of the language and
context filtering.
• Remove Diacritic marks
• Consider only ASCII characters
• Ignore repetitions
• Ignore Multiple Punctuations
• Consider only tweets starting with capitals
• Remove extremely small and large tweets
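The filter rules above can be sketched roughly as follows. This is a minimal illustration and not the project's actual filter: the length thresholds `MIN_LEN` and `MAX_LEN` are assumed values, "repetitions" is interpreted here as duplicate tweets, and "multiple punctuations" is handled by collapsing repeated marks rather than dropping the tweet.

```python
import re
import unicodedata

# Assumed thresholds (in characters); the slides do not state the real ones.
MIN_LEN, MAX_LEN = 20, 140


def strip_diacritics(text):
    """Decompose characters and drop the combining marks (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_tweet(text):
    """Apply the character-level rules to a single tweet."""
    text = strip_diacritics(text)
    # Keep ASCII characters only.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of the same punctuation mark ("!!!" -> "!").
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    # Normalise whitespace.
    return re.sub(r"\s+", " ", text).strip()


def filter_stream(tweets):
    """Return the cleaned tweets that pass every rule, without duplicates."""
    seen = set()
    kept = []
    for raw in tweets:
        tweet = clean_tweet(raw)
        # Remove extremely small and extremely large tweets.
        if not (MIN_LEN <= len(tweet) <= MAX_LEN):
            continue
        # Consider only tweets starting with a capital letter.
        if not tweet[0].isupper():
            continue
        # Ignore repetitions (case-insensitive duplicates, by assumption).
        key = tweet.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(tweet)
    return kept
```

Calling `filter_stream` on a list of raw tweet strings returns the cleaned tweets in their original order, ready for the extraction module.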

SUB EVENT EXTRACTION
Aim: To extract tweets that express some defining moments in the
event.
• To be applied on the filtered stream available from the noise
reduction module.
• Build a dictionary of the tweet words and generate the Tweet Vectors.
• Find the distance between the tweets.
• Group together the similar tweets.
• Chunks of relevant tweets will form the sub events.
• Hashing of the tweet stream to increase the speed of the system
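One way the grouping step could look is sketched below. The slides do not say which clustering scheme was used, so the greedy single-pass "leader" grouping here is an assumption, as are the names `is_similar`, `group_tweets` and the `min_size` cut-off. Tweets are represented as sets of words, and two tweets count as similar when they share more than half the words of the smaller one.

```python
def is_similar(a, b, ratio=0.5):
    """True if the shared words exceed `ratio` of the smaller tweet's size."""
    if not a or not b:
        return False
    return len(a & b) > ratio * min(len(a), len(b))


def group_tweets(token_sets, min_size=2):
    """Greedy single-pass grouping (an assumed scheme, not from the slides).

    Each tweet joins the first group whose leader it resembles; otherwise
    it starts a new group. Groups smaller than `min_size` are discarded.
    Returns lists of tweet indices, one list per candidate sub event.
    """
    clusters = []  # list of (leader word set, member indices)
    for idx, tokens in enumerate(token_sets):
        for leader, members in clusters:
            if is_similar(leader, tokens):
                members.append(idx)
                break
        else:
            clusters.append((tokens, [idx]))
    return [members for _, members in clusters if len(members) >= min_size]
```

A single pass keeps the cost low for a stream, at the price of results that depend on the order in which tweets arrive.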

EXTRACTION ...
Dictionary Creation and Vector Generation
• Dictionary Creation:
• Bag-of-Words Representation.
• Stop Word Removal.
• Assign a unique ID to each word.
• Vector Generation:
• Create an n-dimensional vector for each tweet,
• where n is the number of words in the dictionary.
• Vector value = 1, if the word is present
• Vector value = 0, if the word is not present
• Create a sparse vector for space optimization.
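The dictionary and vector steps above might be sketched like this. The stop-word list and the tokenizer pattern are illustrative assumptions, and the sparse vector is stored simply as the set of word IDs whose value is 1.

```python
import re

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}


def tokenize(tweet):
    """Lower-case the tweet, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9#@']+", tweet.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_dictionary(tweets):
    """Assign a unique integer ID to every distinct word, in order seen."""
    dictionary = {}
    for tweet in tweets:
        for word in tokenize(tweet):
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary


def to_sparse_vector(tweet, dictionary):
    """Sparse binary vector: the set of IDs whose component equals 1."""
    return frozenset(dictionary[w] for w in tokenize(tweet) if w in dictionary)


def to_dense_vector(sparse, n):
    """Expand a sparse vector into the full n-dimensional 0/1 list."""
    return [1 if i in sparse else 0 for i in range(n)]
```

Keeping only the indices of the non-zero components matters because a tweet contains a handful of words while the dictionary can hold many thousands.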

EXTRACTION ...
Distance and Similarity Measures
• Euclidean Distance:
• Simple distance between the tweet vectors.
• Similar to finding the distance between points in n-dimensional space,
• n being the size of the Tweet dictionary.
• Similarity Measure:
• Count the number of words the two tweets have in common.
• If the count is greater than some threshold, assume the tweets to be similar.
• Threshold (in our case): 50% of the length of the smaller tweet.
• Takes into account the length of the tweets, i.e. Normalization.
• Cosine Similarity:
• Similar to the above method.
• Also takes into account the length, i.e. Normalization.
• Works by finding the angle between the two tweet vectors.
• Tweets are taken to be points in n-dimensional space.
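For binary tweet vectors, the three measures above reduce to simple set operations. The sketch below assumes each tweet is given as the set of dictionary IDs of its words; the function names are illustrative, and the 50% ratio is the threshold quoted above.

```python
import math


def euclidean_distance(a, b):
    """Distance between two binary vectors stored as sets of word IDs.

    Components differ exactly on the symmetric difference, so the squared
    distance is the number of words found in only one of the two tweets.
    """
    return math.sqrt(len(a ^ b))


def overlap_similar(a, b, ratio=0.5):
    """Common-word rule: similar if shared words exceed ratio * smaller size."""
    if not a or not b:
        return False
    return len(a & b) > ratio * min(len(a), len(b))


def cosine_similarity(a, b):
    """Cosine of the angle between two binary vectors.

    The dot product is the number of shared words and each norm is the
    square root of the tweet's word count.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))
```

Euclidean distance grows with tweet length even when two tweets share most of their words, which is why the two normalized measures are usually the more useful ones for short texts.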

EXTRACTION ...
Hashing
• Increases the speed of retrieval module
• Locality Sensitive Hashing
• Dimensionality reduction of high-dimensional data
• Maximizes the probability of collision of similar
tweets.
• PyLucene
• Python extension for using Java Lucene
• Apache Lucene is a free/open source
information retrieval software library
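To illustrate the idea of locality sensitive hashing, here is a small sketch using random hyperplanes, a common LSH family for cosine similarity. The slides do not say which LSH family the project used, so this choice, along with the names `make_hyperplanes`, `signature` and `bucket_tweets`, is an assumption.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_dims, n_bits, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normals) in n_dims dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_bits)]


def signature(sparse, hyperplanes):
    """Short bit signature of a binary vector stored as a set of word IDs.

    Each bit records on which side of one hyperplane the vector lies.
    Vectors separated by a small angle agree on most bits.
    """
    return tuple(
        1 if sum(plane[i] for i in sparse) >= 0 else 0 for plane in hyperplanes
    )


def bucket_tweets(sparse_vectors, hyperplanes):
    """Group tweet indices by signature; only bucket-mates need comparing."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(sparse_vectors):
        buckets[signature(vec, hyperplanes)].append(idx)
    return buckets
```

With this scheme, the pairwise distance computation only runs inside each bucket instead of over the whole stream. Fewer bits give larger buckets and fewer missed pairs; more bits give smaller buckets and faster lookups.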

SUMMARIZATION
• Related tweets are extracted and stored in separate files.
• The sub event needs to be extracted from these related tweets.
• Some kind of summarization of the collected tweets is required.
• The summary needs to be in human-readable form.
• It should be able to convey the happenings in the sub event.
• If possible, crawl data from the URLs linked in the tweets and use it for
summarization.
• Image support will increase its attractiveness and user
acceptability.

SUMMARIZATION ...
• Important for the end user evaluation.
• Thus, summarization forms the crux of the content defined by a
sub-event.
• Two approaches to automatic summarization
• Extraction: Works by selecting a subset of existing words,
phrases, or sentences in the original text to form the summary
• Abstraction: Works by building an internal semantic representation and
then using natural language generation techniques to create a
summary that is closer to what a human might generate

SUMMARIZATION ...
• Spanning Phrase approach is used.
• Takes into account the most frequent words in the
cluster of tweets and clubs them together.
• Choose 'w' to be the word with the maximum frequency of
occurrence across all the tweets.
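The frequent-word step could be sketched as below. This covers only the counting and "clubbing" of frequent words, not a complete spanning-phrase summarizer; the stop-word list, the `top_k` parameter and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}


def frequent_words(tweets, top_k=3):
    """Return the top_k words ranked by the number of tweets containing them.

    Each word is counted once per tweet; ties are broken alphabetically
    so that the result is deterministic.
    """
    counts = Counter()
    for tweet in tweets:
        words = set(re.findall(r"[a-z0-9']+", tweet.lower())) - STOP_WORDS
        counts.update(words)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def club_frequent_words(tweets, top_k=3):
    """Join the most frequent words of a cluster into one short label."""
    return " ".join(frequent_words(tweets, top_k))
```

The first word returned by `frequent_words` plays the role of the maximum-frequency word for the cluster; a fuller summarizer would then grow a readable phrase around it.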
