2. Group Members
● Prateek Mehta (201203006)
prateek.mehta@students.iiit.ac.in
● Saurabh Kanaujia (201305551)
saurabh.kanaujia@students.iiit.ac.in
● Nikita Nataraj (201101079)
nikita.nataraj@students.iiit.ac.in
3. Aim
● Apply topic-modelling techniques to social
media.
● Main focus: reduce the cost of computing an
LDA model over a social network, using a
technique that scales.
● Efficient representation and calculation of
the topics of the whole network.
4. Introduction
Topical categorization of blogs, documents or other objects that can be
tagged with text improves the experience for end users. When the set of
documents is very large and varies significantly from user to user,
calculating a single global topic model, or an individual topic model for each
and every user, can become very expensive in large-scale internet settings. To
implement topic modelling, we used LDA. Latent Dirichlet
allocation (LDA) is an unsupervised, probabilistic text clustering algorithm.
LDA defines a generative model that describes how documents
are generated given a set of topics and the words in those topics. We
chose LDA because it is more convenient for modelling a more human-like
corpus, in other words social media.
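The generative story that LDA assumes can be illustrated with a minimal sketch. It uses only the Python standard library; the vocabulary, the two topics, the Dirichlet parameter and the function names are hypothetical and chosen for illustration, not taken from the project.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, vocab, length, rng):
    """Generate one document following the LDA generative process."""
    # Per-document topic proportions: theta ~ Dirichlet(alpha)
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Choose a topic for this word position, then a word from that topic
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

# Hypothetical vocabulary and topic-word distributions (assumptions)
vocab = ["goal", "match", "vote", "election"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # a "sports"-like topic
    [0.0, 0.0, 0.5, 0.5],  # a "politics"-like topic
]

rng = random.Random(0)
theta, doc = generate_document(topic_word, [0.5, 0.5], vocab, 8, rng)
```

Training an LDA model runs this story in reverse: given only the observed words, it estimates the topic-word distributions and each document's topic proportions.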
5. Possible Approaches
1. Find an LDA model for each user in the network (very costly).
2. Find the top-K influential users and apply the LDA model to
these.
3. Classify communities and apply the LDA model
across communities.
We tried to implement Approaches 2 and 3.
6. Approach No. 3 Drawbacks
● The community detection is based on the bi-directional
follower-followee relationship. Only 22-23% of users on
Twitter have such a relationship, where they follow each
other.
● An implementation that finds communities based on the
uni-directional follower-followee relationship was neither
feasible nor scalable.
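The share of users with a bi-directional relationship can be measured on a follower graph roughly as follows. This is a sketch under assumptions: the edge list is made up, the function name is hypothetical, and "users with such a relationship" is interpreted as users who have at least one mutual follow, which the slides do not define precisely.

```python
def share_of_users_with_mutual_follow(edges):
    """Fraction of users that have at least one mutual follow.

    edges is an iterable of (follower, followee) pairs.
    """
    edge_set = set(edges)
    users = {u for edge in edge_set for u in edge}
    # A user counts if some edge (u, v) also exists in reverse as (v, u)
    mutual = {u for (u, v) in edge_set if u != v and (v, u) in edge_set}
    return len(mutual) / len(users) if users else 0.0

# Hypothetical toy graph: only a and b follow each other
edges = [("a", "b"), ("b", "a"), ("c", "a"), ("d", "b")]
share = share_of_users_with_mutual_follow(edges)
```

On this toy graph two of the four users have a mutual follow; community detection that relies on such edges ignores everyone else.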
7. Approach No. 2
Phase 1: Finding Influential Users
● Top-k users found using the PageRank algorithm from the
GraphChi API.
● Fetched tweets and the URLs embedded in them.
Metadata, tags and ids are also fetched.
● Crawled the URLs and summarized them.
● Tweet documents + URL summaries used as training data.
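The ranking step of Phase 1 can be sketched with a plain power-iteration PageRank. This is not the GraphChi API; it is a small standard-library illustration of the same idea, and the edge list, damping factor, iteration count and function names are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over (follower, followee) edges.

    Rank flows from a follower to the users they follow.
    """
    nodes = sorted({u for edge in edges for u in edge})
    out_links = {node: [] for node in nodes}
    for follower, followee in edges:
        out_links[follower].append(followee)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by users who follow nobody is spread evenly over all users
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {u: base for u in nodes}
        for u in nodes:
            if out_links[u]:
                share = damping * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new_rank[v] += share
        rank = new_rank
    return rank

def top_k_users(edges, k):
    """Return the k users with the highest PageRank score."""
    rank = pagerank(edges)
    return sorted(rank, key=rank.get, reverse=True)[:k]

# Hypothetical follower graph: (follower, followee)
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a"), ("d", "a")]
top = top_k_users(edges, 2)
```

In this toy graph user c is followed by three users and a by two, so they come out as the two most influential users.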
9. Approach No. 2
Phase 2: User Similarity
● Tweets and URLs are fetched. Each URL is summarised to
15-20 sentences.
● The Jaccard index is calculated to match the user with one of
the top users.
● Maximum Jaccard index implies that the user adopts the
topic distribution with the corresponding
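The matching step of Phase 2 can be sketched as follows. The slides do not state which sets the Jaccard index is computed over; this sketch assumes token sets built from each user's tweets and URL summaries. The user names, tokens and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def closest_top_user(user_tokens, top_user_tokens):
    """Return the top user whose token set has the highest Jaccard index."""
    return max(
        top_user_tokens,
        key=lambda name: jaccard(user_tokens, top_user_tokens[name]),
    )

# Hypothetical token sets for two top users and one ordinary user
top_user_tokens = {
    "top_user_1": {"football", "goal", "league", "match"},
    "top_user_2": {"election", "vote", "policy", "senate"},
}
user_tokens = {"match", "goal", "transfer"}

best = closest_top_user(user_tokens, top_user_tokens)
score = jaccard(user_tokens, top_user_tokens[best])
```

Only the top users need a trained LDA model; every other user costs one set comparison per top user.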