Topic Modeling in Social Media
Faculty Mentor: Dr. Vasudeva Varma
Student Mentor: Sandeep Panem
Kashyap Murthy, Krishnakant Vishwakarma, Sajal Sharma, Vinil Narang
Topic Modeling
• A topic model is a statistical model for discovering the abstract "topics" that occur in a collection of documents.
LDA (Latent Dirichlet Allocation)
• LDA represents documents as mixtures of topics, where each topic emits words with certain probabilities.
• LDA is among the most widely used topic models.
• On social platforms, LDA can be used to build a model that predicts which topics a user prefers to read and write about.
LDA …
• LDA still treats each document as a bag of words, but it explains those words through latent topics rather than through raw word counts alone.
• It describes a generative process: for each document, we draw a topic mixture from a Dirichlet distribution. Then, for each word position, we draw a topic from that mixture and then a word from that topic.
• The result is a generative model that describes, in terms of topics, the process by which a corpus of documents was created.
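The generative process described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names, the toy vocabulary, and the fixed topic-word probabilities are all assumptions made for the example (full LDA also draws each topic's word distribution from a Dirichlet prior, which is skipped here).

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following the LDA generative story.

    topic_word[k][i] is the (assumed, fixed) probability of vocab[i]
    under topic k; alpha is the Dirichlet parameter of the topic mixture.
    """
    # Step 1: draw this document's topic mixture.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Step 2: draw a topic from the mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # Step 3: draw a word from that topic.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words


if __name__ == "__main__":
    vocab = ["goal", "match", "vote", "election"]
    topic_word = [
        [0.5, 0.5, 0.0, 0.0],  # hypothetical "sports" topic
        [0.0, 0.0, 0.5, 0.5],  # hypothetical "politics" topic
    ]
    theta, words = generate_document(
        topic_word, vocab, [0.5, 0.5], 10, random.Random(0)
    )
    print(theta, words)
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic mixtures and the topic-word distributions that most plausibly produced them.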
Approaches for topic modeling
• Approach 1: Fit an LDA model for each user in the network.
• Approach 2: Find the top k influential users and apply LDA to these users only.
• Approach 3: Find the communities present in the network and approximate each user's topics with the LDA model of that user's community.
Comparing different approaches
• Approach 1:
• Fitting LDA for each user is very costly because the data set is very large.
• This approach is therefore not feasible in practice.
Comparing different approaches
• Approach 2:
• Since LDA cannot be applied to every user, we selected the top k users and applied LDA to them.
• The top k users were found using the PageRank algorithm.
• We then extracted the tweets of these k users and applied LDA to them.
• For this project we followed Approach 2.
Approach 2…
• Step 1:
• Input: A file containing the follower-followee relationships as an edge list. It contains ~6.25 crore (~62.5 million) nodes (users) and ~146 crore (~1.46 billion) edges (follower-followee relationships).
• Output: A file containing the user ID and PageRank of every node (user) in the graph.
• Approach: We used GraphChi's PageRank algorithm for this task.
• Step 2:
• Input: A file containing the PageRank and user ID of every node (user) in the graph.
• Output: A file containing the PageRanks and user IDs of all users, sorted by PageRank.
• Approach: We used an external merge sort because the file to be sorted was 30 GB.
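Steps 1 and 2 can be sketched on a toy scale as follows. This is an illustrative sketch only, not GraphChi's implementation: it uses a plain in-memory power iteration for PageRank and a chunk-and-merge sort for the ranking file. The function names, the damping factor of 0.85, the iteration count, the chunk size, and the whitespace-separated "userid rank" line format are all assumptions.

```python
import heapq
import os
import tempfile
from itertools import islice


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (follower, followee) edges."""
    nodes = {user for edge in edges for user in edge}
    out_links = {user: [] for user in nodes}
    for follower, followee in edges:
        out_links[follower].append(followee)
    n = len(nodes)
    rank = {user: 1.0 / n for user in nodes}
    for _ in range(iterations):
        # Rank held by users who follow nobody is spread over all users.
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {user: base for user in nodes}
        for u in nodes:
            if out_links[u]:
                share = damping * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new_rank[v] += share
        rank = new_rank
    return rank


def _neg_rank(line):
    """Sort key giving descending PageRank for a 'userid rank' line."""
    return -float(line.split()[1])


def external_sort_by_rank(in_path, out_path, chunk_size=100_000):
    """Sort a large 'userid rank' file using bounded memory.

    Phase 1 sorts fixed-size chunks and writes each to a temp file.
    Phase 2 streams a k-way merge of the sorted chunks to out_path.
    """
    chunk_paths = []
    with open(in_path) as src:
        while True:
            chunk = [
                ln if ln.endswith("\n") else ln + "\n"
                for ln in islice(src, chunk_size)
            ]
            if not chunk:
                break
            chunk.sort(key=_neg_rank)
            fd, path = tempfile.mkstemp(suffix=".chunk", text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)
    runs = [open(path) for path in chunk_paths]
    try:
        with open(out_path, "w") as dst:
            dst.writelines(heapq.merge(*runs, key=_neg_rank))
    finally:
        for run in runs:
            run.close()
        for path in chunk_paths:
            os.remove(path)
```

The chunk size bounds memory use: only one chunk is held in memory while sorting, and the merge phase reads one line per chunk file at a time. For a graph of the size described above, the in-memory PageRank here would not fit on a typical machine, which is the reason a disk-based engine such as GraphChi was used.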
Approach 2…
• Step 3:
• Input: A file containing the PageRanks and user IDs of all nodes, sorted by PageRank.
• Output: A file containing the top 50 users (user IDs and PageRanks).
• Step 4:
• Input: A file containing the top 50 users.
• Output: The tweets of these top 50 users.
• Approach: We used a tweet crawler to collect the tweets of these users.
• Step 5:
• Input: 50 files containing the tweets of the top 50 users.
• Output: LDA models for these 50 users.
• Approach: We used Mallet for this task.
Approach 2…
Problems:
• We encountered a case in which all of the influential users could belong to a single community.
• LDA on the top k influential users then does not yield a general model, so this technique is neither robust nor effective in such scenarios.
• We therefore had to move on to the next approach: community detection before applying LDA.
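The failure case above can be checked for before running LDA. The sketch below is a hypothetical illustration: it assumes a `community_of` mapping from user ID to community label is already available from some community detection step, and the function name is invented for this example.

```python
from collections import Counter


def largest_community_share(top_users, community_of):
    """Return (community, share) for the community holding most top users.

    community_of is an assumed mapping from user ID to community label.
    A share close to 1.0 means the top-k users are concentrated in a
    single community, so their topics may not generalise to the network.
    """
    counts = Counter(community_of[user] for user in top_users)
    community, size = counts.most_common(1)[0]
    return community, size / len(top_users)


if __name__ == "__main__":
    community_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 2}
    print(largest_community_share(["a", "b", "c", "d"], community_of))
```

In this toy input, three of the four top users share community 0, so the reported share is 0.75.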
Approach 3… and future work:
• Since Approach 2 was not giving the expected results, we tried the community detection approach.
• The GraphChi community detection code is still in an early phase of development and is not giving the expected results.
• Another API, Infomap, tends to work well for small graphs, but the code fails on large graphs.
• We are still working on Approach 3, aiming for better accuracy and results.
Thank you
