video link => http://youtu.be/D9PBX8FmtpQ
A tweet classifier that assigns each tweet to one of these six categories:
Business
Politics
Music
Health
Sports
Technology
2. INTRODUCTION
The tweet classification model assigns each input tweet to one of six genres:
politics, sports, music, technology, health, or business.
The model was trained on a predefined set of tweets.
Based on this trained model, the classifier decides which class
a test tweet belongs to.
3. APPROACHES
• The first challenge was to collect a suitable set of tweets to
use for training the model.
• The next step was to identify a set of keywords for each category; tweets
were then fetched based on those keywords.
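The keyword-based collection step can be sketched roughly as follows. This is only an illustration: the keyword lists are hypothetical (the project's actual keywords are not listed here), and the sketch filters tweets that are already in memory rather than fetching them from Twitter.

```python
import re

# Hypothetical keyword lists; the project's real keywords are not given here.
CATEGORY_KEYWORDS = {
    "Business": {"stocks", "market", "startup"},
    "Politics": {"election", "senate", "minister"},
    "Music": {"album", "concert", "song"},
    "Health": {"diet", "vaccine", "fitness"},
    "Sports": {"football", "cricket", "tennis"},
    "Technology": {"android", "software", "microsoft"},
}

def tokenize(text):
    """Lowercase a tweet and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def label_by_keywords(tweet):
    """Return the categories whose keyword set shares a word with the tweet."""
    words = set(tokenize(tweet))
    return [cat for cat, kws in CATEGORY_KEYWORDS.items() if words & kws]

def build_training_set(tweets):
    """Keep only tweets matching exactly one category, paired with that label."""
    labelled = []
    for tweet in tweets:
        cats = label_by_keywords(tweet)
        if len(cats) == 1:
            labelled.append((tweet, cats[0]))
    return labelled
```

Dropping tweets that match more than one category is a design choice of this sketch, made to keep the training labels unambiguous.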
Two approaches were used:
1) Naive Bayes
2) SVM (Support Vector Machine)
The performance of the two algorithms was then compared.
4. NAÏVE BAYES MODEL
• A high-dimensional dense vector is constructed for each tweet.
• The vector has one dimension for each unique word in the training tweets.
• Each word is treated as a feature.
• The features are assumed to be independent of each other and to contribute equally
to the classification of any tweet.
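The project used WEKA for Naive Bayes. The sketch below is not WEKA's implementation; it is a minimal from-scratch illustration of the idea, assuming a multinomial model with add-one (Laplace) smoothing and whitespace tokenization. The training tweets are made-up toy data.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_tweets):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_tweets:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def classify(text, model):
    """Pick the class with the highest log posterior, treating words as independent."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log P(class)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # log P(word | class) with add-one smoothing (an assumption here)
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration only.
training = [
    ("new album released today", "Music"),
    ("concert tickets on sale", "Music"),
    ("new phone software update", "Technology"),
    ("search assistant comes to the web", "Technology"),
]
model = train_naive_bayes(training)
print(classify("lady gaga released a new album", model))
```

Because each word contributes its own factor to the product, a tweet is scored only on words that appeared in training, which is the independence assumption described above.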
5. SUPPORT VECTOR MACHINE
• A high-dimensional dense vector is constructed for the input tweet.
• A multiclass variant of the SVM model was created, since tweets must be classified into more than two classes.
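One common way to build a multiclass classifier from binary SVMs is one-vs-rest. The sketch below assumes that scheme, a linear kernel, and plain subgradient descent on the hinge loss; none of these are confirmed as the project's actual setup (LIBSVM itself uses a one-against-one scheme). The learning rate, regularization strength, and toy vectors are arbitrary illustrative values.

```python
def train_binary_svm(vectors, targets, epochs=200, eta=0.1, lam=0.01):
    """Train a linear SVM (hinge loss + L2 penalty) by subgradient descent.

    targets are +1 / -1. Returns (weights, bias).
    """
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, targets):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Shrink the weights slightly (gradient of the L2 penalty).
            w = [wi * (1.0 - eta * lam) for wi in w]
            if margin < 1.0:
                # The example is inside the margin: move the boundary away from it.
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def train_one_vs_rest(vectors, labels):
    """Train one binary SVM per class: that class (+1) against all others (-1)."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1 if label == cls else -1 for label in labels]
        models[cls] = train_binary_svm(vectors, targets)
    return models

def predict(models, x):
    """Return the class whose binary SVM gives the largest decision value."""
    def score(cls):
        w, b = models[cls]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=score)

# Toy word-count vectors, invented for illustration only.
X = [
    [1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0],   # Music
    [0, 0, 1, 1, 0, 0], [0, 0, 0, 1, 0, 0],   # Sports
    [0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 0],   # Health
]
y = ["Music", "Music", "Sports", "Sports", "Health", "Health"]
models = train_one_vs_rest(X, y)
print(predict(models, [1, 1, 0, 0, 0, 0]))
```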
Feature Selection
Here each word in the tweet is taken as an independent feature that contributes to
the decision of which class the tweet is assigned to.
We use the unigram approach in this technique.
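A unigram feature vector can be built along these lines. This is a sketch under assumptions: whitespace tokenization, lowercasing, and word counts as feature values are choices made here, not details confirmed by the project.

```python
def build_vocabulary(tweets):
    """Map each unique lowercase word in the training tweets to a column index."""
    vocab = {}
    for tweet in tweets:
        for word in tweet.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def to_vector(tweet, vocab):
    """Return a dense count vector; words outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for word in tweet.lower().split():
        idx = vocab.get(word)
        if idx is not None:
            vec[idx] += 1
    return vec

vocab = build_vocabulary(["new album out", "new phone"])
print(to_vector("new new tablet", vocab))
```

Here "tablet" is not in the vocabulary, so it contributes nothing to the vector; only the two occurrences of "new" are counted.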
Tools/libraries used
LIBSVM: Used to scale the training and test files.
WEKA: Used to implement Naive Bayes classification.
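Scaling the training and test files consistently means learning the feature ranges from the training data only and reusing them on the test data. The sketch below illustrates that idea with simple min-max scaling to [-1, 1]; it is not LIBSVM's own code, and the target range and the handling of constant features are assumptions of this sketch.

```python
def fit_scaler(train_vectors):
    """Record the per-feature minimum and maximum seen in the training vectors."""
    columns = list(zip(*train_vectors))
    return [(min(col), max(col)) for col in columns]

def scale(vectors, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature to [lower, upper] using the training ranges."""
    scaled = []
    for vec in vectors:
        row = []
        for value, (lo, hi) in zip(vec, ranges):
            if hi == lo:
                row.append(0.0)  # constant feature: set to 0 (a choice of this sketch)
            else:
                row.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(row)
    return scaled

train = [[0, 2], [4, 2], [2, 2]]
test = [[6, 5]]
ranges = fit_scaler(train)            # learned from training data only
train_scaled = scale(train, ranges)
test_scaled = scale(test, ranges)     # same ranges reused for the test data
```

Test values can fall outside [-1, 1] when they exceed the range seen in training; that is expected when the same ranges are reused.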
6. OVERFITTING ISSUES
There is a high probability that this classification model is strongly biased
towards its training data. As a result, a tweet is likely to be classified
correctly when the words it uses were present in the
training set, but a tweet with a similar meaning that uses a different set of words
might not be assigned to the same class.
8. EXPERIMENTS AND RESULTS
• The model was evaluated on test data held out from
the training data, and its accuracy was measured.
• The final result is a graph/chart showing how the user's tweets are distributed across the genres.
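The evaluation described above can be sketched as follows. The 20% test fraction, the fixed random seed, and the function names are assumptions made for illustration; the project's actual split and chart code are not described here.

```python
import random
from collections import Counter

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle labelled examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

def category_distribution(predicted):
    """Count predictions per category, e.g. as input for a bar chart."""
    return Counter(predicted)

train_part, test_part = train_test_split(list(range(10)))
predicted = ["Music", "Sports", "Music"]
actual = ["Music", "Health", "Music"]
print(accuracy(predicted, actual))
print(category_distribution(predicted))
```

The counts returned by `category_distribution` are the kind of per-genre totals a chart of a user's tweets would plot.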
9. Tweet : microsoft 's cortana assistant personalization comes to bing on the web
Result : Technology Class (Naïve Bayes Model)
10. Tweet : Lady Gaga released a new album
Result : Music Class (SVM model)
11. CONCLUSION
Using the approaches described above (SVM and Naïve Bayes), tweets are
classified into their respective categories with a low error rate.
12. REFERENCES
•A Machine Learning Approach to Twitter User Classification by Marco
Pennacchiotti and Ana-Maria Popescu
http://coitweb.uncc.edu/~anraja/courses/SMS/SMSBib/2886-14198-1-PB.pdf
•Short Text Classification in Twitter to Improve Information Filtering by Bharath
Sriram, David Fuhry, Engin Demir, Hakan Ferhatosmanoglu
http://www.cs.bilkent.edu.tr/~hakan/publication/TweetClassification.pdf
•Twitter Trending Topic Classification by Kathy Lee, Diana Palsetia, Ramanathan
Narayanan, Md. Mostofa Ali Patwary, Ankit Agrawal, and Alok Choudhary
http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
•Analysis and Classification of Twitter messages by Christopher Horn
http://know-center.tugraz.at/wp-content/uploads/2010/12/Master-Thesis-Christopher-Horn.pdf