2. Automated text classification is a vital method for managing
and processing the vast and continuously growing volume of documents in
digital form. In general, text classification plays an important role in
information extraction and summarization, text retrieval, and question answering.
Objective:
Create an efficient support-vector machine (SVM) model for text
classification/categorization
Measure its performance
Problem Statement
3. Text classification (text categorization): Assign documents to one or more predefined categories
Applications of Text Classification
Organize web pages into hierarchies
Domain-specific information extraction
Sort email into different folders
Find interests of users
Common Methods
Manual classification
Automatic document classification
Supervised learning of document-label assignment function
Naive Bayes (simple, common method)
k-Nearest Neighbors (simple, powerful)
Support-vector machines (new, more powerful) and many more
Introduction
Example categories: Sport / Science / Theory / Art
4. Examples
Labels may be domain-specific and binary
e.g., "interesting-to-me" : "not-interesting-to-me"
e.g., "spam" : "not-spam"
e.g., "contains adult language" : "doesn't"
LABELS=TOPICS
“finance” / “sports” / “asia”
Given:
A description of an instance, x ∈ X, where X is the instance language or instance space.
E.g: how to represent text documents.
A fixed set of categories C = {c1, c2,…, cn}
Determine:
The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and
whose range is C.
LABELS=OPINION
“like” / “hate” / “neutral”
LABELS=AUTHOR
“Shakespeare” / “Marlowe” / “Ben Jonson”
Labels may be genres
e.g., "editorials" / "movie-reviews" / "news"
Assign labels to each document or web-page
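The Given/Determine setup above can be sketched as a minimal categorization function c: X → C in Python. The keyword lists and the overlap rule are invented purely for illustration; a real c(x) is learned from labeled data.

```python
# Minimal sketch of a categorization function c: X -> C.
# Categories follow the LABELS=TOPICS example; keyword sets are invented.
C = {"finance", "sports", "asia"}

KEYWORDS = {
    "finance": {"stock", "market", "bank"},
    "sports": {"match", "goal", "team"},
    "asia": {"china", "japan", "india"},
}

def c(x: str) -> str:
    """Map a document x (a string) to the category in C whose
    keyword set overlaps most with the document's words."""
    words = set(x.lower().split())
    return max(C, key=lambda label: len(words & KEYWORDS[label]))

print(c("the stock market and the bank"))   # finance
```

A learned classifier replaces the hand-written keyword rule, but the type of the function, document in, category out, stays the same.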
5. Decision Tree model
Decision Tree (DT):
A tree in which the root and each internal node are labeled with a question.
The arcs represent each possible answer to the associated question.
Each leaf node represents a prediction of a solution to the problem.
Popular technique for classification; Leaf node indicates class to which the corresponding
tuple belongs.
A Decision Tree Model is a computational model consisting of three parts:
Decision Tree
Algorithm to create the tree
Algorithm that applies the tree to data
Creation of the tree is the most difficult part.
Processing is basically a search similar to that in a binary search tree (although DT may not
be binary).
The decision tree approach to classification is to divide the search space into rectangular
regions. A tuple is classified based on the region into which it falls.
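The three parts of the model described above (the tree, its questions and arcs, and the algorithm that applies it to data) can be shown with a tiny hand-built tree. The questions and class labels here are invented for illustration; building the tree from data is the hard part the slide mentions and is omitted.

```python
# A hand-built decision tree: internal nodes hold a question,
# arcs are the possible answers, leaf nodes predict a class.
tree = {
    "question": lambda doc: "ball" in doc,          # root question
    "yes": {"leaf": "sports"},                      # leaf -> predicted class
    "no": {
        "question": lambda doc: "election" in doc,  # internal node
        "yes": {"leaf": "politics"},
        "no": {"leaf": "other"},
    },
}

def classify(node, doc):
    """Apply the tree to data: walk from the root to a leaf,
    following the arc matching each question's answer."""
    while "leaf" not in node:
        node = node["yes"] if node["question"](doc) else node["no"]
    return node["leaf"]

print(classify(tree, "the team kicked the ball"))   # sports
```

As the slide notes, applying the tree is a simple top-down search; the difficulty lies in choosing good questions when creating it.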
6. Naive Bayes Algorithm
Formula
The Naive Bayes algorithm works on conditional probability (Bayes' rule):
P(Ck|x) = P(Ck) · P(x|Ck) / P(x)
Since P(x) is the same for every class, only P(Ck) · P(x|Ck) needs to be compared.
Where P(Ck|x) – the probability that the tweet has positive/negative sentiment
P(Ck) – prior probability of the negative/positive dataframe
P(x|Ck) – probability of every word in the tweet given the positive or negative class
Where k – positive or negative
P(xi|Ck) – probability of each word in the bag of words, so P(x|Ck) = P(x1|Ck) · P(x2|Ck) · P(x3|Ck) · P(x4|Ck)
The sentiment with the highest probability value will be selected.
7. Logic behind the Model
• Suppose we have trained the model using an Excel file containing 10 tweets: 3 positive
tweets and 7 negative tweets.
• Probability (positive tweets) = 0.3; Probability (negative tweets) = 0.7
• Suppose our tweet is "I had an awesome experience", and the words in this tweet are
represented by x1, x2, x3, x4, x5.
Probability (Pos | x1 x2 x3 x4 x5) =
P(Pos)·P(x1|pos)·P(x2|pos)·P(x3|pos)·P(x4|pos)·P(x5|pos) ----------------------- (1)
Probability (Neg | x1 x2 x3 x4 x5) =
P(Neg)·P(x1|neg)·P(x2|neg)·P(x3|neg)·P(x4|neg)·P(x5|neg) --------------------- (2)
If (1) > (2), the tweet is classified as positive; otherwise, as negative.
Each word probability is estimated with Laplace smoothing: P(xi|Ck) = (Nk + 1) / (N + D)
Where Nk – number of times xi is repeated in the positive dataframe repository
N – total number of words in the positive dataframe repository, including repeats
D – total number of distinct words across the positive and negative dataframe repositories
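The calculation above can be worked through end to end in Python. The training tweets below are invented stand-ins for the Excel file (3 positive, 7 negative, matching the 0.3/0.7 priors), and the word probabilities use the Laplace-smoothed estimate (Nk + 1) / (N + D) consistent with the definitions of Nk, N, and D above.

```python
from collections import Counter

# Invented training data: 3 positive and 7 negative tweets.
pos_tweets = ["awesome experience", "great product", "awesome service"]
neg_tweets = ["bad experience", "terrible product", "awful service",
              "bad support", "terrible delay", "worst experience", "bad app"]

pos_words = Counter(w for t in pos_tweets for w in t.split())
neg_words = Counter(w for t in neg_tweets for w in t.split())
N_pos, N_neg = sum(pos_words.values()), sum(neg_words.values())
D = len(set(pos_words) | set(neg_words))     # distinct words, both classes

def score(tweet, prior, counts, N):
    """Prior times the product of Laplace-smoothed word probabilities."""
    p = prior
    for w in tweet.split():
        p *= (counts[w] + 1) / (N + D)       # (Nk + 1) / (N + D)
    return p

tweet = "awesome experience"
p_pos = score(tweet, 0.3, pos_words, N_pos)  # expression (1)
p_neg = score(tweet, 0.7, neg_words, N_neg)  # expression (2)
print("positive" if p_pos > p_neg else "negative")   # positive
```

Even with the larger negative prior, the smoothed word probabilities dominate and the tweet is classified as positive.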
8.
Support Vector Machines
Main idea of SVMs
Find the linear separating hyperplane that
maximizes the margin, i.e., the optimal separating
hyperplane (OSH)
Supervised learning
Support vector machines are based on the Structural Risk Minimization principle from
computational learning theory. The idea of structural risk minimization is to find a hypothesis h for
which we can guarantee the lowest true error.
Why Should SVMs Work Well for Text Categorization?
• High dimensional input space
• Document vectors are sparse
• Few irrelevant features
• Most text categorization problems are linearly separable
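The "maximize the margin" idea can be illustrated with a toy sketch. The 2-D points and the three candidate hyperplanes below are invented; a real SVM solves a quadratic program to find the optimal separating hyperplane among all of them, rather than comparing a hand-picked list.

```python
import math

# Two linearly separable classes in 2-D.
pos = [(2.0, 2.0), (3.0, 3.0)]   # class +1
neg = [(0.0, 0.0), (1.0, 0.0)]   # class -1

def margin(w, b):
    """Smallest distance from any point to the hyperplane w.x + b = 0,
    or -inf if the hyperplane fails to separate the two classes."""
    norm = math.hypot(*w)
    d_pos = [ (w[0]*x + w[1]*y + b) / norm for x, y in pos]
    d_neg = [-(w[0]*x + w[1]*y + b) / norm for x, y in neg]
    if min(d_pos) <= 0 or min(d_neg) <= 0:
        return float("-inf")     # not a separating hyperplane
    return min(d_pos + d_neg)

# Three candidate separating hyperplanes (w, b), hand-picked.
candidates = [((1.0, 1.0), -3.0), ((0.0, 1.0), -1.2), ((1.0, 0.0), -1.5)]
best = max(candidates, key=lambda wb: margin(*wb))
print(best)
```

The candidate with the largest margin wins; in text categorization the same computation happens in a very high-dimensional, sparse term space.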
9. Methodology
Process flow: Documents → Preprocessing → Indexing and feature selection → Applying SVM classification algorithm → Performance measure
Transform documents into a suitable representation for
classification task
• Remove HTML or other tags
• Remove stopwords
• Perform word stemming
Indexing by different weighting schemes:
• Boolean weighting
• Word frequency weighting
Feature selection: Remove non-informative terms from
documents
• improve classification effectiveness
• reduce computational complexity
• K-Nearest-Neighbor algorithm (KNN)
• Decision Tree algorithm (DT)
• Naive Bayes algorithm (NB)
• Support Vector Machine (SVM)
Performance of algorithm:
– Training time
– Testing time
– Classification accuracy
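The preprocessing and indexing steps above can be sketched in Python (the deck's actual implementation uses R packages such as tm). The stopword list and sample document are tiny illustrative stand-ins, and word stemming is omitted for brevity.

```python
import re

# Tiny illustrative stopword list (real lists have hundreds of entries).
STOPWORDS = {"the", "a", "an", "is", "of", "to"}

def preprocess(doc):
    """Remove HTML tags and stopwords, lowercase, tokenize."""
    doc = re.sub(r"<[^>]+>", " ", doc)
    words = re.findall(r"[a-z]+", doc.lower())
    return [w for w in words if w not in STOPWORDS]

def index(tokens, vocabulary, boolean=False):
    """Boolean weighting: 1 if the term is present; word frequency
    weighting: the term's count in the document."""
    counts = {t: tokens.count(t) for t in vocabulary}
    return [min(counts[t], 1) if boolean else counts[t] for t in vocabulary]

vocab = ["svm", "text", "margin"]
tokens = preprocess("<p>The SVM is a text classifier: text in, labels out.</p>")
print(tokens)                              # ['svm', 'text', 'classifier', 'text', 'in', 'labels', 'out']
print(index(tokens, vocab))                # [1, 2, 0]  word frequency weighting
print(index(tokens, vocab, boolean=True))  # [1, 1, 0]  Boolean weighting
```

Feature selection would then drop non-informative vocabulary terms before the weighted vectors are passed to the classifier.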
10. Each document is a vector, one component for each term (= word).
Normalize to unit length.
High-dimensional vector space:
Terms are axes
10,000+ dimensions, or even 100,000+
Docs are vectors in this space
Each training doc a point (vector) labeled by its topic (= class)
Hypothesis: docs of the same class form a contiguous region of space
We define surfaces to delineate classes in space
The set of records available for developing classification methods is divided
into two disjoint subsets: a training set and a test set.
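The document representation above, one vector component per term, normalized to unit length, can be sketched directly. The term list and sample document are invented for illustration; a real term space has tens of thousands of axes.

```python
import math

# Fixed term list: one axis per term.
terms = ["ball", "goal", "election", "vote"]

def doc_vector(doc):
    """Term-count vector over the fixed term list, scaled to unit length."""
    words = doc.lower().split()
    v = [words.count(t) for t in terms]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

v = doc_vector("goal goal ball")
print(v)   # unit-length vector along the 'ball' and 'goal' axes
```

Each labeled training document becomes one such point, and the classifier's job is to define surfaces separating the class regions in this space.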
Process
11. SVM model implementation in R
Prepare the algorithm to classify the text documents
Train and Test the model
Measure the performance of SVM model
Things to perform..
Packages to be used in R
RTextTools
e1071 (SVM), rpart
tm, stringr, plyr
arules
12. LITERATURE REVIEW

Title: Text Categorization with Support Vector Machines: Learning with Many Relevant Features
Author and publication: Thorsten Joachims, Universität Dortmund, Informatik LS8, Baroper Str. 301, 44221 Dortmund, Germany
Learnings: Describes the particular properties of learning with text data and identifies why SVMs are appropriate
Link: PDF

Title: Automatic Text Categorization and Its Application to Text Retrieval
Author and publication: Wai Lam, Miguel Ruiz, and Padmini Srinivasan, November/December 1999
Learnings: The application of automatic categorization to text retrieval
Link: PDF

Title: SVM Tutorial
Author and publication: Alexandre Kowalczyk, 23 November 2015
Learnings: How to classify text in R
Link: PDF