As part of my Information Retrieval module at the University of East Anglia, we had to build a classifier to detect deceptive review spam. Review spam was described by Nitin Jindal as follows: "It is now a common practice for e-commerce Web sites to enable their customers to write reviews of products that they have purchased. Such reviews provide valuable sources of information on these products... Unfortunately, this importance of reviews also gives good incentive for spam, which contains false positive or malicious negative opinions". Below is my poster presentation, for which I implemented three classification algorithms in Python, along with feature-selection and preprocessing modules.
1. Review Spam Classification
Tarek Amr – University of East Anglia
Introduction

- Detecting Review Spam
- Classification Algorithms:
  • Naive Bayes
    • Multinomial
    • Multivariate (Bernoulli)
  • Rocchio (Cosine/Euclidean)
  • K-Nearest Neighbour (Cosine/Euclidean)
- Preprocessors / Feature Selection:
  • N-gram Tokenizer
  • Stemming* (Porter/Lancaster)
  • Part-of-Speech Tagger*
  • Pruning of infrequent words
  • Mutual Information**
- Results Evaluation:
  • Accuracy
  • Precision / Recall
  • F-Score (a = 1/2 => 2PR/(P+R))

* NLTK package was used   ** Stand-alone

Feature Selection

[Joachims-1996] listed three steps for feature selection:
- Pruning of infrequent words (we kept words occurring 3+ times).
- Pruning of highly frequent words (stop words).
- Choosing words with high Mutual Information.

Mutual Information

- Top-10 terms with the highest MI (shown on the poster).
- Similar to the [Ott-2011] findings using LIWC.
- Almost the same term ranking with the Porter stemmer.
- Rocchio only went from 78.25% to 78.5% with the Porter stemmer (p >> 0.05).
- Somehow, bigram and trigram rankings didn't change much from unigrams: 'michigan ave' vs 'michigan', 'the floor' vs 'floor', 'husband and' and 'my husband' vs 'husband', etc.
- Removing stop words!?
- Rocchio results for unigrams (78.25%), bigrams (81.125%) and trigrams (78.625%) [p = 0.178].
- We also agreed with [Rayson-2001] and [Ott-2011] regarding (Truthful) PoS tags.

Size of Dataset

- According to [Joachims-1996], Rocchio excels when training data is smaller.
- Naive Bayes outperformed the TFIDF algorithms; however, their improvement does not increase at the same rate as Naive Bayes'.
- We trained Rocchio (Cosine distance) and Naive Bayes (MV) on subsets of our data, and plotted the results (see poster).

Results

Naive Bayes (pruning of infrequent words):
- Multivariate: ↑ accuracy (87.63% => 87.88%)
  • Not statistically significant (p = 0.58 >> 0.05).
  • Same for precision and recall.
- Multinomial: ↓ accuracy (88.5% => 87.88%)

Rocchio (pruning of infrequent words):
- Accuracy is steady until frequency < 7, then it degrades.
- My interpretation (scientific!?): pruning truncates *shallow* axes in the vector space, where the centroid is already not able to move much.

K-Nearest Neighbor

- Classification accuracy (in percent) was plotted for different values of k, using different features (see poster).
- We got the best result (accuracy = 73.875%) when k was set to 105.
- Notice: we set k = k – 1 if k is an even number.
- Notice how accuracy goes to 50% when k = the number of documents (we have equal numbers of Truthful and Deceptive documents).

k-NN Continued

- As stated by [Han-2000]: "A major drawback of the similarity measure used in k-NN is that it uses all features equally in computing similarities. This can lead to poor similarity measures and classification errors, when only a small subset of the words is useful for classification".

Average Accuracy

- Naive Bayes [Multivariate, Terms] = 87.625%
- Naive Bayes [Multinomial, Terms] = 88.5%
- Rocchio [Cosine, Terms] = 78.25%
- Rocchio [Cosine, Bigrams] = 81.125%
- KNN [Cosine, Min. Freq = 3, k = 153] = 76.375%

Naive Bayes MV has slightly better recall than MN (0.92 @ p = 0.18), while MN has slightly better precision (0.88 @ p = 0.012). Both are much more precise than Rocchio (p < 0.01), and have better recall too (p < 0.05). However, as we have seen earlier, Rocchio excels when trained on fewer data.

Twitter Dataset (@AppleNws and @NokiaUS)

- 5 folds x 40 tweets.
- Accuracy: NB (97.47%), Rocchio (92.47%), NB/PoS (85.91%).
- Apple is a bot.
- Nokia used 'You', 'Your' and 'RT' more.
- Nokia uses more personal pronouns, whereas Apple uses more hashtags.

Conclusion

- The statistical nature of text varies from one dataset to the other, and results vary accordingly.
- With fewer data, Rocchio outperforms NB.
- kNN is resource intensive, especially at testing time.
- Feature selection is more suitable for both Naive Bayes MV and kNN.
- Mutual Information helps in visualizing our data, let alone its use for feature selection.
- It would be better to try combining MI into our classifiers and check the results accordingly.
- Stemming and n-grams did not offer any significant improvement, due to the nature of the top informative terms.
- Our results for PoS using Rocchio and NB were far away from the SVM/PoS results.

References

- A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization [Joachims-1996]
- Centroid-based Document Classification: Analysis and Experimental Results [Han-2000]
- Grammatical Word Class Variation within the British National Corpus Sampler [Rayson-2001]
- Finding Deceptive Opinion Spam by Any Stretch of the Imagination [Ott-2011]
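To make the Mutual Information step concrete, here is a minimal Python sketch (not the original code; the function name and toy data are mine) of the standard expected-MI score between a term's presence/absence and a binary class label, computed from document counts:

```python
import math

def mutual_information(docs, labels, term):
    """Expected mutual information between a term's presence/absence and a
    binary class label, computed from document counts over the four
    term/class cells; zero-count cells contribute nothing to the sum."""
    n = len(docs)
    n11 = sum(1 for d, y in zip(docs, labels) if term in d and y == 1)
    n10 = sum(1 for d, y in zip(docs, labels) if term in d and y == 0)
    n01 = sum(1 for d, y in zip(docs, labels) if term not in d and y == 1)
    n00 = n - n11 - n10 - n01
    cells = [
        (n11, n11 + n10, n11 + n01),  # term present, class 1
        (n10, n11 + n10, n10 + n00),  # term present, class 0
        (n01, n01 + n00, n11 + n01),  # term absent,  class 1
        (n00, n01 + n00, n10 + n00),  # term absent,  class 0
    ]
    mi = 0.0
    for n_cell, n_term, n_class in cells:
        if n_cell > 0:
            mi += (n_cell / n) * math.log2((n * n_cell) / (n_term * n_class))
    return mi

# Ranking the vocabulary by this score yields a "top-10 terms" list:
# vocab = {t for d in docs for t in d}
# top10 = sorted(vocab, key=lambda t: -mutual_information(docs, labels, t))[:10]
```

A term that perfectly separates the two classes scores 1 bit; a term present in every document scores 0, which is why such terms are candidates for pruning.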
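The Rocchio (nearest-centroid) classifier with cosine similarity can be sketched as follows; this is a from-scratch illustration under my own naming and toy data, not the poster's actual implementation:

```python
import math
from collections import Counter, defaultdict

def centroid(vectors):
    """Mean of a list of sparse term-frequency vectors (term -> weight)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

class Rocchio:
    """Nearest-centroid (Rocchio) classifier: one centroid per class,
    classification by highest cosine similarity."""
    def fit(self, docs, labels):
        by_class = defaultdict(list)
        for tokens, y in zip(docs, labels):
            by_class[y].append(Counter(tokens))
        self.centroids = {y: centroid(vs) for y, vs in by_class.items()}
        return self

    def predict(self, tokens):
        v = Counter(tokens)
        return max(self.centroids, key=lambda y: cosine(self.centroids[y], v))
```

Swapping cosine for Euclidean distance (the poster's other Rocchio variant) only changes the comparison function; the centroid computation stays the same.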
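The k-NN detail noted above (setting k = k – 1 whenever k is even, so a two-class majority vote cannot tie) can be sketched like this; again a minimal illustration with my own names, not the original module:

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k):
    """Cosine k-NN over token lists, with the poster's tweak: decrement an
    even k by one so the two-class majority vote cannot end in a tie."""
    if k % 2 == 0:
        k -= 1
    def cos(u, v):
        dot = sum(w * v.get(t, 0) for t, w in u.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    q = Counter(query)
    # Score every training document against the query (this full scan is
    # exactly why kNN is resource intensive at testing time).
    ranked = sorted(zip((cos(q, Counter(d)) for d in train), labels), reverse=True)
    votes = Counter(y for _, y in ranked[:k])
    return votes.most_common(1)[0][0]
```

With k equal to the total number of documents and a 50/50 Truthful/Deceptive split, every vote is (nearly) balanced, which is the 50%-accuracy effect described above.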
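Finally, the evaluation metrics from the introduction reduce to a few lines; the F-Score with a = 1/2 is the harmonic mean 2PR/(P+R) (function name mine):

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score (a = 1/2 => F = 2PR/(P+R))
    from the four cells of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score
```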