As part of my Information Retrieval module at the University of East Anglia, we had to build a classifier to detect deceptive review spam. Review spam was described by Nitin Jindal as follows: "It is now a common practice for e-commerce Web sites to enable their customers to write reviews of products that they have purchased. Such reviews provide valuable sources of information on these products... Unfortunately, this importance of reviews also gives good incentive for spam, which contains false positive or malicious negative opinions". Below is my poster presentation, for which I implemented three classification algorithms in Python, along with feature-selection and preprocessing modules.
1. Review Spam Classification
Tarek Amr – University of East Anglia
Introduction

- Detecting Review Spam
- Classification Algorithms:
  • Naive Bayes
    • Multinomial
    • Multivariate (Bernoulli)
  • Rocchio (Cosine/Euclidean)
  • K-Nearest Neighbour (Cosine/Euclidean)
- Preprocessors / Feature Selection:
  • N-gram Tokenizer
  • Stemming* (Porter/Lancaster)
  • Part-of-Speech Tagger*
  • Pruning of infrequent words
  • Mutual Information**
- Results Evaluation:
  • Accuracy
  • Precision / Recall
  • F-Score (a = 1/2 => 2PR/(P+R))

* NLTK package was used   ** Stand-alone

Feature Selection

[Joachims-1996] listed three steps for feature selection:
- Pruning of infrequent words (we kept words occurring 3+ times).
- Pruning of highly frequent words (stop words).
- Choosing words with high Mutual Information.

Mutual Information

- Top-10 terms with the highest MI (shown on the poster).
- Similar to the [Ott-2011] findings using LIWC.
- Almost the same term ranking with the Porter stemmer.
- Rocchio only went from 78.25% to 78.5% with the Porter stemmer (p >> 0.05).
- Somehow, bigram and trigram rankings didn't change much from unigrams: 'michigan ave' vs 'michigan', 'the floor' vs 'floor', 'husband and' and 'my husband' vs 'husband', etc.
- Removing stop words!?
- Rocchio results for unigrams (78.25%), bigrams (81.125%) and trigrams (78.625%) [p = 0.178].
- We also agreed with [Rayson-2001] and [Ott-2011] regarding (Truthful) PoS tags.

Size of Dataset

- According to [Joachims-1996], Rocchio excels when training data is smaller.
- Naive Bayes outperformed the TFIDF algorithms; however, their improvement does not increase at the same rate as Naive Bayes'.
- We trained Rocchio (Cosine distance) and Naive Bayes (MV) on subsets of our data, and plotted the results (see poster).

Results

Naive Bayes (pruning of infrequent words):
- Multivariate: ↑ accuracy (87.63% => 87.88%)
  • Not statistically significant (p = 0.58 >> 0.05).
  • Same for precision and recall.
- Multinomial: ↓ accuracy (88.5% => 87.88%)

Rocchio (pruning of infrequent words):
- Accuracy is steady until frequency < 7, then it degrades.
- My interpretation (scientific!?): pruning truncates *shallow* axes in the vector space, where the centroid is already not able to move much.

K-Nearest Neighbor

- Classification accuracy (in percent) was plotted for different values of k, using different features (see poster).
- We got the best result (accuracy = 73.875%) when k was set to 105.
- Notice: we set k = k – 1 if k is an even number.
- Notice how accuracy goes to 50% when k = the number of documents (we have equal numbers of Truthful and Deceptive documents).

k-NN Continued

- As stated by [Han-2000]: "A major drawback of the similarity measure used in k-NN is that it uses all features equally in computing similarities. This can lead to poor similarity measures and classification errors, when only a small subset of the words is useful for classification".

Average Accuracy

- Naive Bayes [Multivariate, Terms] = 87.625%
- Naive Bayes [Multinomial, Terms] = 88.5%
- Rocchio [Cosine, Terms] = 78.25%
- Rocchio [Cosine, Bigrams] = 81.125%
- KNN [Cosine, Min. Freq = 3, k = 153] = 76.375%

Naive Bayes MV has slightly better recall than MN (0.92 @ p = 0.18), while MN has slightly better precision (0.88 @ p = 0.012). Both are much more precise than Rocchio (p < 0.01), and have better recall too (p < 0.05). However, as we have seen earlier, Rocchio excels when trained on fewer data.

Twitter Dataset (@AppleNws and @NokiaUS)

- 5 folds x 40 tweets.
- Accuracy: NB (97.47%), Rocchio (92.47%), NB/PoS (85.91%).
- Apple is a bot.
- Nokia used 'You', 'Your' and 'RT' more.
- Nokia uses more personal pronouns, whereas Apple uses more hashtags.

Conclusion

- The statistical nature of text varies from one dataset to the other, and results vary accordingly.
- With fewer data, Rocchio outperforms NB.
- kNN is resource intensive, especially at testing time.
- Feature selection is more suitable for both Naive Bayes MV and kNN.
- Mutual Information helps in visualizing our data, let alone its use for feature selection.
- It would be better to try combining MI into our classifiers and check the results accordingly.
- Stemming and n-grams did not offer any significant improvement, due to the nature of the top informative terms.
- Our results for PoS using Rocchio and NB were far away from the SVM/PoS results.

References

- A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization [Joachims-1996]
- Centroid-based Document Classification: Analysis and Experimental Results [Han-2000]
- Grammatical Word Class Variation within the British National Corpus Sampler [Rayson-2001]
- Finding Deceptive Opinion Spam by Any Stretch of the Imagination [Ott-2011]
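To make the Mutual Information step concrete, here is a minimal Python sketch (not the original code; the function name and toy data are mine) of the standard expected-MI score between a term's presence/absence and a binary class label, computed from document counts:

```python
import math

def mutual_information(docs, labels, term):
    """Expected mutual information between a term's presence/absence and a
    binary class label, computed from document counts over the four
    term/class cells; zero-count cells contribute nothing to the sum."""
    n = len(docs)
    n11 = sum(1 for d, y in zip(docs, labels) if term in d and y == 1)
    n10 = sum(1 for d, y in zip(docs, labels) if term in d and y == 0)
    n01 = sum(1 for d, y in zip(docs, labels) if term not in d and y == 1)
    n00 = n - n11 - n10 - n01
    cells = [
        (n11, n11 + n10, n11 + n01),  # term present, class 1
        (n10, n11 + n10, n10 + n00),  # term present, class 0
        (n01, n01 + n00, n11 + n01),  # term absent,  class 1
        (n00, n01 + n00, n10 + n00),  # term absent,  class 0
    ]
    mi = 0.0
    for n_cell, n_term, n_class in cells:
        if n_cell > 0:
            mi += (n_cell / n) * math.log2((n * n_cell) / (n_term * n_class))
    return mi

# Ranking the vocabulary by this score yields a "top-10 terms" list:
# vocab = {t for d in docs for t in d}
# top10 = sorted(vocab, key=lambda t: -mutual_information(docs, labels, t))[:10]
```

A term that perfectly separates the two classes scores 1 bit; a term present in every document scores 0, which is why such terms are candidates for pruning.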
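The Rocchio (nearest-centroid) classifier with cosine similarity can be sketched as follows; this is a from-scratch illustration under my own naming and toy data, not the poster's actual implementation:

```python
import math
from collections import Counter, defaultdict

def centroid(vectors):
    """Mean of a list of sparse term-frequency vectors (term -> weight)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

class Rocchio:
    """Nearest-centroid (Rocchio) classifier: one centroid per class,
    classification by highest cosine similarity."""
    def fit(self, docs, labels):
        by_class = defaultdict(list)
        for tokens, y in zip(docs, labels):
            by_class[y].append(Counter(tokens))
        self.centroids = {y: centroid(vs) for y, vs in by_class.items()}
        return self

    def predict(self, tokens):
        v = Counter(tokens)
        return max(self.centroids, key=lambda y: cosine(self.centroids[y], v))
```

Swapping cosine for Euclidean distance (the poster's other Rocchio variant) only changes the comparison function; the centroid computation stays the same.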
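The k-NN detail noted above (setting k = k – 1 whenever k is even, so a two-class majority vote cannot tie) can be sketched like this; again a minimal illustration with my own names, not the original module:

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k):
    """Cosine k-NN over token lists, with the poster's tweak: decrement an
    even k by one so the two-class majority vote cannot end in a tie."""
    if k % 2 == 0:
        k -= 1
    def cos(u, v):
        dot = sum(w * v.get(t, 0) for t, w in u.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    q = Counter(query)
    # Score every training document against the query (this full scan is
    # exactly why kNN is resource intensive at testing time).
    ranked = sorted(zip((cos(q, Counter(d)) for d in train), labels), reverse=True)
    votes = Counter(y for _, y in ranked[:k])
    return votes.most_common(1)[0][0]
```

With k equal to the total number of documents and a 50/50 Truthful/Deceptive split, every vote is (nearly) balanced, which is the 50%-accuracy effect described above.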
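Finally, the evaluation metrics from the introduction reduce to a few lines; the F-Score with a = 1/2 is the harmonic mean 2PR/(P+R) (function name mine):

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score (a = 1/2 => F = 2PR/(P+R))
    from the four cells of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score
```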