Review Spam Classification
Tarek Amr – University of East Anglia


Introduction

- Detecting Review Spam
- Classification Algorithms:
    • Naive Bayes
        • Multinomial
        • Multivariate (Bernoulli)
    • Rocchio (Cosine/Euclidean)
    • K-Nearest Neighbour (Cosine/Euclidean)
- Preprocessors / Feature Selection:
    • N-gram Tokenizer
    • Stemming* (Porter/Lancaster)
    • Part-of-Speech Tagger*
    • Pruning of infrequent words
    • Mutual Information**
- Results Evaluation (sketched below):
    • Accuracy
    • Precision / Recall
    • F-Score (α = 1/2 => F = 2PR/(P+R))

* NLTK package was used    ** Stand-alone implementation
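A minimal sketch of the evaluation metrics listed above, assuming binary Truthful/Deceptive labels with "deceptive" as the positive class (the function name and label string are illustrative, not from the poster); the balanced F-score with α = 1/2 reduces to 2PR/(P+R):

    def evaluate(predicted, actual, positive="deceptive"):
        """Accuracy, precision, recall and the balanced F-score (2PR/(P+R))."""
        pairs = list(zip(predicted, actual))
        tp = sum(1 for p, a in pairs if p == positive and a == positive)
        fp = sum(1 for p, a in pairs if p == positive and a != positive)
        fn = sum(1 for p, a in pairs if p != positive and a == positive)
        accuracy = sum(1 for p, a in pairs if p == a) / len(pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return accuracy, precision, recall, f_score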
Feature Selection

[Joachims-1996] listed three steps for feature selection:
- Pruning of infrequent words (we kept words appearing 3+ times).
- Pruning of highly frequent words (stop words).
- Choosing words with high Mutual Information (see the sketch below).
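The Mutual Information scorer was a stand-alone implementation, so the exact code is not on the poster; the following is a hedged sketch of how MI between a term's presence and the class label could be estimated from per-document binary occurrence counts:

    import math
    from collections import Counter

    def mutual_information(docs, labels, term):
        """MI between a term's presence/absence and the class, where docs is a
        list of token sets and labels is the matching list of class labels."""
        n = len(docs)
        joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
        mi = 0.0
        for (has_term, label), count in joint.items():
            p_joint = count / n
            p_term = sum(c for (h, _), c in joint.items() if h == has_term) / n
            p_class = sum(c for (_, l), c in joint.items() if l == label) / n
            mi += p_joint * math.log2(p_joint / (p_term * p_class))
        return mi

    # Rank the (pruned) vocabulary and keep the ten highest-MI terms:
    # vocab = {t for doc in docs for t in doc}
    # top10 = sorted(vocab, key=lambda t: mutual_information(docs, labels, t))[-10:]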
Naive Bayes (pruning of infrequent words)
- Multivariate: ↑ Accuracy (87.63% => 87.88%)
    - Not statistically significant (p = 0.58 >> 0.05).
    - Same for Precision and Recall.
- Multinomial: ↓ Accuracy (88.5% => 87.88%)

Rocchio (pruning of infrequent words)
- Accuracy stays steady until the minimum-frequency threshold reaches 7, then it degrades.
- My interpretation (scientific!?):
    - Pruning truncates *shallow* axes in the Vector Space.
    - The centroid is already unable to move much along those axes.
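The poster does not say which library implemented the classifiers themselves; as a hedged illustration, the multivariate (Bernoulli) and multinomial Naive Bayes variants compared above could be run like this with scikit-learn, using min_df=3 to approximate the pruning of infrequent words (load_reviews() is a hypothetical loader, not from the poster):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB

    reviews, labels = load_reviews()        # hypothetical loader for the review corpus
    vectorizer = CountVectorizer(min_df=3)  # drop terms seen in fewer than 3 documents
    X = vectorizer.fit_transform(reviews)

    for name, clf in [("Multivariate (Bernoulli)", BernoulliNB()),
                      ("Multinomial", MultinomialNB())]:
        scores = cross_val_score(clf, X, labels, cv=5, scoring="accuracy")
        print(name, round(scores.mean() * 100, 3), "%")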
Mutual Information

Top-10 terms with highest MI:
- Similar to the [Ott-2011] findings using LIWC.
- Almost the same term ranking with the Porter stemmer.
- Rocchio only went from 78.25% to 78.5% with the Porter stemmer (p >> 0.05).
- Bigram and trigram rankings did not change much from the unigram ranking: 'michigan ave' vs 'michigan', 'the floor' vs 'floor', 'husband and' / 'my husband' vs 'husband', etc. (see the tokenizer sketch below).
- Removing stop words!?
- Rocchio results: unigrams (78.25%), bigrams (81.125%), trigrams (78.625%) [p = 0.178].
- We also agreed with [Rayson-2001] and [Ott-2011] regarding (Truthful) PoS tags.
- Twitter dataset (@AppleNws and @NokiaUS), 5 folds x 40 tweets:
    - Apple is a bot.
    - Nokia used 'You', 'Your' and 'RT' more.
    - Nokia uses more personal pronouns, whereas Apple uses more hashtags.
    - NB (97.47%), Rocchio (92.47%), NB/PoS (85.91%).
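The n-gram tokenizer and Porter stemmer came from NLTK; a minimal sketch of how the stemmed n-grams above could be produced (the function name is illustrative):

    import nltk   # requires the 'punkt' tokenizer data: nltk.download('punkt')
    from nltk.stem import PorterStemmer
    from nltk.util import ngrams

    stemmer = PorterStemmer()

    def stemmed_ngrams(text, n=1):
        """Lowercase, tokenize, stem with Porter, then join consecutive tokens into n-grams."""
        tokens = [stemmer.stem(tok) for tok in nltk.word_tokenize(text.lower())]
        return [" ".join(gram) for gram in ngrams(tokens, n)]

    stemmed_ngrams("My husband loved the floor on Michigan Ave", n=2)
    # -> ['my husband', 'husband love', 'love the', 'the floor', 'floor on', ...]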
Size of Dataset

- According to [Joachims-1996], Rocchio excels when the training data is smaller.
- However, its accuracy does not improve at the same rate as Naive Bayes as more data is added.
- We trained Rocchio (Cosine distance) and Naive Bayes (MV) on subsets of our data and plotted the results (the subset experiment is sketched below).
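A hedged sketch of how such a subset experiment might be scripted (the helper name is illustrative; any classifier with scikit-learn-style fit/score would work for Rocchio or Naive Bayes):

    import numpy as np

    def accuracy_vs_training_size(clf, X_train, y_train, X_test, y_test,
                                  fractions=(0.1, 0.25, 0.5, 0.75, 1.0), seed=0):
        """Fit the classifier on growing random subsets of the training fold
        and record the test-fold accuracy for each subset size."""
        y_train = np.asarray(y_train)
        order = np.random.default_rng(seed).permutation(X_train.shape[0])
        results = []
        for frac in fractions:
            idx = order[: max(1, int(frac * len(order)))]
            clf.fit(X_train[idx], y_train[idx])
            results.append((frac, clf.score(X_test, y_test)))
        return results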
K-Nearest Neighbor

- We got the best result (Accuracy = 73.875%) when k was set to 105.
- Notice: we set k = k – 1 if k is an even number (see the sketch below).
- Notice how accuracy drops to 50% when k equals the number of documents (we have equal numbers of Truthful and Deceptive documents).

k-NN Continued

- As stated by [Han-2000]: "A major drawback of the similarity measure used in k-NN is that it uses all features equally in computing similarities. This can lead to poor similarity measures and classification errors, when only a small subset of the words is useful for classification."
- We measured the classification accuracy (in %) for different values of k, using different features.
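A hedged sketch of the cosine-similarity k-NN voting described above (an illustrative reimplementation, not the original code); decrementing an even k avoids ties between the two balanced classes:

    import numpy as np
    from collections import Counter
    from sklearn.preprocessing import normalize

    def knn_predict(X_train, y_train, X_test, k=105):
        """Cosine-similarity k-NN with a majority vote over the k nearest training documents."""
        if k % 2 == 0:
            k -= 1
        sims = normalize(X_test) @ normalize(X_train).T   # unit rows => dot product == cosine
        sims = sims.toarray() if hasattr(sims, "toarray") else np.asarray(sims)
        y_train = np.asarray(y_train)
        predictions = []
        for row in sims:
            neighbours = y_train[np.argsort(row)[-k:]]
            predictions.append(Counter(neighbours).most_common(1)[0][0])
        return predictions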
Results

Average Accuracy:
- Naive Bayes [Multivariate, Terms] = 87.625%
- Naive Bayes [Multinomial, Terms] = 88.5%
- Rocchio [Cosine, Terms] = 78.25%
- Rocchio [Cosine, Bigrams] = 81.125%
- kNN [Cosine, Min. Freq. = 3, k = 153] = 76.375%

Naive Bayes MV has slightly better recall than MN (0.92 @ p = 0.18), while MN has slightly better precision (0.88 @ p = 0.012). Both are much more precise than Rocchio (p < 0.01) and have better recall too (p < 0.05). However, as we have seen earlier, Rocchio excels when trained on fewer data.
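For reference, a minimal sketch of a Rocchio-style centroid classifier with cosine similarity, of the kind reported in the runs above (an illustrative reimplementation, not the original code):

    import numpy as np
    from sklearn.preprocessing import normalize

    class CosineRocchio:
        """Centroid-based (Rocchio) classifier: one centroid per class; a test
        document is assigned to the class whose centroid it is most similar to."""
        def fit(self, X, y):
            X, y = normalize(X), np.asarray(y)
            self.classes_ = np.unique(y)
            centroids = [np.asarray(X[y == c].mean(axis=0)).ravel() for c in self.classes_]
            self.centroids_ = normalize(np.vstack(centroids))
            return self

        def predict(self, X):
            sims = normalize(X) @ self.centroids_.T
            sims = sims.toarray() if hasattr(sims, "toarray") else np.asarray(sims)
            return self.classes_[np.argmax(sims, axis=1)]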
Conclusion

- The statistical nature of text varies from one dataset to another, and results vary accordingly.
- Naive Bayes outperformed the TFIDF algorithms.
- With fewer data, Rocchio outperforms NB.
- kNN is resource intensive, especially at testing time.
- Feature selection is more suitable for both Naive Bayes MV and kNN.
- Mutual Information helps in visualizing our data, in addition to its use for Feature Selection.
- It would be better to try combining MI into our classifiers and check the results accordingly (a possible setup is sketched below).
- Stemming and n-grams did not offer any significant improvement, due to the nature of the top informative terms.
- Our results for PoS using Rocchio and NB were far from the SVM/PoS results.
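This was not part of the original experiments, but one way the MI-based selection proposed above could be wired into a classifier is with scikit-learn's mutual information scorer (the k of 1000 is purely illustrative):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    pipeline = make_pipeline(
        CountVectorizer(min_df=3),                 # prune infrequent words
        SelectKBest(mutual_info_classif, k=1000),  # keep the 1000 highest-MI features
        MultinomialNB(),
    )
    # pipeline.fit(train_reviews, train_labels)
    # print(pipeline.score(test_reviews, test_labels))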
References

- A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. [Joachims-1996]
- Centroid-based Document Classification: Analysis and Experimental Results. [Han-2000]
- Grammatical Word Class Variation within the British National Corpus Sampler. [Rayson-2001]
- Finding Deceptive Opinion Spam by Any Stretch of the Imagination. [Ott-2011]
