SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Opinion Spam and Analysis
Nitin Jindal and Bing Liu
(WSDM, 2008)
1
three types of spam reviews
• Type 1 (untruthful opinions): also known as fake reviews or bogus
reviews.
2
deliberately mislead
readers or opinion mining systems
1.To promote some target objects
(hype spam).
2. To damage the reputation of
some other target objects
(defaming spam).
three types of spam reviews
• Type 2 (reviews on brands only): not comment on the products for
the products but only the brands, the manufacturers or the sellers.
• Type 3 (non-reviews): two main sub-types:
• advertisements
• other irrelevant reviews containing no opinions (questions, answers ..)
• They are not targeted at the specific products and are often biased.
3
Review Data from
amazon.com
• Each amazon.com’s review consists of 8 parts
• <Product ID> <Reviewer ID> <Rating> <Date> <Review Title> <Revie
>
4
1
2
3 45
6
7 8
5
Reviews, Reviewer, and
Products
6
Figure 1. Log-log plot of number of reviews
to number of members for amazon.
Figure 2. Log-log plot of number of reviews
to number of products for amazon.
• There are 2 reviewers with more than 15,000 reviews.
• 68% of reviewers wrote only 1 review.
• 50% of products have only 1 review.
• Only 19% of the products have at least 5 reviews.
Review Ratings and
Feedbacks
7
Figure 3. Log-log plot of number of reviews
to number of feedbacks for amazon.
Figure 4. Rating vs. percent of reviews
• 60% of the reviews have a rating of 5.0.
• On average, a review gets 7 feedbacks.
• The percentage of positive feedbacks of a review decreases rapidly
from the first review of a product to the last.
Spam Detection
• a classification problem with two classes, spam and non-spam.
• manually label training examples for spam reviews of type 2
and type 3.
• recognizing whether a review is an untruthful opinion spam
(type 1) is extremely difficult by manually reading the review.
• found a large number of duplicate and near-duplicate reviews.
• 1. Duplicates from different userids on the same product.
• 2. Duplicates from the same userid on different products.
• 3. Duplicates from different userids on different products.
8
Detection of Duplicate Reviews
• shingle method (2-grams)
• Jaccard distance (Similarity score) > 90%  duplicates.
• The maximum similarity score : the maximum of similarity
scores between different reviews of a reviewer.
9
Detecting Type 2 & Type 3
• Model Building Using Logistic Regression
• Manually labeled 40 spam reviews of the two types.
• It produces a probability estimate of each review being a spam.
10
Feature Identification and
Construction
• There are three main types of information
• (1) the content of the review,
• (2) the reviewer who wrote the review,
• (3) the product being reviewed.
• three types of features:
• (1) review centric features,
• (2) reviewer centric features,
• (3) product centric features.
• three types based on their average ratings
• Good (rating ≥ 4), bad (rating ≤ 2.5) and Average, otherwise
11
Review Centric Features
• number of
feedbacks(F1)
• number of helpful
feedbacks(F2)
• percent of helpful
feedbacks(F3)
• length of the review
title(F4)
• length of review
body(F5)
• Position of the
review of a product
sorted by date.
ascending (F6) and
descending (F7)
• the first review (F8)
• the only review (F9)
12
Review Centric Features -Textual
features• percent of positive bearing
words(F10)
• percent of negative bearing
words(F11)
• cosine similarity (F12)
• percent of times brand
name (F13)
• percent of numerals
words(F14)
• percent of capitals
words(F15)
• percent of all capital
words(F16)
13
• rating of the review (F17)
• the deviation from product rating (F18)
• the review is good, average or bad (F19)
• a bad review was written just after the first good review of
the product and vice versa (F20, F21)
Review Centric Features –Rating related features
Reviewer Centric Features
• Ratio of the number of reviews that the reviewer wrote which
were the first reviews (F22)
• ratio of the number of cases in which he/she was the only
reviewer (F23)
• average rating given by reviewer (F24)
• standard deviation in rating (F25)
• the reviewer always gave only good, average or bad rating
(F26)
• a reviewer gave both good and bad ratings (F27)
• a reviewer gave good rating and average rating (F28)
• a reviewer gave bad rating and average rating (F29)
• a reviewer gave all three ratings (F30)
• percent of times that a reviewer wrote a review with binary
features F20 (F31) and F21 (F32).
14
Product Centric Features
• Price of the product (F33)
• Sales rank of the product (F34)
• Average rating (F35)
• standard deviation in ratings (F36)
15
Results of Type 2 and Type 3 Spam
Detection
• logistic regression on the data using 470 spam reviews for
positive class and rest of the reviews for negative class.
• 10-fold cross validation.
16
Analysis of Type 1 Spam Reviews
• 1. To promote some target objects (hype spam).
• 2. To damage the reputation of some other target
objects (defaming spam).
17
Making Use of Duplicates
• the same person writes the same review for different versions
of the same product may not be spam.
• We propose to treat all duplicate spam reviews as positive
examples, and the rest of the reviews as negative examples.
• We then use them to learn a model to identify non-duplicate
reviews with similar characteristics, which are likely to be spam
reviews.
18
Model Building Using
Duplicates
• building the logistic regression model using duplicates and
non-duplicates is not for detecting duplicate spam
• Our real purpose is to use the model to identify type 1 spam
reviews that are not duplicated.
• check whether it can predict outlier reviews
19
• Outlier reviews : whose ratings deviate from the average
product rating a great deal.
• Sentiment classification techniques may be used to
automatically assign a rating to a review solely based on its
review content.
20
Predicting Outlier Reviews
• Negative deviation is considered as less than -1 from the
mean rating and positive deviation as larger than +1 from the
mean rating of the product.
• Spammers may not want their review ratings to deviate too
much from the norm to make the reviews too suspicious.
21
22
X% of reviews
(the test data)
cumulated
percentage of
reviews of the
current bin
Some Other Interesting Reviews
- Only Reviews
• We did not use any position features of reviews (F6, F7, F8, F9, F20 and
F21) and number of reviews of product (F12, F18, F35, and F36) related
features in model building to prevent overfitting.
• only reviews are very likely to be candidates of spam
23
- Reviews from Top-Ranked Reviewers
• Top-ranked reviewers generally write a large number of reviews, much
more than bottom ranked reviewers.
• Top ranked reviewers also score high on some important indicators of
spam reviews.
 top ranked reviewers are less trustworthy as compared to bottom ranked
reviewers.
24
- Reviews with Different Levels of Feedbacks
• If usefulness of a review is defined based on the feedbacks
that the review gets, it means that people can be readily
fooled by a spam review.
feedback spam
25
- Reviews of Products with Varied Sales
Ranks
• Spam activities are more limited to low selling products. 
difficult to damage reputation of a high selling or popular
product by writing a spam review.
26
Conclusions & Future work
• Results showed that the logistic regression model is highly
effective.
• It is very hard to manually label training examples for type 1
spam.
• We will further improve the detection methods, and also look
into spam in other kinds of media, e.g., forums and blogs.
27

Weitere ähnliche Inhalte

Was ist angesagt?

Distribution transparency and Distributed transaction
Distribution transparency and Distributed transactionDistribution transparency and Distributed transaction
Distribution transparency and Distributed transactionshraddha mane
 
Virtualization (Distributed computing)
Virtualization (Distributed computing)Virtualization (Distributed computing)
Virtualization (Distributed computing)Sri Prasanna
 
Radial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and DhanashriRadial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and Dhanashrisheetal katkar
 
Applications of Distributed Systems
Applications of Distributed SystemsApplications of Distributed Systems
Applications of Distributed Systemssandra sukarieh
 
Stuart russell and peter norvig artificial intelligence - a modern approach...
Stuart russell and peter norvig   artificial intelligence - a modern approach...Stuart russell and peter norvig   artificial intelligence - a modern approach...
Stuart russell and peter norvig artificial intelligence - a modern approach...Lê Anh Đạt
 
Face recognition ppt
Face recognition pptFace recognition ppt
Face recognition pptSantosh Kumar
 
A content based movie recommender system for mobile application
A content based movie recommender system for mobile applicationA content based movie recommender system for mobile application
A content based movie recommender system for mobile applicationArafat X
 
Large-Scale Ads CTR Prediction with Spark and Deep Learning: Lessons Learned ...
Large-Scale Ads CTR Prediction with Spark and Deep Learning: Lessons Learned ...Large-Scale Ads CTR Prediction with Spark and Deep Learning: Lessons Learned ...
Large-Scale Ads CTR Prediction with Spark and Deep Learning: Lessons Learned ...Databricks
 
Probabilistic Information Retrieval
Probabilistic Information RetrievalProbabilistic Information Retrieval
Probabilistic Information RetrievalHarsh Thakkar
 
Lecture 26 local beam search
Lecture 26 local beam searchLecture 26 local beam search
Lecture 26 local beam searchHema Kashyap
 
Detecting Fake News Through NLP
Detecting Fake News Through NLPDetecting Fake News Through NLP
Detecting Fake News Through NLPSakha Global
 
Rule Based Architecture System
Rule Based Architecture SystemRule Based Architecture System
Rule Based Architecture SystemFirdaus Adib
 
HCI - Chapter 1
HCI - Chapter 1HCI - Chapter 1
HCI - Chapter 1Alan Dix
 
Artificial intelligence and knowledge representation
Artificial intelligence and knowledge representationArtificial intelligence and knowledge representation
Artificial intelligence and knowledge representationSajan Sahu
 
Speech Recognition Technology
Speech Recognition TechnologySpeech Recognition Technology
Speech Recognition TechnologySeminar Links
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningLior Rokach
 
Speech recognition project report
Speech recognition project reportSpeech recognition project report
Speech recognition project reportSarang Afle
 
Introduction of Artificial Intelligence
Introduction of Artificial IntelligenceIntroduction of Artificial Intelligence
Introduction of Artificial IntelligenceAkhileshwar Nirala
 

Was ist angesagt? (20)

Distribution transparency and Distributed transaction
Distribution transparency and Distributed transactionDistribution transparency and Distributed transaction
Distribution transparency and Distributed transaction
 
Virtualization (Distributed computing)
Virtualization (Distributed computing)Virtualization (Distributed computing)
Virtualization (Distributed computing)
 
Radial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and DhanashriRadial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and Dhanashri
 
Applications of Distributed Systems
Applications of Distributed SystemsApplications of Distributed Systems
Applications of Distributed Systems
 
Stuart russell and peter norvig artificial intelligence - a modern approach...
Stuart russell and peter norvig   artificial intelligence - a modern approach...Stuart russell and peter norvig   artificial intelligence - a modern approach...
Stuart russell and peter norvig artificial intelligence - a modern approach...
 
Face recognition ppt
Face recognition pptFace recognition ppt
Face recognition ppt
 
A content based movie recommender system for mobile application
A content based movie recommender system for mobile applicationA content based movie recommender system for mobile application
A content based movie recommender system for mobile application
 
Large-Scale Ads CTR Prediction with Spark and Deep Learning: Lessons Learned ...
Large-Scale Ads CTR Prediction with Spark and Deep Learning: Lessons Learned ...Large-Scale Ads CTR Prediction with Spark and Deep Learning: Lessons Learned ...
Large-Scale Ads CTR Prediction with Spark and Deep Learning: Lessons Learned ...
 
Interaction Paradigms
Interaction ParadigmsInteraction Paradigms
Interaction Paradigms
 
Probabilistic Information Retrieval
Probabilistic Information RetrievalProbabilistic Information Retrieval
Probabilistic Information Retrieval
 
Lecture 26 local beam search
Lecture 26 local beam searchLecture 26 local beam search
Lecture 26 local beam search
 
Detecting Fake News Through NLP
Detecting Fake News Through NLPDetecting Fake News Through NLP
Detecting Fake News Through NLP
 
Rule Based Architecture System
Rule Based Architecture SystemRule Based Architecture System
Rule Based Architecture System
 
HCI - Chapter 1
HCI - Chapter 1HCI - Chapter 1
HCI - Chapter 1
 
Neural network
Neural networkNeural network
Neural network
 
Artificial intelligence and knowledge representation
Artificial intelligence and knowledge representationArtificial intelligence and knowledge representation
Artificial intelligence and knowledge representation
 
Speech Recognition Technology
Speech Recognition TechnologySpeech Recognition Technology
Speech Recognition Technology
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Speech recognition project report
Speech recognition project reportSpeech recognition project report
Speech recognition project report
 
Introduction of Artificial Intelligence
Introduction of Artificial IntelligenceIntroduction of Artificial Intelligence
Introduction of Artificial Intelligence
 

Andere mochten auch

Libro2 personal planillas notas
Libro2 personal planillas notasLibro2 personal planillas notas
Libro2 personal planillas notascles12
 
Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...
Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...
Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...SOYEON KIM
 
Opinion Fraud Detection in Online Reviews by Network Effects
Opinion Fraud Detection in Online Reviews by Network EffectsOpinion Fraud Detection in Online Reviews by Network Effects
Opinion Fraud Detection in Online Reviews by Network EffectsSOYEON KIM
 
Survey on Credit Card Fraud Detection Using Different Data Mining Techniques
Survey on Credit Card Fraud Detection Using Different Data Mining TechniquesSurvey on Credit Card Fraud Detection Using Different Data Mining Techniques
Survey on Credit Card Fraud Detection Using Different Data Mining Techniquesijsrd.com
 
Common problems with email
Common problems with emailCommon problems with email
Common problems with emailTal Aviv
 
Credit card fraud detection methods using Data-mining.pptx (2)
Credit card fraud detection methods using Data-mining.pptx (2)Credit card fraud detection methods using Data-mining.pptx (2)
Credit card fraud detection methods using Data-mining.pptx (2)k.surya kumar
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detectionanthonytaylor01
 
Credit Card Fraud Detection System: A Survey
Credit Card Fraud Detection System: A SurveyCredit Card Fraud Detection System: A Survey
Credit Card Fraud Detection System: A SurveyIJMER
 
Fraud Detection presentation
Fraud Detection presentationFraud Detection presentation
Fraud Detection presentationHernan Huwyler
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detectionkalpesh1908
 
Presentation on fraud prevention, detection & control
Presentation on fraud prevention, detection & controlPresentation on fraud prevention, detection & control
Presentation on fraud prevention, detection & controlDominic Sroda Korkoryi
 

Andere mochten auch (12)

Libro2 personal planillas notas
Libro2 personal planillas notasLibro2 personal planillas notas
Libro2 personal planillas notas
 
Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...
Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...
Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...
 
Opinion Fraud Detection in Online Reviews by Network Effects
Opinion Fraud Detection in Online Reviews by Network EffectsOpinion Fraud Detection in Online Reviews by Network Effects
Opinion Fraud Detection in Online Reviews by Network Effects
 
Survey on Credit Card Fraud Detection Using Different Data Mining Techniques
Survey on Credit Card Fraud Detection Using Different Data Mining TechniquesSurvey on Credit Card Fraud Detection Using Different Data Mining Techniques
Survey on Credit Card Fraud Detection Using Different Data Mining Techniques
 
Common problems with email
Common problems with emailCommon problems with email
Common problems with email
 
Credit card fraud detection methods using Data-mining.pptx (2)
Credit card fraud detection methods using Data-mining.pptx (2)Credit card fraud detection methods using Data-mining.pptx (2)
Credit card fraud detection methods using Data-mining.pptx (2)
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
 
Credit Card Fraud Detection System: A Survey
Credit Card Fraud Detection System: A SurveyCredit Card Fraud Detection System: A Survey
Credit Card Fraud Detection System: A Survey
 
Fraud Detection presentation
Fraud Detection presentationFraud Detection presentation
Fraud Detection presentation
 
Fraud detection
Fraud detectionFraud detection
Fraud detection
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
 
Presentation on fraud prevention, detection & control
Presentation on fraud prevention, detection & controlPresentation on fraud prevention, detection & control
Presentation on fraud prevention, detection & control
 

Ähnlich wie Opinion spam and analysis

Opinion Mining and Opinion Spam Detection
Opinion Mining and Opinion Spam DetectionOpinion Mining and Opinion Spam Detection
Opinion Mining and Opinion Spam DetectionIRJET Journal
 
Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020
Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020
Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020Syeda Kauser Inamdar
 
Market and competitor research
Market and competitor researchMarket and competitor research
Market and competitor researchSohaib Thiab
 
Fraud Detection in Online Reviews using Machine Learning Techniques
Fraud Detection in Online Reviews using Machine Learning TechniquesFraud Detection in Online Reviews using Machine Learning Techniques
Fraud Detection in Online Reviews using Machine Learning Techniquesijceronline
 
How and When To Code Review
How and When To Code ReviewHow and When To Code Review
How and When To Code ReviewPaul Gower
 
Customer Content Marketing
Customer Content MarketingCustomer Content Marketing
Customer Content MarketingAaron Weiche
 
Control Your Online Reputation - MSP Social Media Breakfast
Control Your Online Reputation - MSP Social Media BreakfastControl Your Online Reputation - MSP Social Media Breakfast
Control Your Online Reputation - MSP Social Media BreakfastAaron Weiche
 
Opinion Driven Decision Support System
Opinion Driven Decision Support SystemOpinion Driven Decision Support System
Opinion Driven Decision Support SystemKavita Ganesan
 
Evaluating Brand Attributes in a Digital Environment
Evaluating Brand Attributes in a Digital EnvironmentEvaluating Brand Attributes in a Digital Environment
Evaluating Brand Attributes in a Digital EnvironmentFlexMR
 
Code Review: How And When
Code Review: How And WhenCode Review: How And When
Code Review: How And WhenPaul Gower
 
Social media metrics and ROI
Social media metrics and ROISocial media metrics and ROI
Social media metrics and ROIInsightAsia
 
The Journey from User Acquisition to Advocacy - CleverTap & AppFollow
The Journey from User Acquisition to Advocacy - CleverTap & AppFollowThe Journey from User Acquisition to Advocacy - CleverTap & AppFollow
The Journey from User Acquisition to Advocacy - CleverTap & AppFollowCleverTap
 
Collective Opinion Spam Detection Bridging Review Networks and Metadata
Collective Opinion Spam Detection Bridging Review Networks and MetadataCollective Opinion Spam Detection Bridging Review Networks and Metadata
Collective Opinion Spam Detection Bridging Review Networks and MetadataShebuti Rayana
 
1. Click here to retrieve the Risk Management Template. Working wi.docx
1. Click here to retrieve the Risk Management Template. Working wi.docx1. Click here to retrieve the Risk Management Template. Working wi.docx
1. Click here to retrieve the Risk Management Template. Working wi.docxjackiewalcutt
 
UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...
UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...
UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...UserZoom
 

Ähnlich wie Opinion spam and analysis (20)

Opinion Mining and Opinion Spam Detection
Opinion Mining and Opinion Spam DetectionOpinion Mining and Opinion Spam Detection
Opinion Mining and Opinion Spam Detection
 
Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020
Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020
Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020
 
Mahendra nath
Mahendra nathMahendra nath
Mahendra nath
 
Market and competitor research
Market and competitor researchMarket and competitor research
Market and competitor research
 
Fraud Detection in Online Reviews using Machine Learning Techniques
Fraud Detection in Online Reviews using Machine Learning TechniquesFraud Detection in Online Reviews using Machine Learning Techniques
Fraud Detection in Online Reviews using Machine Learning Techniques
 
Google: Spotting Fake Reviewer Groups
Google: Spotting Fake Reviewer GroupsGoogle: Spotting Fake Reviewer Groups
Google: Spotting Fake Reviewer Groups
 
Npd
NpdNpd
Npd
 
How and When To Code Review
How and When To Code ReviewHow and When To Code Review
How and When To Code Review
 
Customer Content Marketing
Customer Content MarketingCustomer Content Marketing
Customer Content Marketing
 
Control Your Online Reputation - MSP Social Media Breakfast
Control Your Online Reputation - MSP Social Media BreakfastControl Your Online Reputation - MSP Social Media Breakfast
Control Your Online Reputation - MSP Social Media Breakfast
 
Online hotel reviews: rating symbols or text... text of rating symbols? That ...
Online hotel reviews: rating symbols or text... text of rating symbols? That ...Online hotel reviews: rating symbols or text... text of rating symbols? That ...
Online hotel reviews: rating symbols or text... text of rating symbols? That ...
 
Opinion Driven Decision Support System
Opinion Driven Decision Support SystemOpinion Driven Decision Support System
Opinion Driven Decision Support System
 
Evaluating Brand Attributes in a Digital Environment
Evaluating Brand Attributes in a Digital EnvironmentEvaluating Brand Attributes in a Digital Environment
Evaluating Brand Attributes in a Digital Environment
 
Code Review: How And When
Code Review: How And WhenCode Review: How And When
Code Review: How And When
 
Social media metrics and ROI
Social media metrics and ROISocial media metrics and ROI
Social media metrics and ROI
 
The Journey from User Acquisition to Advocacy - CleverTap & AppFollow
The Journey from User Acquisition to Advocacy - CleverTap & AppFollowThe Journey from User Acquisition to Advocacy - CleverTap & AppFollow
The Journey from User Acquisition to Advocacy - CleverTap & AppFollow
 
Collective Opinion Spam Detection Bridging Review Networks and Metadata
Collective Opinion Spam Detection Bridging Review Networks and MetadataCollective Opinion Spam Detection Bridging Review Networks and Metadata
Collective Opinion Spam Detection Bridging Review Networks and Metadata
 
1. Click here to retrieve the Risk Management Template. Working wi.docx
1. Click here to retrieve the Risk Management Template. Working wi.docx1. Click here to retrieve the Risk Management Template. Working wi.docx
1. Click here to retrieve the Risk Management Template. Working wi.docx
 
GH_Final1.1
GH_Final1.1GH_Final1.1
GH_Final1.1
 
UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...
UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...
UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...
 

Mehr von SOYEON KIM

Network-based machine learning approach for aggregating multi-modal data
Network-based machine learning approach for aggregating multi-modal dataNetwork-based machine learning approach for aggregating multi-modal data
Network-based machine learning approach for aggregating multi-modal dataSOYEON KIM
 
Revealing disease-associated pathways by network integration of untargeted me...
Revealing disease-associated pathways by network integration of untargeted me...Revealing disease-associated pathways by network integration of untargeted me...
Revealing disease-associated pathways by network integration of untargeted me...SOYEON KIM
 
Systems genetics approaches to understand complex traits
Systems genetics approaches to understand complex traitsSystems genetics approaches to understand complex traits
Systems genetics approaches to understand complex traitsSOYEON KIM
 
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...SOYEON KIM
 
Network embedding
Network embeddingNetwork embedding
Network embeddingSOYEON KIM
 
Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...
Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...
Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...SOYEON KIM
 
Deep learning based multi-omics integration, a survey
Deep learning based multi-omics integration, a surveyDeep learning based multi-omics integration, a survey
Deep learning based multi-omics integration, a surveySOYEON KIM
 
DeepWalk: Online Learning of Social Representations
DeepWalk: Online Learning of Social RepresentationsDeepWalk: Online Learning of Social Representations
DeepWalk: Online Learning of Social RepresentationsSOYEON KIM
 
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
Convolutional Neural Networks on Graphs with Fast Localized Spectral FilteringConvolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
Convolutional Neural Networks on Graphs with Fast Localized Spectral FilteringSOYEON KIM
 
Visual-Textual Joint Relevance Learning for Tag-Based Social Image Search
Visual-Textual Joint Relevance Learning for Tag-Based Social Image SearchVisual-Textual Joint Relevance Learning for Tag-Based Social Image Search
Visual-Textual Joint Relevance Learning for Tag-Based Social Image SearchSOYEON KIM
 
Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...
Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...
Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...SOYEON KIM
 
A survey of heterogeneous information network analysis
A survey of heterogeneous information network analysisA survey of heterogeneous information network analysis
A survey of heterogeneous information network analysisSOYEON KIM
 
Translated learning
Translated learningTranslated learning
Translated learningSOYEON KIM
 
Self taught clustering
Self taught clusteringSelf taught clustering
Self taught clusteringSOYEON KIM
 
Semi-automatic ground truth generation using unsupervised clustering and limi...
Semi-automatic ground truth generation using unsupervised clustering and limi...Semi-automatic ground truth generation using unsupervised clustering and limi...
Semi-automatic ground truth generation using unsupervised clustering and limi...SOYEON KIM
 
Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...
Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...
Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...SOYEON KIM
 
Text extraction from natural scene image, a survey
Text extraction from natural scene image, a surveyText extraction from natural scene image, a survey
Text extraction from natural scene image, a surveySOYEON KIM
 
Evaluating color descriptors for object and scene recognition
Evaluating color descriptors for object and scene recognitionEvaluating color descriptors for object and scene recognition
Evaluating color descriptors for object and scene recognitionSOYEON KIM
 
Outcome-guided mutual information networks for investigating gene-gene intera...
Outcome-guided mutual information networks for investigating gene-gene intera...Outcome-guided mutual information networks for investigating gene-gene intera...
Outcome-guided mutual information networks for investigating gene-gene intera...SOYEON KIM
 
Spectral clustering
Spectral clusteringSpectral clustering
Spectral clusteringSOYEON KIM
 

Mehr von SOYEON KIM (20)

Network-based machine learning approach for aggregating multi-modal data
Network-based machine learning approach for aggregating multi-modal dataNetwork-based machine learning approach for aggregating multi-modal data
Network-based machine learning approach for aggregating multi-modal data
 
Revealing disease-associated pathways by network integration of untargeted me...
Revealing disease-associated pathways by network integration of untargeted me...Revealing disease-associated pathways by network integration of untargeted me...
Revealing disease-associated pathways by network integration of untargeted me...
 
Systems genetics approaches to understand complex traits
Systems genetics approaches to understand complex traitsSystems genetics approaches to understand complex traits
Systems genetics approaches to understand complex traits
 
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
 
Network embedding
Network embeddingNetwork embedding
Network embedding
 
Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...
Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...
Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...
 
Deep learning based multi-omics integration, a survey
Deep learning based multi-omics integration, a surveyDeep learning based multi-omics integration, a survey
Deep learning based multi-omics integration, a survey
 
DeepWalk: Online Learning of Social Representations
DeepWalk: Online Learning of Social RepresentationsDeepWalk: Online Learning of Social Representations
DeepWalk: Online Learning of Social Representations
 
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
Convolutional Neural Networks on Graphs with Fast Localized Spectral FilteringConvolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
 
Visual-Textual Joint Relevance Learning for Tag-Based Social Image Search
Visual-Textual Joint Relevance Learning for Tag-Based Social Image SearchVisual-Textual Joint Relevance Learning for Tag-Based Social Image Search
Visual-Textual Joint Relevance Learning for Tag-Based Social Image Search
 
Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...
Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...
Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...
 
A survey of heterogeneous information network analysis
A survey of heterogeneous information network analysisA survey of heterogeneous information network analysis
A survey of heterogeneous information network analysis
 
Translated learning
Translated learningTranslated learning
Translated learning
 
Self taught clustering
Self taught clusteringSelf taught clustering
Self taught clustering
 
Semi-automatic ground truth generation using unsupervised clustering and limi...
Semi-automatic ground truth generation using unsupervised clustering and limi...Semi-automatic ground truth generation using unsupervised clustering and limi...
Semi-automatic ground truth generation using unsupervised clustering and limi...
 
Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...
Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...
Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...
 
Text extraction from natural scene image, a survey
Text extraction from natural scene image, a surveyText extraction from natural scene image, a survey
Text extraction from natural scene image, a survey
 
Evaluating color descriptors for object and scene recognition
Evaluating color descriptors for object and scene recognitionEvaluating color descriptors for object and scene recognition
Evaluating color descriptors for object and scene recognition
 
Outcome-guided mutual information networks for investigating gene-gene intera...
Outcome-guided mutual information networks for investigating gene-gene intera...Outcome-guided mutual information networks for investigating gene-gene intera...
Outcome-guided mutual information networks for investigating gene-gene intera...
 
Spectral clustering
Spectral clusteringSpectral clustering
Spectral clustering
 

Kürzlich hochgeladen

毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 

Kürzlich hochgeladen (20)

毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 

Opinion spam and analysis

  • 1. Opinion Spam and Analysis Nitin Jindal and Bing Liu (WSDM, 2008) 1
  • 2. three types of spam reviews • Type 1 (untruthful opinions): also known as fake reviews or bogus reviews. 2 deliberately mislead readers or opinion mining systems 1.To promote some target objects (hype spam). 2. To damage the reputation of some other target objects (defaming spam).
  • 3. three types of spam reviews • Type 2 (reviews on brands only): not comment on the products for the products but only the brands, the manufacturers or the sellers. • Type 3 (non-reviews): two main sub-types: • advertisements • other irrelevant reviews containing no opinions (questions, answers ..) • They are not targeted at the specific products and are often biased. 3
  • 4. Review Data from amazon.com • Each amazon.com’s review consists of 8 parts • <Product ID> <Reviewer ID> <Rating> <Date> <Review Title> <Revie > 4 1 2 3 45 6 7 8
  • 5. 5
  • 6. Reviews, Reviewer, and Products 6 Figure 1. Log-log plot of number of reviews to number of members for amazon. Figure 2. Log-log plot of number of reviews to number of products for amazon. • There are 2 reviewers with more than 15,000 reviews. • 68% of reviewers wrote only 1 review. • 50% of products have only 1 review. • Only 19% of the products have at least 5 reviews.
  • 7. Review Ratings and Feedbacks 7 Figure 3. Log-log plot of number of reviews to number of feedbacks for amazon. Figure 4. Rating vs. percent of reviews • 60% of the reviews have a rating of 5.0. • On average, a review gets 7 feedbacks. • The percentage of positive feedbacks of a review decreases rapidly from the first review of a product to the last.
  • 8. Spam Detection • a classification problem with two classes, spam and non-spam. • manually label training examples for spam reviews of type 2 and type 3. • recognizing whether a review is an untruthful opinion spam (type 1) is extremely difficult by manually reading the review. • found a large number of duplicate and near-duplicate reviews. • 1. Duplicates from different userids on the same product. • 2. Duplicates from the same userid on different products. • 3. Duplicates from different userids on different products. 8
  • 9. Detection of Duplicate Reviews • shingle method (2-grams) • Jaccard distance (Similarity score) > 90%  duplicates. • The maximum similarity score : the maximum of similarity scores between different reviews of a reviewer. 9
  • 10. Detecting Type 2 & Type 3 • Model Building Using Logistic Regression • Manually labeled 40 spam reviews of the two types. • It produces a probability estimate of each review being a spam. 10
  • 11. Feature Identification and Construction • There are three main types of information • (1) the content of the review, • (2) the reviewer who wrote the review, • (3) the product being reviewed. • three types of features: • (1) review centric features, • (2) reviewer centric features, • (3) product centric features. • three types based on their average ratings • Good (rating ≥ 4), bad (rating ≤ 2.5) and Average, otherwise 11
  • 12. Review Centric Features • number of feedbacks(F1) • number of helpful feedbacks(F2) • percent of helpful feedbacks(F3) • length of the review title(F4) • length of review body(F5) • Position of the review of a product sorted by date. ascending (F6) and descending (F7) • the first review (F8) • the only review (F9) 12
  • 13. Review Centric Features -Textual features• percent of positive bearing words(F10) • percent of negative bearing words(F11) • cosine similarity (F12) • percent of times brand name (F13) • percent of numerals words(F14) • percent of capitals words(F15) • percent of all capital words(F16) 13 • rating of the review (F17) • the deviation from product rating (F18) • the review is good, average or bad (F19) • a bad review was written just after the first good review of the product and vice versa (F20, F21) Review Centric Features –Rating related features
  • 14. Reviewer Centric Features • Ratio of the number of reviews that the reviewer wrote which were the first reviews (F22) • ratio of the number of cases in which he/she was the only reviewer (F23) • average rating given by reviewer (F24) • standard deviation in rating (F25) • the reviewer always gave only good, average or bad rating (F26) • a reviewer gave both good and bad ratings (F27) • a reviewer gave good rating and average rating (F28) • a reviewer gave bad rating and average rating (F29) • a reviewer gave all three ratings (F30) • percent of times that a reviewer wrote a review with binary features F20 (F31) and F21 (F32). 14
  • 15. Product Centric Features • Price of the product (F33) • Sales rank of the product (F34) • Average rating (F35) • standard deviation in ratings (F36) 15
  • 16. Results of Type 2 and Type 3 Spam Detection • logistic regression on the data using 470 spam reviews for positive class and rest of the reviews for negative class. • 10-fold cross validation. 16
  • 17. Analysis of Type 1 Spam Reviews • 1. To promote some target objects (hype spam). • 2. To damage the reputation of some other target objects (defaming spam). 17
  • 18. Making Use of Duplicates • the same person writes the same review for different versions of the same product may not be spam. • We propose to treat all duplicate spam reviews as positive examples, and the rest of the reviews as negative examples. • We then use them to learn a model to identify non-duplicate reviews with similar characteristics, which are likely to be spam reviews. 18
  • 19. Model Building Using Duplicates • building the logistic regression model using duplicates and non-duplicates is not for detecting duplicate spam • Our real purpose is to use the model to identify type 1 spam reviews that are not duplicated. • check whether it can predict outlier reviews 19
  • 20. • Outlier reviews : whose ratings deviate from the average product rating a great deal. • Sentiment classification techniques may be used to automatically assign a rating to a review solely based on its review content. 20
  • 21. Predicting Outlier Reviews • Negative deviation is considered as less than -1 from the mean rating and positive deviation as larger than +1 from the mean rating of the product. • Spammers may not want their review ratings to deviate too much from the norm to make the reviews too suspicious. 21
  • 22. 22 X% of reviews (the test data) cumulated percentage of reviews of the current bin
  • 23. Some Other Interesting Reviews - Only Reviews • We did not use any position features of reviews (F6, F7, F8, F9, F20 and F21) and number of reviews of product (F12, F18, F35, and F36) related features in model building to prevent overfitting. • only reviews are very likely to be candidates of spam 23
  • 24. - Reviews from Top-Ranked Reviewers • Top-ranked reviewers generally write a large number of reviews, much more than bottom ranked reviewers. • Top ranked reviewers also score high on some important indicators of spam reviews.  top ranked reviewers are less trustworthy as compared to bottom ranked reviewers. 24
  • 25. - Reviews with Different Levels of Feedbacks • If usefulness of a review is defined based on the feedbacks that the review gets, it means that people can be readily fooled by a spam review. feedback spam 25
  • 26. - Reviews of Products with Varied Sales Ranks • Spam activities are more limited to low selling products.  difficult to damage reputation of a high selling or popular product by writing a spam review. 26
  • 27. Conclusions & Future work • Results showed that the logistic regression model is highly effective. • It is very hard to manually label training examples for type 1 spam. • We will further improve the detection methods, and also look into spam in other kinds of media, e.g., forums and blogs. 27

Hinweis der Redaktion

  1. We consider them as spam because they are not targeted at the specific products and are often biased.
  2. &amp;lt;number&amp;gt;
  3. We can use the probability to weight each review. Since the probability reflects the likelihood that a review is a spam, those reviews with high probabilities can be weighted down to reduce their effects on opinion mining.
  4. 1.3.5 write by owners or manufacturers. 2.4.6 write by competititors 1.4 are not damaging, 2.3.5.6 are very harmful. spam detection techniques should focus on identifying reviews in these regions.
  5. Since the number of such duplicate spam reviews is quite large, it is reasonable to assume that they are a fairly good and random sample of many kinds of spam.
  6. To further confirm its predictability, we want to show that it is also predictive of reviews that are more likely to be spam and are not duplicated, i.e., outlier reviews. They are more likely to be spam reviews than average reviews because high rating deviation is a necessary condition for a harmful spam review but not sufficient because some reviewers may truly have different views from others.