Making the Most of Tweet-Inherent Features for
Social Spam Detection on Twitter
Bo Wang, Arkaitz Zubiaga, Maria Liakata and Rob Procter
Department of Computer Science
University of Warwick
18th May 2015
Social Spam on Twitter
Motivation
• Social spam is an important issue in social media services such as Twitter, e.g.:
  • Users inject tweets in trending topics.
  • Users reply with promotional messages providing a link.
• We want to be able to identify these spam tweets in a Twitter stream.
Social Spam on Twitter
How Did We Feel the Need to Identify Spam?
• We started tracking events via the streaming API.
• The collected streams were often riddled with noisy tweets.
Social Spam on Twitter
Our Approach
• Detection of spammers: unsuitable, because we couldn’t aggregate a user’s data from a stream.
• Alternative solution: determine whether a tweet is spam from its inherent features.
Social Spam on Twitter
Definitions
• The term spam was originally coined for unsolicited email.
• How to define spam for Twitter? (not easy!)
• Twitter has its own definition of spam, under which a certain level of advertising is allowed:
  • It refers to the user level rather than the tweet level, e.g., users who massively follow others.
• Harder to define spam than a spammer.
Social Spam on Twitter
Our Definition
• Twitter spam: noisy content produced by users who
express a different behaviour from what the system is
intended for, and has the goal of grabbing attention by
exploiting the social media service’s characteristics.
Spammer vs. Spam Detection
What Did Others Do?
• Most previous work focused on spammer detection (users).
• They used features which are not readily available in a tweet:
  • For example, historical user behaviour and network features.
• Not feasible for our use.
Spammer vs. Spam Detection
What Do We Want To Do Instead?
• (Near) real-time spam detection, limited to features readily available in a stream of tweets.
• Contributions:
  • Test on two existing datasets, adapted to our purposes.
  • Definition of different feature sets.
  • Compare different classification algorithms.
  • Investigate the use of different tweet-inherent features.
Datasets
• We relied on two (spammer vs non-spammer) datasets:
  • Social Honeypot (Lee et al., 2011 [1]): used social honeypots to attract spammers.
  • 1KS-10KN (Yang et al., 2011 [2]): harvested tweets containing certain malicious URLs.
• Converting the spammer datasets into our spam datasets: randomly select one tweet from each spammer or legitimate user.
  • Social Honeypot: 20,707 spam vs 19,249 non-spam (∼1:1).
  • 1KS-10KN: 1,000 spam vs 9,828 non-spam (∼1:10).
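The conversion step above can be sketched as follows. This is a minimal illustration, not the authors' code: the `(user_id, text, user_label)` record layout, the function name and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict

def one_tweet_per_user(tweets, seed=0):
    """Pick one random tweet per user; the tweet inherits the user's label.

    `tweets` is an iterable of (user_id, text, user_label) tuples.
    This record layout is hypothetical, chosen for illustration only.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user_id, text, label in tweets:
        by_user[user_id].append((text, label))
    # Sort the users so the result is reproducible for a given seed.
    return [(uid,) + rng.choice(by_user[uid]) for uid in sorted(by_user)]

tweets = [
    ("u1", "Win a free phone http://example.com", "spam"),
    ("u1", "Click here for followers", "spam"),
    ("u2", "Heading to the conference today", "non-spam"),
]
sample = one_tweet_per_user(tweets)
```

Each user contributes exactly one labelled tweet, so the class ratio of the tweet dataset follows the ratio of spammers to legitimate users.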
Feature Engineering
User features
• Length of profile name
• Length of profile description
• Number of followings (FI)
• Number of followers (FE)
• Number of tweets posted
• Age of the user account, in hours (AU)
• Ratio of followers to followings (FE/FI)
• Reputation of the user (FE/(FI + FE))
• Following rate (FI/AU)
• Number of tweets posted per day
• Number of tweets posted per week
Content features
• Number of words
• Number of characters
• Number of white spaces
• Number of capitalised words
• Number of capitalised words per word
• Maximum word length
• Mean word length
• Number of exclamation marks
• Number of question marks
• Number of URL links
• Number of URL links per word
• Number of hashtags
• Number of hashtags per word
• Number of mentions
• Number of mentions per word
• Number of spam words
• Number of spam words per word
• Part-of-speech tags of every tweet
N-grams
• Uni + bi-gram or bi + tri-gram
Sentiment features
• Automatically created sentiment lexicons
• Manually created sentiment lexicons
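A few of the user and content features listed above could be computed roughly as below. This is a sketch under stated assumptions: the slides do not give the spam-word lexicon or the tokenisation used, so `SPAM_WORDS`, whitespace tokenisation and the function names are placeholders.

```python
import re

# Hypothetical spam-word list; the lexicon used in the study is not given.
SPAM_WORDS = {"free", "win", "click"}

def user_features(followers, followings, account_age_hours):
    """Ratio-style user features from counts carried with each tweet."""
    fe, fi, au = followers, followings, account_age_hours
    return {
        "fe_fi_ratio": fe / fi if fi else 0.0,              # FE/FI
        "reputation": fe / (fi + fe) if fi + fe else 0.0,   # FE/(FI + FE)
        "following_rate": fi / au if au else 0.0,           # FI/AU
    }

def content_features(text):
    """Count-style content features using simple whitespace tokenisation."""
    words = text.split()
    n = len(words) or 1  # avoid dividing by zero on empty tweets
    urls = len(re.findall(r"https?://\S+", text))
    spam = sum(w.strip(".,!?").lower() in SPAM_WORDS for w in words)
    return {
        "num_words": len(words),
        "num_chars": len(text),
        "num_exclamations": text.count("!"),
        "num_urls": urls,
        "urls_per_word": urls / n,
        "num_hashtags": sum(w.startswith("#") for w in words),
        "num_mentions": sum(w.startswith("@") for w in words),
        "num_spam_words": spam,
        "max_word_len": max((len(w) for w in words), default=0),
    }

u = user_features(followers=50, followings=150, account_age_hours=300)
c = content_features("WIN a free phone! http://example.com #deal @you")
```

All inputs here are available in a single tweet object from the stream, which is the constraint the approach sets for itself.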
Evaluation
Experiment Settings
• 5 widely-used classification algorithms: Bernoulli Naive
Bayes, KNN, SVM, Decision Tree and Random Forests.
• Hyperparameters optimised on a subset of the dataset kept separate from the train/test sets.
• All 4 feature sets were combined.
• 10-fold cross-validation.
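For readers unfamiliar with 10-fold cross-validation, the splitting scheme can be sketched as below. The slides do not say which library was used or whether the folds were stratified by class, so this plain shuffled split and the name `k_fold_indices` are assumptions.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every item is tested exactly once."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of near-equal size
    for i, test in enumerate(folds):
        # Training set is every fold except the one held out for testing.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(k_fold_indices(25, k=10))
```

A classifier is trained on each `train` split and scored on the matching `test` split, and the per-fold scores are then averaged.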
Evaluation
Selection of Classifier
Classifier | 1KS-10KN Precision | 1KS-10KN Recall | 1KS-10KN F-measure | Social Honeypot Precision | Social Honeypot Recall | Social Honeypot F-measure
Bernoulli NB | 0.899 | 0.688 | 0.778 | 0.772 | 0.806 | 0.789
KNN | 0.924 | 0.706 | 0.798 | 0.802 | 0.778 | 0.790
SVM | 0.872 | 0.708 | 0.780 | 0.844 | 0.817 | 0.830
Decision Tree | 0.788 | 0.782 | 0.784 | 0.914 | 0.916 | 0.915
Random Forest | 0.993 | 0.716 | 0.831 | 0.941 | 0.950 | 0.946
• Random Forests outperform others in terms of
F1-measure and Precision.
• Better performance on Social Honeypot (1:1 ratio rather
than 1:10?).
• Results only 4% below original papers, which require
historic user features.
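The three reported metrics can be computed from predictions as in the sketch below, which assumes spam is treated as the positive class (the slides do not state this explicitly). The labels and the function name are illustrative.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = ["spam", "spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "spam"]
p, r, f = precision_recall_f1(y_true, y_pred)
```

High precision with lower recall, as on 1KS-10KN, means few legitimate tweets are flagged but a share of spam tweets goes undetected.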
Evaluation
Evaluation of Features (w/ Random Forests)
Feature Set | 1KS-10KN Precision | 1KS-10KN Recall | 1KS-10KN F-measure | Social Honeypot Precision | Social Honeypot Recall | Social Honeypot F-measure
User features (U) | 0.895 | 0.709 | 0.791 | 0.938 | 0.940 | 0.940
Content features (C) | 0.951 | 0.657 | 0.776 | 0.771 | 0.753 | 0.762
Uni + Bi-gram (Binary) | 0.930 | 0.725 | 0.815 | 0.759 | 0.727 | 0.743
Uni + Bi-gram (Tf) | 0.959 | 0.715 | 0.819 | 0.783 | 0.767 | 0.775
Uni + Bi-gram (Tfidf) | 0.943 | 0.726 | 0.820 | 0.784 | 0.765 | 0.775
Bi + Tri-gram (Tfidf) | 0.931 | 0.684 | 0.788 | 0.797 | 0.656 | 0.720
Sentiment features (S) | 0.966 | 0.574 | 0.718 | 0.679 | 0.727 | 0.702
• Testing feature sets one by one:
  • User features (U) are the most discriminative for Social Honeypot.
  • N-gram features are best for 1KS-10KN.
  • Potentially due to the different dataset generation approaches?
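The n-gram rows in the table differ in n-gram range and in weighting (Binary, Tf, Tfidf). A minimal sketch of the three weightings is given below. The plain log(N/df) idf is one of several common variants and is an assumption here, as are the function names and the toy documents.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """All n-grams with n_min <= n <= n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def weight(docs, scheme="tf"):
    """Map each tokenised tweet to {ngram: weight}: binary, tf or tfidf."""
    counts = [Counter(ngrams(d)) for d in docs]
    df = Counter(g for c in counts for g in c)  # document frequency
    n_docs = len(docs)
    out = []
    for c in counts:
        if scheme == "binary":
            out.append({g: 1 for g in c})
        elif scheme == "tf":
            out.append(dict(c))
        elif scheme == "tfidf":
            out.append({g: tf * math.log(n_docs / df[g])
                        for g, tf in c.items()})
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return out

docs = [["free", "phone", "free"], ["free", "lunch"]]
tf = weight(docs, "tf")
tfidf = weight(docs, "tfidf")
```

With this idf variant, an n-gram that occurs in every tweet gets weight zero, while rarer n-grams are weighted up.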
Evaluation
Evaluation of Features (w/ Random Forests)
Feature Set | 1KS-10KN Precision | 1KS-10KN Recall | 1KS-10KN F-measure | Social Honeypot Precision | Social Honeypot Recall | Social Honeypot F-measure
Single feature set | 0.943 | 0.726 | 0.820 | 0.938 | 0.940 | 0.940
U + C | 0.974 | 0.708 | 0.819 | 0.938 | 0.949 | 0.943
U + Bi & Tri-gram (Tf) | 0.972 | 0.745 | 0.843 | 0.937 | 0.949 | 0.943
U + S | 0.948 | 0.732 | 0.825 | 0.940 | 0.944 | 0.942
Uni & Bi-gram (Tf) + S | 0.964 | 0.721 | 0.824 | 0.797 | 0.744 | 0.770
C + S | 0.970 | 0.649 | 0.777 | 0.778 | 0.762 | 0.770
C + Uni & Bi-gram (Tf) | 0.968 | 0.717 | 0.823 | 0.783 | 0.757 | 0.770
U + C + Uni & Bi-gram (Tf) | 0.985 | 0.727 | 0.835 | 0.934 | 0.949 | 0.941
U + C + S | 0.982 | 0.704 | 0.819 | 0.937 | 0.948 | 0.942
U + Uni & Bi-gram (Tf) + S | 0.994 | 0.720 | 0.834 | 0.928 | 0.946 | 0.937
C + Uni & Bi-gram (Tf) + S | 0.966 | 0.720 | 0.824 | 0.806 | 0.758 | 0.782
U + C + Uni & Bi-gram (Tf) + S | 0.988 | 0.725 | 0.835 | 0.936 | 0.947 | 0.942
• However, when we combine feature sets:
  • The same approach performs best (F1) for both datasets: U + Bi & Tri-gram (Tf).
  • Combining features helps us capture different types of spam tweets.
Evaluation
Computational Efficiency
• Beyond accuracy, how can all these features be applied
efficiently in a stream?
Feature set | Computation time (seconds) for 1k tweets
User features | 0.0057
N-gram | 0.3965
Sentiment features | 20.9838
Number of spam words (NSW) | 19.0111
Part-of-speech counts (POS) | 0.6139
Content features including NSW and POS | 20.2367
Content features without NSW | 1.0448
Content features without POS | 19.6165
• Tested on a regular computer (2.8 GHz Intel Core i7 processor and 16 GB memory).
• The features that performed best in combination (User
and N-grams) are those most efficiently calculated.
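A per-1k-tweets cost like the ones in the table can be measured with a simple wall-clock timer. The slides do not describe how the timings were taken, so the helper below and its toy extractor are only an illustrative sketch.

```python
import time

def seconds_per_1k(extract, tweets):
    """Wall-clock seconds `extract` needs per 1,000 tweets."""
    start = time.perf_counter()
    for t in tweets:
        extract(t)
    elapsed = time.perf_counter() - start
    # Normalise to a batch of 1,000 tweets.
    return elapsed * 1000 / len(tweets)

# Toy extractor (word count) standing in for a real feature function.
tweets = ["Win a free phone! http://example.com #deal"] * 2000
cost = seconds_per_1k(lambda t: len(t.split()), tweets)
```

Comparing this figure against the arrival rate of the stream indicates whether a feature set can keep up in (near) real time.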
Conclusion
• Random Forests were found to be the most accurate
classifier.
• Comparable performance to previous work (-4%) while
limiting features to those in a tweet.
• The use of multiple feature sets increases the possibility
to capture different spam types, and makes it more
difficult for spammers to evade.
• Different feature sets perform best on each dataset when used separately, but the same combination performs best on both datasets.
Future Work
• Our spam corpus was constructed by picking tweets from spammers.
  • We need to study whether legitimate users are also likely to post spam tweets, and how that could affect the results.
• A more recent, manually labelled spam/non-spam
dataset.
• Feasibility of cross-dataset spam classification?
That’s it!
• Any Questions?
References
[1] K. Lee, B. D. Eoff, and J. Caverlee. Seven months with the devils: A long-term study of content polluters on Twitter. In L. A. Adamic, R. A. Baeza-Yates, and S. Counts, editors, ICWSM. The AAAI Press, 2011.
[2] C. Yang, R. C. Harkreader, and G. Gu. Die free or live hard? Empirical evaluation and new design for fighting evolving Twitter spammers. In Proceedings of the 14th International Conference on Recent Advances in Intrusion Detection, RAID’11, pages 318–337, Berlin, Heidelberg, 2011. Springer-Verlag.

More Related Content

What's hot

Spammer detection and fake user Identification on Social Networks
Spammer detection and fake user Identification on Social NetworksSpammer detection and fake user Identification on Social Networks
Spammer detection and fake user Identification on Social NetworksJAYAPRAKASH JPINFOTECH
 
Hr salary prediction using ml
Hr salary prediction using mlHr salary prediction using ml
Hr salary prediction using mlshaiksafi1
 
SSII2021 [SS2] Deepfake Generation and Detection – An Overview (ディープフェイクの生成と検出)
SSII2021 [SS2] Deepfake Generation and Detection – An Overview (ディープフェイクの生成と検出)SSII2021 [SS2] Deepfake Generation and Detection – An Overview (ディープフェイクの生成と検出)
SSII2021 [SS2] Deepfake Generation and Detection – An Overview (ディープフェイクの生成と検出)SSII
 
What is Machine Learning | Introduction to Machine Learning | Machine Learnin...
What is Machine Learning | Introduction to Machine Learning | Machine Learnin...What is Machine Learning | Introduction to Machine Learning | Machine Learnin...
What is Machine Learning | Introduction to Machine Learning | Machine Learnin...Simplilearn
 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methodsProf.Nilesh Magar
 
Social media data analytics
Social media data analyticsSocial media data analytics
Social media data analyticsAujaswiAgarwal1
 
Graph Neural Networks for Recommendations
Graph Neural Networks for RecommendationsGraph Neural Networks for Recommendations
Graph Neural Networks for RecommendationsWQ Fan
 
Face recognition ppt
Face recognition pptFace recognition ppt
Face recognition pptSantosh Kumar
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyChris Johnson
 
Fake news detection project
Fake news detection projectFake news detection project
Fake news detection projectHarshdaGhai
 
sentiment analysis text extraction from social media
sentiment  analysis text extraction from social media sentiment  analysis text extraction from social media
sentiment analysis text extraction from social media Ravindra Chaudhary
 
Phishing Detection using Machine Learning
Phishing Detection using Machine LearningPhishing Detection using Machine Learning
Phishing Detection using Machine LearningArjun BM
 
Monitoring Dual Stack IPv4/IPv6 Networks
Monitoring Dual Stack IPv4/IPv6 NetworksMonitoring Dual Stack IPv4/IPv6 Networks
Monitoring Dual Stack IPv4/IPv6 NetworksVance Shipley
 
Metrics, Engagement & Personalization
Metrics, Engagement & Personalization Metrics, Engagement & Personalization
Metrics, Engagement & Personalization Mounia Lalmas-Roelleke
 
Social Impacts & Trends of Data Mining
Social Impacts & Trends of Data MiningSocial Impacts & Trends of Data Mining
Social Impacts & Trends of Data MiningSushilDhakal4
 

What's hot (20)

Spammer detection and fake user Identification on Social Networks
Spammer detection and fake user Identification on Social NetworksSpammer detection and fake user Identification on Social Networks
Spammer detection and fake user Identification on Social Networks
 
Email spam detection
Email spam detectionEmail spam detection
Email spam detection
 
Hr salary prediction using ml
Hr salary prediction using mlHr salary prediction using ml
Hr salary prediction using ml
 
SSII2021 [SS2] Deepfake Generation and Detection – An Overview (ディープフェイクの生成と検出)
SSII2021 [SS2] Deepfake Generation and Detection – An Overview (ディープフェイクの生成と検出)SSII2021 [SS2] Deepfake Generation and Detection – An Overview (ディープフェイクの生成と検出)
SSII2021 [SS2] Deepfake Generation and Detection – An Overview (ディープフェイクの生成と検出)
 
What is Machine Learning | Introduction to Machine Learning | Machine Learnin...
What is Machine Learning | Introduction to Machine Learning | Machine Learnin...What is Machine Learning | Introduction to Machine Learning | Machine Learnin...
What is Machine Learning | Introduction to Machine Learning | Machine Learnin...
 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methods
 
Social media data analytics
Social media data analyticsSocial media data analytics
Social media data analytics
 
Captcha
CaptchaCaptcha
Captcha
 
Graph Neural Networks for Recommendations
Graph Neural Networks for RecommendationsGraph Neural Networks for Recommendations
Graph Neural Networks for Recommendations
 
Social Media Sentiment Analysis
Social Media Sentiment AnalysisSocial Media Sentiment Analysis
Social Media Sentiment Analysis
 
Sms spam classification
Sms spam classificationSms spam classification
Sms spam classification
 
Face recognition ppt
Face recognition pptFace recognition ppt
Face recognition ppt
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover Weekly
 
Fake news detection project
Fake news detection projectFake news detection project
Fake news detection project
 
sentiment analysis text extraction from social media
sentiment  analysis text extraction from social media sentiment  analysis text extraction from social media
sentiment analysis text extraction from social media
 
Phishing Detection using Machine Learning
Phishing Detection using Machine LearningPhishing Detection using Machine Learning
Phishing Detection using Machine Learning
 
Monitoring Dual Stack IPv4/IPv6 Networks
Monitoring Dual Stack IPv4/IPv6 NetworksMonitoring Dual Stack IPv4/IPv6 Networks
Monitoring Dual Stack IPv4/IPv6 Networks
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
 
Metrics, Engagement & Personalization
Metrics, Engagement & Personalization Metrics, Engagement & Personalization
Metrics, Engagement & Personalization
 
Social Impacts & Trends of Data Mining
Social Impacts & Trends of Data MiningSocial Impacts & Trends of Data Mining
Social Impacts & Trends of Data Mining
 

Viewers also liked

Detecting Spammers on Social Networks
Detecting Spammers on Social NetworksDetecting Spammers on Social Networks
Detecting Spammers on Social NetworksGianluca Stringhini
 
Twitter Content-based Spam Filtering - CISIS 2013
Twitter Content-based Spam Filtering - CISIS 2013Twitter Content-based Spam Filtering - CISIS 2013
Twitter Content-based Spam Filtering - CISIS 2013Carlos Laorden
 
12 ways trending twitter topics and hashtags may not be working for you
12 ways trending twitter topics and hashtags may not be working for you12 ways trending twitter topics and hashtags may not be working for you
12 ways trending twitter topics and hashtags may not be working for youOnline Promotion Success, Inc.
 
Graph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS DetectionGraph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS DetectionSOYEON KIM
 
E Mail & Spam Presentation
E Mail & Spam PresentationE Mail & Spam Presentation
E Mail & Spam Presentationnewsan2001
 
Enhancing Twitter spam discovery using cross account pattern matching.
Enhancing Twitter spam discovery using cross account pattern matching.Enhancing Twitter spam discovery using cross account pattern matching.
Enhancing Twitter spam discovery using cross account pattern matching.Ambarish Pande
 

Viewers also liked (11)

Spam Filtering
Spam FilteringSpam Filtering
Spam Filtering
 
Detecting Spammers on Social Networks
Detecting Spammers on Social NetworksDetecting Spammers on Social Networks
Detecting Spammers on Social Networks
 
Spam, security
Spam, securitySpam, security
Spam, security
 
Twitter Spam
Twitter SpamTwitter Spam
Twitter Spam
 
Twitter Content-based Spam Filtering - CISIS 2013
Twitter Content-based Spam Filtering - CISIS 2013Twitter Content-based Spam Filtering - CISIS 2013
Twitter Content-based Spam Filtering - CISIS 2013
 
12 ways trending twitter topics and hashtags may not be working for you
12 ways trending twitter topics and hashtags may not be working for you12 ways trending twitter topics and hashtags may not be working for you
12 ways trending twitter topics and hashtags may not be working for you
 
Graph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS DetectionGraph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS Detection
 
Bulk sms
Bulk smsBulk sms
Bulk sms
 
E Mail & Spam Presentation
E Mail & Spam PresentationE Mail & Spam Presentation
E Mail & Spam Presentation
 
Enhancing Twitter spam discovery using cross account pattern matching.
Enhancing Twitter spam discovery using cross account pattern matching.Enhancing Twitter spam discovery using cross account pattern matching.
Enhancing Twitter spam discovery using cross account pattern matching.
 
Spam
SpamSpam
Spam
 

Similar to Microposts2015 - Social Spam Detection on Twitter

SplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP InteractiveSplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP InteractiveSplunk
 
Semantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media PostsSemantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media PostsGiulio Carducci
 
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...LINE Corp.
 
Scalable Topic-Specific Influence Analysis on Microblogs
Scalable Topic-Specific Influence Analysis on MicroblogsScalable Topic-Specific Influence Analysis on Microblogs
Scalable Topic-Specific Influence Analysis on MicroblogsYuanyuan Tian
 
A flexible recommenndation system for Cable TV
A flexible recommenndation system for Cable TVA flexible recommenndation system for Cable TV
A flexible recommenndation system for Cable TVIntoTheMinds
 
A Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TVA Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TVFrancisco Couto
 
Feature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon ReviewsFeature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon ReviewsRavi Kiran Holur Vijay
 
Recommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmRecommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmVaibhav Varshney
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning SystemsAnuj Gupta
 
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...yeung2000
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation SystemsRobin Reni
 
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...Jeffrey Nichols
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsMarina Santini
 
A Two Step Ranking Solution for Twitter User Engagement
A Two Step Ranking Solution for Twitter User Engagement�A Two Step Ranking Solution for Twitter User Engagement�
A Two Step Ranking Solution for Twitter User EngagementBehnoush Abdollahi
 
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...Pete Burnap
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsYalçın Yenigün
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringTao Xie
 
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)Amazon Web Services
 
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...Tuan Hoang
 

Similar to Microposts2015 - Social Spam Detection on Twitter (20)

SplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP InteractiveSplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP Interactive
 
Semantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media PostsSemantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media Posts
 
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
 
Scalable Topic-Specific Influence Analysis on Microblogs
Scalable Topic-Specific Influence Analysis on MicroblogsScalable Topic-Specific Influence Analysis on Microblogs
Scalable Topic-Specific Influence Analysis on Microblogs
 
Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection
Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection
Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection
 
A flexible recommenndation system for Cable TV
A flexible recommenndation system for Cable TVA flexible recommenndation system for Cable TV
A flexible recommenndation system for Cable TV
 
A Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TVA Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TV
 
Feature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon ReviewsFeature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon Reviews
 
Recommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmRecommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic Algorithm
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning Systems
 
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation Systems
 
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 
A Two Step Ranking Solution for Twitter User Engagement
A Two Step Ranking Solution for Twitter User Engagement�A Two Step Ranking Solution for Twitter User Engagement�
A Two Step Ranking Solution for Twitter User Engagement
 
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning Applications
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
 
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
 
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
 

More from azubiaga

Exploiting context for rumour detection in social media
Exploiting context for rumour detection in social mediaExploiting context for rumour detection in social media
Exploiting context for rumour detection in social mediaazubiaga
 
Crowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social MediaCrowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social Mediaazubiaga
 
Curating and Contextualizing Twitter Stories to Assist with Social Newsgathering
Curating and Contextualizing Twitter Stories to Assist with Social NewsgatheringCurating and Contextualizing Twitter Stories to Assist with Social Newsgathering
Curating and Contextualizing Twitter Stories to Assist with Social Newsgatheringazubiaga
 
Mining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information DiscoveryMining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information Discoveryazubiaga
 
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...azubiaga
 
Harnessing Folksonomies for Resource Classification
Harnessing Folksonomies for Resource ClassificationHarnessing Folksonomies for Resource Classification
Harnessing Folksonomies for Resource Classificationazubiaga
 
Clasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones SocialesClasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones Socialesazubiaga
 
Content-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud VisualizationContent-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud Visualizationazubiaga
 
Getting the Most Out of Social Annotations for Web Page Classification
Getting the Most Out of Social Annotations for Web Page ClassificationGetting the Most Out of Social Annotations for Web Page Classification
Getting the Most Out of Social Annotations for Web Page Classificationazubiaga
 
Enhancing Navigation on Wikipedia with Social Tags
Enhancing Navigation on Wikipedia with Social TagsEnhancing Navigation on Wikipedia with Social Tags
Enhancing Navigation on Wikipedia with Social Tagsazubiaga
 
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?azubiaga
 
Etiketa-lainoen ikuskera hobetzeko multzokatzea
Etiketa-lainoen ikuskera hobetzeko multzokatzeaEtiketa-lainoen ikuskera hobetzeko multzokatzea
Etiketa-lainoen ikuskera hobetzeko multzokatzeaazubiaga
 
Master thesis presentation
Master thesis presentationMaster thesis presentation
Master thesis presentationazubiaga
 
Tags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social ClassificationTags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social Classificationazubiaga
 

Microposts2015 - Social Spam Detection on Twitter

  • 1. Making the Most of Tweet-Inherent Features for Social Spam Detection on Twitter Bo Wang, Arkaitz Zubiaga, Maria Liakata and Rob Procter Department of Computer Science University of Warwick 18th May 2015
  • 2. Social Spam on Twitter: Motivation
      • Social spam is an important issue in social media services such as Twitter, e.g.:
          • Users inject tweets in trending topics.
          • Users reply with promotional messages providing a link.
      • We want to be able to identify these spam tweets in a Twitter stream.
  • 3. Social Spam on Twitter: Why Did We Need to Identify Spam?
      • We started tracking events via the streaming API.
      • The resulting streams were often riddled with noisy tweets.
  • 4. Social Spam on Twitter: Example
  • 5. Social Spam on Twitter: Our Approach
      • Detection of spammers: unsuitable, since we could not aggregate a user’s data from a stream.
      • Alternative solution: determine whether a tweet is spam from its inherent features.
  • 6. Social Spam on Twitter: Definitions
      • The term "spam" was originally coined for unsolicited email.
      • How to define spam for Twitter? (not easy!)
      • Twitter has its own definition of spam, where a certain level of advertising is allowed:
          • It refers to the user level rather than the tweet level, e.g., users who massively follow others.
      • It is harder to define spam than a spammer.
  • 7. Social Spam on Twitter: Our Definition
      • Twitter spam: noisy content that is produced by users who behave differently from what the system is intended for, and that has the goal of grabbing attention by exploiting the social media service’s characteristics.
  • 8. Spammer vs. Spam Detection: What Did Others Do?
      • Most previous work focused on spammer detection (users).
      • They used features which are not readily available in a tweet:
          • For example, historical user behaviour and network features.
      • Not feasible for our use.
  • 9. Spammer vs. Spam Detection: What Do We Want To Do Instead?
      • (Near) real-time spam detection, limited to features readily available in a stream of tweets.
      • Contributions:
          • Test on two existing datasets, adapted to our purposes.
          • Definition of different feature sets.
          • Compare different classification algorithms.
          • Investigate the use of different tweet-inherent features.
  • 10. Datasets
      • We relied on two (spammer vs non-spammer) datasets:
          • Social Honeypot (Lee et al., 2011 [1]): used social honeypots to attract spammers.
          • 1KS-10KN (Yang et al., 2011 [2]): harvested tweets containing certain malicious URLs.
      • From spammer datasets to our spam dataset: randomly select one tweet from each spammer or legitimate user.
          • Social Honeypot: 20,707 spam vs 19,249 non-spam (∼1:1).
          • 1KS-10KN: 1,000 spam vs 9,828 non-spam (∼1:10).
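The conversion step on the Datasets slide (one randomly chosen tweet per user, inheriting that user's label) can be sketched as follows. This is a minimal illustration, not the authors' code: the `(user_id, text, label)` tuple layout, the function name `one_tweet_per_user`, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def one_tweet_per_user(tweets, seed=0):
    """Pick one random tweet per user.

    `tweets` is an iterable of (user_id, text, label) tuples, where `label`
    is the user-level label (hypothetical layout, assumed for this sketch).
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    by_user = defaultdict(list)
    for user_id, text, label in tweets:
        by_user[user_id].append((text, label))

    sample = []
    for user_id in sorted(by_user):  # sorted for a deterministic output order
        text, label = rng.choice(by_user[user_id])
        sample.append((user_id, text, label))
    return sample
```

Each sampled tweet carries the label of the user it came from, which is the assumption the Future Work slide proposes to revisit.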
  • 11. Feature Engineering
      • User features:
          • Length of profile name
          • Length of profile description
          • Number of followings (FI)
          • Number of followers (FE)
          • Number of tweets posted
          • Age of the user account, in hours (AU)
          • Ratio of number of followings and followers (FE/FI)
          • Reputation of the user (FE/(FI + FE))
          • Following rate (FI/AU)
          • Number of tweets posted per day
          • Number of tweets posted per week
      • Content features:
          • Number of words
          • Number of characters
          • Number of white spaces
          • Number of capitalization words
          • Number of capitalization words per word
          • Maximum word length
          • Mean word length
          • Number of exclamation marks
          • Number of question marks
          • Number of URL links
          • Number of URL links per word
          • Number of hashtags
          • Number of hashtags per word
          • Number of mentions
          • Number of mentions per word
          • Number of spam words
          • Number of spam words per word
          • Part of speech tags of every tweet
      • N-grams:
          • Uni + bi-gram or bi + tri-gram
      • Sentiment features:
          • Automatically created sentiment lexicons
          • Manually created sentiment lexicons
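Several of the count-based features listed on the Feature Engineering slide are simple enough to sketch directly. The code below is an illustrative subset only, under stated assumptions: the regular expressions, the dictionary keys, and the reading of "capitalization word" as "word starting with an upper-case letter" are not taken from the slides. Spam-word counts, part-of-speech tags, n-grams and sentiment lexicons are left out because they need external resources.

```python
import re

# Assumed patterns for this sketch; the slides do not specify how these are matched.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def content_features(text):
    """Count-based content features for one tweet (subset of the slide's list)."""
    words = text.split()
    n_words = len(words)

    def per_word(count):
        # Guard against empty tweets to avoid division by zero.
        return count / n_words if n_words else 0.0

    n_urls = len(URL_RE.findall(text))
    n_hashtags = len(HASHTAG_RE.findall(text))
    n_mentions = len(MENTION_RE.findall(text))
    # Assumption: a "capitalization word" starts with an upper-case character.
    n_capitalized = sum(1 for w in words if w[0].isupper())
    return {
        "n_words": n_words,
        "n_chars": len(text),
        "n_whitespace": sum(1 for ch in text if ch.isspace()),
        "n_capitalized": n_capitalized,
        "n_capitalized_per_word": per_word(n_capitalized),
        "max_word_len": max((len(w) for w in words), default=0),
        "mean_word_len": per_word(sum(len(w) for w in words)),
        "n_exclamations": text.count("!"),
        "n_questions": text.count("?"),
        "n_urls": n_urls,
        "n_urls_per_word": per_word(n_urls),
        "n_hashtags": n_hashtags,
        "n_hashtags_per_word": per_word(n_hashtags),
        "n_mentions": n_mentions,
        "n_mentions_per_word": per_word(n_mentions),
    }


def user_features(followers, followings, account_age_hours):
    """Ratio features from the slide: FE = followers, FI = followings, AU = age in hours."""
    fe, fi, au = followers, followings, account_age_hours
    return {
        "followers_followings_ratio": fe / fi if fi else 0.0,  # FE/FI
        "reputation": fe / (fi + fe) if fi + fe else 0.0,      # FE/(FI + FE)
        "following_rate": fi / au if au else 0.0,              # FI/AU
    }
```

For example, `user_features(100, 300, 600)` gives a reputation of 0.25 and a following rate of 0.5 followings per hour.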
  • 12. Evaluation: Experiment Settings
      • 5 widely-used classification algorithms: Bernoulli Naive Bayes, KNN, SVM, Decision Tree and Random Forests.
      • Hyperparameters optimised on a subset of the dataset separate from the train/test sets.
      • All 4 feature sets were combined.
      • 10-fold cross-validation.
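The 10-fold cross-validation mentioned in the experiment settings can be illustrated with a small index splitter. This is a generic sketch, assuming plain (non-stratified) folds and a fixed shuffle seed; the slides do not say how the folds were drawn, and libraries such as scikit-learn provide equivalent utilities.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample ends up in the test fold exactly once, so every tweet contributes to the reported scores while never being used to train the model that classifies it.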
  • 13. Evaluation: Selection of Classifier

      Classifier       1KS-10KN Dataset                 Social Honeypot Dataset
                       Precision  Recall  F-measure     Precision  Recall  F-measure
      Bernoulli NB     0.899      0.688   0.778         0.772      0.806   0.789
      KNN              0.924      0.706   0.798         0.802      0.778   0.790
      SVM              0.872      0.708   0.780         0.844      0.817   0.830
      Decision Tree    0.788      0.782   0.784         0.914      0.916   0.915
      Random Forest    0.993      0.716   0.831         0.941      0.950   0.946

      • Random Forests outperform the others in terms of F-measure and Precision.
      • Better performance on Social Honeypot (1:1 ratio rather than 1:10?).
      • Results only 4% below original papers, which require historic user features.
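For reference, the three metrics reported in the classifier table can be computed from predicted and true labels as below. This is a generic sketch rather than the authors' evaluation code; the label strings and the function name are assumptions. Scores averaged over cross-validation folds need not equal the F-measure computed from the averaged precision and recall, so small differences from the table are to be expected.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

High precision with lower recall, as in the 1KS-10KN columns, means few legitimate tweets are flagged but a share of the spam goes undetected.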
  • 14. Evaluation: Evaluation of Features (w/ Random Forests)

      Feature Set               1KS-10KN Dataset                 Social Honeypot Dataset
                                Precision  Recall  F-measure     Precision  Recall  F-measure
      User features (U)         0.895      0.709   0.791         0.938      0.940   0.940
      Content features (C)      0.951      0.657   0.776         0.771      0.753   0.762
      Uni + Bi-gram (Binary)    0.930      0.725   0.815         0.759      0.727   0.743
      Uni + Bi-gram (Tf)        0.959      0.715   0.819         0.783      0.767   0.775
      Uni + Bi-gram (Tfidf)     0.943      0.726   0.820         0.784      0.765   0.775
      Bi + Tri-gram (Tfidf)     0.931      0.684   0.788         0.797      0.656   0.720
      Sentiment features (S)    0.966      0.574   0.718         0.679      0.727   0.702

      • Testing feature sets one by one:
          • User features (U) most determinant for Social Honeypot.
          • N-gram features best for 1KS-10KN.
          • Potentially due to different dataset generation approaches?
  • 15. Evaluation: Evaluation of Features (w/ Random Forests)

      Feature Set                       1KS-10KN Dataset                 Social Honeypot Dataset
                                        Precision  Recall  F-measure     Precision  Recall  F-measure
      Single feature set                0.943      0.726   0.820         0.938      0.940   0.940
      U + C                             0.974      0.708   0.819         0.938      0.949   0.943
      U + Bi & Tri-gram (Tf)            0.972      0.745   0.843         0.937      0.949   0.943
      U + S                             0.948      0.732   0.825         0.940      0.944   0.942
      Uni & Bi-gram (Tf) + S            0.964      0.721   0.824         0.797      0.744   0.770
      C + S                             0.970      0.649   0.777         0.778      0.762   0.770
      C + Uni & Bi-gram (Tf)            0.968      0.717   0.823         0.783      0.757   0.770
      U + C + Uni & Bi-gram (Tf)        0.985      0.727   0.835         0.934      0.949   0.941
      U + C + S                         0.982      0.704   0.819         0.937      0.948   0.942
      U + Uni & Bi-gram (Tf) + S        0.994      0.720   0.834         0.928      0.946   0.937
      C + Uni & Bi-gram (Tf) + S        0.966      0.720   0.824         0.806      0.758   0.782
      U + C + Uni & Bi-gram (Tf) + S    0.988      0.725   0.835         0.936      0.947   0.942

      • However, when we combine feature sets:
          • The same approach performs best (F1) for both: U + Bi & Tri-gram (Tf).
          • Combining features helps us capture different types of spam tweets.
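The rows of the combination table correspond to every way of joining two or more of the four feature sets (six pairs, four triples, and all four together). A small sketch of how such combinations could be enumerated and merged into one feature vector is shown below. The shorthand "N" for the n-gram set, the key-prefixing scheme, and both function names are assumptions made for illustration.

```python
from itertools import combinations

# U = user, C = content, N = n-grams (shorthand assumed here), S = sentiment.
FEATURE_SETS = ("U", "C", "N", "S")


def feature_set_combinations(names=FEATURE_SETS, min_size=2):
    """All combinations of at least `min_size` feature sets, smallest first."""
    return [
        combo
        for size in range(min_size, len(names) + 1)
        for combo in combinations(names, size)
    ]


def merge_features(per_set, combo):
    """Merge the feature dicts of the chosen sets, prefixing keys to avoid clashes."""
    merged = {}
    for name in combo:
        for key, value in per_set[name].items():
            merged[f"{name}:{key}"] = value
    return merged
```

With four feature sets this yields 11 combinations, matching the number of multi-set rows in the table.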
  • 16. Evaluation Computational Efficiency • Beyond accuracy, how can all these features be applied efficiently in a stream?
  • 17. Evaluation: Computational Efficiency

      Feature set                                Comp. time (seconds) for 1k tweets
      User features                              0.0057
      N-gram                                     0.3965
      Sentiment features                         20.9838
      Number of spam words (NSW)                 19.0111
      Part-of-speech counts (POS)                0.6139
      Content features including NSW and POS     20.2367
      Content features without NSW               1.0448
      Content features without POS               19.6165

      • Tested on a regular computer (2.8 GHz Intel Core i7 processor and 16 GB memory).
      • The features that performed best in combination (User and N-grams) are those most efficiently calculated.
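A timing of the kind reported in the efficiency table (seconds per 1,000 tweets for a given feature extractor) could be obtained along these lines. This is a rough sketch with a hypothetical helper name, not the authors' benchmark; absolute numbers will depend on hardware and implementation.

```python
import time


def seconds_per_1k(extract, tweets):
    """Wall-clock seconds needed by `extract` per 1,000 tweets, scaled from the sample."""
    start = time.perf_counter()
    for tweet in tweets:
        extract(tweet)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / len(tweets)
```

For instance, `seconds_per_1k(lambda t: len(t.split()), sample_tweets)` would time a word-count feature over a list `sample_tweets` of tweet strings.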
  • 18. Conclusion
      • Random Forests were found to be the most accurate classifier.
      • Comparable performance to previous work (-4%) while limiting features to those in a tweet.
      • The use of multiple feature sets increases the chance of capturing different spam types, and makes it more difficult for spammers to evade detection.
      • Different feature sets perform best on each dataset when used separately, but the same feature sets are useful on both when combined.
  • 19. Future Work
      • Our spam corpus was constructed by picking tweets from spammers.
          • We need to study whether legitimate users are also likely to post spam tweets, and how this could affect the results.
      • A more recent, manually labelled spam/non-spam dataset.
      • Feasibility of cross-dataset spam classification?
  • 20. That’s it! • Any Questions?
  • 21. References
      [1] K. Lee, B. D. Eoff, and J. Caverlee. Seven months with the devils: A long-term study of content polluters on Twitter. In L. A. Adamic, R. A. Baeza-Yates, and S. Counts, editors, ICWSM. The AAAI Press, 2011.
      [2] C. Yang, R. C. Harkreader, and G. Gu. Die free or live hard? Empirical evaluation and new design for fighting evolving Twitter spammers. In Proceedings of the 14th International Conference on Recent Advances in Intrusion Detection, RAID’11, pages 318–337, Berlin, Heidelberg, 2011. Springer-Verlag.