A Two Step Ranking Solution 
for Twitter User Engagement 
Behnoush Abdollahi, Mahsa Badami, Gopi Chand 
Nutakki, Wenlong Sun, Olfa Nasraoui 
Knowledge Discovery and Web Mining Lab 
University of Louisville 
http://webmining.spd.louisville.edu 
Challenge@Recsys 2014
Outline 
• Introduction 
• Challenges 
• Summary of our approach 
• Preprocessing 
• Two step ranking model 
• Neighborhood-based Repairing for Tweets with 
Predicted Zero Engagement 
• Results 
• Lessons learned 
• Conclusion 
Introduction 
• Data: extended version of the MovieTweetings 
dataset 
o collected from the users of the IMDb iOS app that rate movies and share 
the rating on Twitter 
• Predicting user engagement for: 
o #favorites 
o #retweets 
• Evaluation: nDCG@10 metric → learning to rank approach
• Data statistics: (table not preserved in the transcript)
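As a rough illustration of the evaluation metric, here is a minimal sketch of nDCG@10 for one user's ranked tweets. The exponential gain `2^rel - 1` and the zero score for users with no engagement are assumptions; the slides do not state which variant the challenge used.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances_in_predicted_order, k=10):
    """nDCG@k for one user: DCG of the predicted order over DCG of the ideal order."""
    ideal = sorted(relevances_in_predicted_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        # Assumed convention for users whose tweets all have zero engagement.
        return 0.0
    return dcg_at_k(relevances_in_predicted_order, k) / ideal_dcg

# Engagement values of one user's tweets, listed in the order the model ranked them.
score = ndcg_at_k([0, 0, 3])
```

A per-user score like this would then be averaged over all users in the test set.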
Challenges 
• High dimensionality 
• Power law distribution of the user engagement 
• Missing values 
• Outliers 
• Imbalanced engagement distribution 
o more than 95% of engagement = zero 
• Different from many standard prediction and 
recommendation problems: 
o user engagement may be affected by
• implicit profile information of the user on Twitter 
• movie preference data by other users 
• movie content data 
Summary of our approach 
Preprocessing 
• Data cleaning and completion 
o Removing 
• special characters from text-based features, 
• empty, 
• redundant, 
• invalid values such as movie ratings exceeding 10 
o Filling missing data such as user id by looking at the 
nearest tweets 
o Using country code as a location feature 
• if missing → converted other similar geographical
features to the country code
Preprocessing 
• Feature Extraction + Engineering from Twitter + IMDb
• Feature groups:
o Context-based
o Text-based
o Tweet-Movie similarity
Context-based features 
o User Profile (Twitter)
• # of user’s followers/friends
• # of users that are mentioned/replied to by the current user
• same features for each re-tweeted tweet
o Movie Profile (IMDb)
• IMDb movie plot, director, actors, genre, languages, countries
• # of times in which a movie has been tagged in a tweet
• average rating for a movie
• movie total re-tweet/favorite count
• number of users who have rated a particular movie
o Twitter Profile 
• tweet/retweet flag 
• time delay between the tweet and movie release date 
• seasonality (Christmas, Halloween time,…) 
• time features: certain times of the day (day or night), certain days 
of the week (week days or weekends) 
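The temporal features above can be sketched as follows. This is only an illustration: the function name, the night-time hours, and the date windows used for the seasonal flags are all assumptions, not values taken from the slides.

```python
from datetime import datetime

def time_features(tweet_time, release_date):
    """Derive simple temporal features for one tweet (all thresholds are illustrative)."""
    return {
        # Delay between the tweet and the movie release date, in days.
        "days_since_release": (tweet_time.date() - release_date.date()).days,
        # Monday is 0, so Saturday and Sunday are 5 and 6.
        "is_weekend": tweet_time.weekday() >= 5,
        # Assumed definition of "night": before 06:00 or from 20:00 onward.
        "is_night": tweet_time.hour < 6 or tweet_time.hour >= 20,
        # Assumed seasonal windows.
        "is_christmas_season": tweet_time.month == 12 and tweet_time.day >= 18,
        "is_halloween_season": tweet_time.month == 10 and tweet_time.day >= 25,
    }

features = time_features(datetime(2013, 12, 21, 22, 30), datetime(2013, 12, 13))
```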
Text-based features 
• Text sources: user description, hashtag, movie plot, movie genre
• Extracted bag of words, then selected the most relevant words based on
Mutual Information Gain Ratio with the target
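A minimal sketch of selecting bag-of-words features by their dependence on the target. It scores each word with plain mutual information between word presence and a binary engagement label; the slide names "Mutual Information Gain Ratio", so the exact criterion, the whitespace tokenizer, and the toy data here are assumptions.

```python
import math
from collections import Counter

def mutual_information(presence, labels):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(labels)
    joint = Counter(zip(presence, labels))
    px = Counter(presence)
    py = Counter(labels)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def select_words(texts, labels, top_k):
    """Rank vocabulary words by mutual information with the label and keep top_k."""
    docs = [set(text.lower().split()) for text in texts]
    vocab = sorted(set().union(*docs))
    scored = [(mutual_information([w in d for d in docs], labels), w) for w in vocab]
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:top_k]]

texts = ["great thriller must watch", "boring plot", "must watch classic", "boring and slow"]
engaged = [1, 0, 1, 0]  # toy labels: 1 = non-zero engagement
top = select_words(texts, engaged, top_k=2)
```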
Tweet-Movie Similarity 
• Using common lower dimensional latent space 
• Joint latent space is learned 
o using NMF 
o to capture the similarity between a tweet and the movie that it 
mentions 
o to handle the problems of: 
• sparsity, 
• high dimensionality in bag of word features, 
• poor semantics 
o based on tweet & movie features, such as: 
• hashtags, 
• user description, 
• movie genres, 
• movie plot, ... 
Tweet-Movie Similarity 
1. Building a semantic tweet representation by factoring the tweet matrix X1:
$X_1\,(n_1 \times m_1) = A_1\,(n_1 \times f_1) \times B_1\,(f_1 \times m_1)$
where: n1 = #tweets, m1 = #features and f1 = #factors
(Diagram: Tweets × Words = (Tweets × Latent Factors) × (Latent Factors × Words))
2. Mapping the movie data X2 into the latent space defined by B1 to compute the
movie coefficient matrix A2:
$X_2\,(n_2 \times m_1) = A_2\,(n_2 \times f_1) \times B_1\,(f_1 \times m_1)$
where: n2 = #movies
3. Computing the similarity using the dot product of the corresponding rows of A1
and A2:
$\mathrm{sim}(X_1, X_2) = A_1 \times A_2^{T}$
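The three steps can be sketched in pure Python as below. This is an illustration under stated assumptions, not the authors' implementation: it uses the standard multiplicative-update rules for NMF with a squared-error objective, random initialization, and a fixed iteration count, none of which the slides specify. The helper names are hypothetical.

```python
import random

def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def transpose(M):
    return [list(col) for col in zip(*M)]

def update_A(X, A, B, eps=1e-9):
    """One multiplicative update of A for X ~ A B, with B held fixed."""
    Bt = transpose(B)
    num = matmul(X, Bt)
    den = matmul(matmul(A, B), Bt)
    return [[a * n / (d + eps) for a, n, d in zip(ra, rn, rd)]
            for ra, rn, rd in zip(A, num, den)]

def update_B(X, A, B, eps=1e-9):
    """One multiplicative update of B for X ~ A B, with A held fixed."""
    At = transpose(A)
    num = matmul(At, X)
    den = matmul(matmul(At, A), B)
    return [[b * n / (d + eps) for b, n, d in zip(rb, rn, rd)]
            for rb, rn, rd in zip(B, num, den)]

def random_matrix(rows, cols, rng):
    return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

def tweet_movie_similarity(X1, X2, f1, iters=300, seed=0):
    rng = random.Random(seed)
    A1 = random_matrix(len(X1), f1, rng)
    B1 = random_matrix(f1, len(X1[0]), rng)
    for _ in range(iters):            # step 1: factor the tweet matrix X1
        A1 = update_A(X1, A1, B1)
        B1 = update_B(X1, A1, B1)
    A2 = random_matrix(len(X2), f1, rng)
    for _ in range(iters):            # step 2: map movies into the space of B1
        A2 = update_A(X2, A2, B1)
    return matmul(A1, transpose(A2))  # step 3: sim = A1 x A2^T

# Toy data: 2 tweets and 2 movies over 4 shared word features.
X1 = [[1, 1, 0, 0], [0, 0, 1, 1]]
X2 = [[1, 1, 0, 0], [0, 0, 1, 1]]
sim = tweet_movie_similarity(X1, X2, f1=2)
```

In the toy data, tweet 0 shares its words with movie 0 and tweet 1 with movie 1, so `sim[i][j]` should be largest where the indices match.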
Two step ranking model 
1. Cost sensitive classifier 
o Classify tweets into zero engagement and multiple non-zero 
classes 
o Imbalanced data → used cost sensitive classification
2. Ranker 
o List-wise, point-wise, pair-wise approaches 
o Predict the relevance values in Information Retrieval 
tasks 
o Engagement values considered as grades/labels 
• y =1, 2,…, l 
Cost sensitive classifier 
• More than 95% of the data have zero engagement value →
o classical classification methods tend to misclassify the minority class
(non-zero engagements) as the majority class (zero engagements)
• Cost sensitive framework
o assign different weights to errors of type I and type II
o assign higher weights to tweets classified in the zero engagement class
o use a cost matrix C; for a sample x and a candidate class i, the following loss
function is minimized over i:
$L(x, i) = \sum_j P(j \mid x)\, C(i, j)$
where C(i, j) = the cost of predicting class i when the true class is j
o cost matrix:
                     actual negative   actual positive
predicted negative   C(0,0)            C(0,1)
predicted positive   C(1,0)            C(1,1)
• Tweets classified as non-zero engagement are then passed to the ranker
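A minimal sketch of the decision rule implied by the loss L(x, i): given class probabilities P(j | x) from any classifier, pick the class with the smallest expected cost. The cost values below are hypothetical and chosen only to show how a high penalty for missed non-zero tweets shifts the decision.

```python
def expected_loss(probs, cost, i):
    """L(x, i) = sum_j P(j | x) * C(i, j): expected cost of predicting class i."""
    return sum(p_j * cost[i][j] for j, p_j in enumerate(probs))

def min_cost_class(probs, cost):
    """Return the class i with the smallest expected loss."""
    return min(range(len(cost)), key=lambda i: expected_loss(probs, cost, i))

# Hypothetical cost matrix: C[i][j] = cost of predicting i when the true class is j.
# Missing a non-zero tweet (predict 0, truth 1) costs 20 times a false alarm.
C = [[0.0, 20.0],
     [1.0, 0.0]]
probs = [0.9, 0.1]  # P(zero | x), P(non-zero | x)
predicted = min_cost_class(probs, C)
```

With these numbers the expected loss is 2.0 for predicting zero and 0.9 for predicting non-zero, so the tweet is sent to the ranker even though the classifier considers it 90% likely to have zero engagement.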
Ranker 
• Let the grade/label set be: $y = 1, 2, \ldots, l$
• Let the twitter_user_ids be $u_1, u_2, \ldots, u_m$, and let
$T_i = \{t_{i,1}, t_{i,2}, \ldots, t_{i,n_i}\}$ be the set of tweets associated with user $u_i$;
then $y_i = \{y_{i,1}, y_{i,2}, \ldots, y_{i,n_i}\}$ is the set of grades associated with user $u_i$
($n_i$ is the size of $T_i$ and $y_i$)
• The classified training set is denoted as: $S = \{(u_i, T_i), y_i\}_{i=1}^{m}$
• A feature vector $x_{i,j} = F(u_i, t_{i,j})$ is generated from each group-tweet
pair $(u_i, t_{i,j})$
• Our goal: to train ranking models which can assign a score to a
given pair $(u_i, t_{i,j})$
• Used Random Forest (RF)
o uses independent subsets
o parallelizable, robust to noisy data, capable of learning
disjunctive expressions
Neighborhood-based Repairing for 
Tweets with Predicted Zero 
Engagement 
(Diagram labels: non-zero engagement, zero engagement)
Neighborhood-based 
Repairing for Tweets with 
Predicted Zero-Engagement 
• Goal: to correct the predictions for non-zero engagement tweets that were
misclassified as zero engagement tweets
• Added a neighborhood-based approach after step 1:
1. Compute the similarity between training and test tweets in the
common latent space (computed using NMF)
2. Find the NT nearest tweets in the training set for each test tweet
classified as zero-engagement
3. Reassign the predicted engagement
Neighborhood-based 
Repairing for Tweets with 
Predicted Zero Engagement 
• Varied neighborhood size: NT = 5, 10 or 20
• Used two options to reassign predicted engagements:
1. Predict non-zero engagement if any of the neighbors’
engagements is non-zero
2. Predict engagement = the most frequent engagement
value among neighbors
• Select a margin based on cosine similarity to consider
a set of zero-predicted-engagement test tweets to
become candidates to be repaired
o → move to non-zero predicted engagement
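The repair step can be sketched as follows for a single test tweet that step 1 predicted as zero engagement. The function name and defaults are hypothetical. The slides do not say which non-zero value option 1 assigns, so this sketch assumes the most frequent non-zero value among the neighbors.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two latent-space vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def repair(test_vec, train_vecs, train_engagement, nt=10, threshold=0.9, option=1):
    """Return a repaired engagement for one test tweet predicted as zero."""
    nearest = sorted(((cosine(test_vec, v), e)
                      for v, e in zip(train_vecs, train_engagement)),
                     key=lambda pair: pair[0], reverse=True)[:nt]
    # Only neighbors inside the similarity margin may trigger a repair.
    neighbors = [e for s, e in nearest if s >= threshold]
    if not neighbors:
        return 0
    if option == 1:
        # Option 1: any non-zero neighbor moves the tweet to non-zero.
        nonzero = [e for e in neighbors if e > 0]
        return Counter(nonzero).most_common(1)[0][0] if nonzero else 0
    # Option 2: most frequent engagement value among the neighbors.
    return Counter(neighbors).most_common(1)[0][0]

train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
repaired = repair([1.0, 0.05], train_vecs, [3, 0, 5], nt=2, option=1)
```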
Steps in Process 
1. classification
2. refine classification
3. ranking the non-zero tweets
4. merge the zeros with the ranked non-zero tweets
Results 
• Step 1- Classifier 
o Used Weka with a cost sensitive framework 
o The engagement was discretized into 6 classes 
• class 1: only 0s
• class 2: only 1s
• class 3: 2-10
• class 4: 11-20
• class 5: 21-50
• class 6: values > 50
o AdaBoost gave the best result
• 99% classification accuracy 
• true positive rate = 0.74 in the minority class (non-zero engagements) 
• false positive rate = 0.01 
• Step 2- Ranker 
o Built global ranking model 
o Used RankLib implemented in Java 
o Engagement is considered the target value to be ranked 
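The six-class discretization used in Step 1 can be written as a small lookup. This sketch assumes engagement is a non-negative integer; the function name is hypothetical.

```python
from bisect import bisect_left

# Upper bounds (inclusive) of classes 1 to 5; anything above 50 falls in class 6.
UPPER_BOUNDS = [0, 1, 10, 20, 50]

def engagement_class(engagement):
    """Map a raw engagement count to one of the six classes (1 to 6)."""
    return bisect_left(UPPER_BOUNDS, engagement) + 1

classes = [engagement_class(e) for e in (0, 1, 2, 10, 11, 20, 21, 50, 51)]
```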
Results 
• nDCG@10 for the test data. 
• Ranker applied indiscriminately on both zero and 
non-zero class tweets. 
• This result is shown only to illustrate the impact of
the classifier in Step 1.
Ranking Algorithm   All features   Excluding IMDb Features   Excluding graph-propagated features
Random Forest       0.553          0.503                     0.485
LambdaMART          0.466          0.422                     0.411
RankBoost           0.432          0.417                     0.406
Results 
• Merging the zero and non-zero predicted 
engagement to calculate nDCG@10 
– by appending the zero-class tweets at the end of 
the non-zero tweets for each user and then 
sorting the tweets again based on the user id 
Ranking Algorithm   All features   Excluding IMDb Features   Excluding graph-propagated features
Random Forest       0.805          0.503                     0.485
LambdaMART          0.466          0.422                     0.411
RankBoost           0.432          0.417                     0.406
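The merging step can be sketched as below: for each user, the ranked non-zero tweets come first and the tweets predicted as zero engagement are appended after them, with the output ordered by user id. The tuple layout and function name are assumptions made for illustration.

```python
from collections import defaultdict

def merge_rankings(tweets):
    """tweets: iterable of (user_id, tweet_id, predicted_zero, ranker_score)."""
    per_user = defaultdict(lambda: ([], []))
    for user_id, tweet_id, predicted_zero, score in tweets:
        nonzero, zero = per_user[user_id]
        (zero if predicted_zero else nonzero).append((score, tweet_id))
    merged = []
    for user_id in sorted(per_user):  # final ordering by user id
        nonzero, zero = per_user[user_id]
        nonzero.sort(key=lambda pair: pair[0], reverse=True)
        merged.extend((user_id, tweet_id) for _, tweet_id in nonzero)
        merged.extend((user_id, tweet_id) for _, tweet_id in zero)
    return merged

ranked = merge_rankings([
    ("u2", "t1", False, 0.2),
    ("u1", "t2", True, 0.0),
    ("u1", "t3", False, 0.7),
    ("u2", "t4", False, 0.9),
    ("u1", "t5", False, 0.1),
])
```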
Results 
• Effect of repairing the predicted zero engagement tweets on 
nDCG@10 values for varying margin sizes defined by different 
similarity thresholds and neighborhood sizes 
• varied the neighborhood size, NT = 5, 10, 15
• using NT = 10 nearest neighbors and a similarity threshold of 0.9 to
confine the margin where tweets are repaired
→ increased the nDCG@10 value to 0.817
Lessons Learned 
• What helped 
o Adding additional sources of data (IMDb). 
o Dedicated janitorial work on the data 
o Feature Engineering: selection, extraction, and 
construction of relevant features. 
o Mapping data (tweets and movies) to a Latent Factor 
Space using NMF 
o Binary classification of zero and non-zero engagement 
tweets prior to ranking. 
o Cost-sensitive classification in Step 1 
o Using Learning to Rank (LTR) methods in Step 2. 
o Repairing misclassified non-zero engagement tweets within 
a limited margin, and then re-ranking them. 
Lessons Learned 
• What hurt 
o Over or under-sampling to handle class 
imbalance 
o Spending too much time on filling missing values 
for many features did not have any return on 
investment 
o Not having the luxury of better computational 
power 
o Building separate models based on whether or 
not the users and movies were common 
between the training and test sets 
Conclusion 
• If more time or computational power were within our 
reach, we would further explore several directions: 
o Exploring other LTR options in addition to the 
pointwise approach, including pairwise and 
listwise LTR. 
o Domain-informed Transductive learning 
o Exploring additional extracted or constructed 
features that may affect engagement. 
o Exploring high performance computing to
compute many embarrassingly parallelizable
tasks.

Weitere ähnliche Inhalte

Ähnlich wie A Two Step Ranking Solution for Twitter User Engagement

DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...Hakka Labs
 
Twitter Agreement Analysis
Twitter Agreement AnalysisTwitter Agreement Analysis
Twitter Agreement AnalysisArvind Krishnaa
 
Microposts2015 - Social Spam Detection on Twitter
Microposts2015 - Social Spam Detection on TwitterMicroposts2015 - Social Spam Detection on Twitter
Microposts2015 - Social Spam Detection on Twitterazubiaga
 
[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用台灣資料科學年會
 
Kaggle Gold Medal Case Study
Kaggle Gold Medal Case StudyKaggle Gold Medal Case Study
Kaggle Gold Medal Case StudyAlon Bochman, CFA
 
microposts2015presentation-150518124457-lva1-app6892.pdf
microposts2015presentation-150518124457-lva1-app6892.pdfmicroposts2015presentation-150518124457-lva1-app6892.pdf
microposts2015presentation-150518124457-lva1-app6892.pdfSunnySam26
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersAlbert Y. C. Chen
 
Empirical Evaluation of Active Learning in Recommender Systems
Empirical Evaluation of Active Learning in Recommender SystemsEmpirical Evaluation of Active Learning in Recommender Systems
Empirical Evaluation of Active Learning in Recommender SystemsUniversity of Bergen
 
[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)Kunwoo Park
 
OWF14 - Big Data : The State of Machine Learning in 2014
OWF14 - Big Data : The State of Machine  Learning in 2014OWF14 - Big Data : The State of Machine  Learning in 2014
OWF14 - Big Data : The State of Machine Learning in 2014Paris Open Source Summit
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusSease
 
Using transfer learning for video popularity prediction
Using transfer learning for video popularity predictionUsing transfer learning for video popularity prediction
Using transfer learning for video popularity predictioneSAT Publishing House
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Xavier Amatriain
 
Retweet Prediction with Attention-based Deep Neural Network
Retweet Prediction with Attention-based Deep Neural NetworkRetweet Prediction with Attention-based Deep Neural Network
Retweet Prediction with Attention-based Deep Neural NetworkGUANGYUAN PIAO
 
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
Strata 2016 -  Lessons Learned from building real-life Machine Learning SystemsStrata 2016 -  Lessons Learned from building real-life Machine Learning Systems
Strata 2016 - Lessons Learned from building real-life Machine Learning SystemsXavier Amatriain
 
REVIEW PPT.pptx
REVIEW PPT.pptxREVIEW PPT.pptx
REVIEW PPT.pptxSaravanaD2
 
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting RatingsSemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings Matthew Rowe
 
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptopRising Media, Inc.
 
Best Practices in Recommender System Challenges
Best Practices in Recommender System ChallengesBest Practices in Recommender System Challenges
Best Practices in Recommender System ChallengesAlan Said
 
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Hady Elsahar
 

Ähnlich wie A Two Step Ranking Solution for Twitter User Engagement (20)

DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
 
Twitter Agreement Analysis
Twitter Agreement AnalysisTwitter Agreement Analysis
Twitter Agreement Analysis
 
Microposts2015 - Social Spam Detection on Twitter
Microposts2015 - Social Spam Detection on TwitterMicroposts2015 - Social Spam Detection on Twitter
Microposts2015 - Social Spam Detection on Twitter
 
[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用
 
Kaggle Gold Medal Case Study
Kaggle Gold Medal Case StudyKaggle Gold Medal Case Study
Kaggle Gold Medal Case Study
 
microposts2015presentation-150518124457-lva1-app6892.pdf
microposts2015presentation-150518124457-lva1-app6892.pdfmicroposts2015presentation-150518124457-lva1-app6892.pdf
microposts2015presentation-150518124457-lva1-app6892.pdf
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
 
Empirical Evaluation of Active Learning in Recommender Systems
Empirical Evaluation of Active Learning in Recommender SystemsEmpirical Evaluation of Active Learning in Recommender Systems
Empirical Evaluation of Active Learning in Recommender Systems
 
[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)
 
OWF14 - Big Data : The State of Machine Learning in 2014
OWF14 - Big Data : The State of Machine  Learning in 2014OWF14 - Big Data : The State of Machine  Learning in 2014
OWF14 - Big Data : The State of Machine Learning in 2014
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
 
Using transfer learning for video popularity prediction
Using transfer learning for video popularity predictionUsing transfer learning for video popularity prediction
Using transfer learning for video popularity prediction
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
 
Retweet Prediction with Attention-based Deep Neural Network
Retweet Prediction with Attention-based Deep Neural NetworkRetweet Prediction with Attention-based Deep Neural Network
Retweet Prediction with Attention-based Deep Neural Network
 
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
Strata 2016 -  Lessons Learned from building real-life Machine Learning SystemsStrata 2016 -  Lessons Learned from building real-life Machine Learning Systems
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
 
REVIEW PPT.pptx
REVIEW PPT.pptxREVIEW PPT.pptx
REVIEW PPT.pptx
 
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting RatingsSemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings
 
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop
 
Best Practices in Recommender System Challenges
Best Practices in Recommender System ChallengesBest Practices in Recommender System Challenges
Best Practices in Recommender System Challenges
 
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
 

Kürzlich hochgeladen

Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLkantirani197
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Creating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsCreating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsNurulAfiqah307317
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 

Kürzlich hochgeladen (20)

Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Creating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsCreating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening Designs
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 

A Two Step Ranking Solution for Twitter User Engagement

  • 1. A Two Step Ranking Solution for Twitter User Engagement Behnoush Abdollahi, Mahsa Badami, Gopi Chand Nutakki, Wenlong Sun, Olfa Nasraoui Knowledge Discovery and Web Mining Lab University of Louisville http://webmining.spd.louisville.edu Challenge@Recsys 2014 1
  • 2. Outline • Introduction • Challenges • Summary of our approach • Preprocessing • Two step ranking model • Neighborhood-based Repairing for Tweets with Predicted Zero Engagement • Results • Lessons learned • Conclusion Challenge@Recsys 2014 2
  • 3. Introduction • Data: extended version of the MovieTweetings dataset o collected from the users of the IMDb iOS app that rate movies and share the rating on Twitter • Predicting user engagement for: o #favorites o #retweets • Evaluation: nDCG@10 metric learning to rank approach data statistics: Challenge@Recsys 2014 3
  • 4. Challenges • High dimensionality • Power law distribution of the user engagement • Missing values • Outliers • Imbalanced engagement distribution o more than 95% of engagement = zero • Different from many standard prediction and recommendation problems: o user egagement may be affected by • implicit profile information of the user on Twitter • movie preference data by other users • movie content data Challenge@Recsys 2014 4
  • 5. Summary of our approach Challenge@Recsys 2014 5
  • 6. Preprocessing • Data cleaning and completion o Removing • special characters from text-based features, • empty, • redundant, • invalid values such as movie ratings exceeding 10 o Filling missing data such as user id by looking at the nearest tweets o Using country code as a location feature • if missing  converted other similar geographical features to the country code Challenge@Recsys 2014 6
  • 7. Preprocessing • Feature Extraction + Engineering from Twitter + IMDb Features Context-based Text-based Tweet-Movie similarity Challenge@Recsys 2014 7
  • 8. Context-based features o User Profile(Twitter) • # of user’s followers/friends • # of users that are mentioned/ replied to by the current user • same features for each re-tweeted tweet o Movie Profile (IMDb) • imdb movie plot, director, actors, genre, languages, countries • # of times in which a movie has been tagged in a tweet • average rating for a movie • Movie total re-tweet/favorite count number of users who have rated a particular movie o Twitter Profile • tweet/retweet flag • time delay between the tweet and movie release date • seasonality (Christmas, Halloween time,…) • time features: certain times of the day (day or night), certain days of the week (week days or weekends) Challenge@Recsys 2014 8
  • 9. Text-based features
    • Sources: user description, hashtags, movie plot, movie genre
    • Extracted a bag of words, then selected the most relevant words based on Mutual Information Gain Ratio with the target
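A bag-of-words selection step of this kind can be sketched as follows. Note the swap: the example ranks words by plain mutual information between word presence and the label, not by the gain ratio variant named on the slide, and whitespace tokenization and the function names are assumptions.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def select_words(docs, labels, top_k):
    """Return the top_k vocabulary words ranked by MI between word presence and label."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*tokenized))
    scores = {w: mutual_information([w in t for t in tokenized], labels)
              for w in vocab}
    # Stable sort: ties keep alphabetical order.
    return sorted(vocab, key=lambda w: -scores[w])[:top_k]
```

With labels such as zero versus non-zero engagement, the selected words would form the reduced text feature set.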
  • 10. Tweet-Movie Similarity
    • Using a common lower-dimensional latent space
    • Joint latent space is learned
      o using NMF
      o to capture the similarity between a tweet and the movie that it mentions
      o to handle the problems of:
        • sparsity,
        • high dimensionality in bag-of-words features,
        • poor semantics
      o based on tweet & movie features, such as:
        • hashtags,
        • user description,
        • movie genres,
        • movie plot, ...
  • 11. Tweet-Movie Similarity
    1. Building a semantic tweet representation by factoring the tweet matrix $X_1$:
       $X_1\,(n_1 \times m_1) = A_1\,(n_1 \times f_1) \times B_1\,(f_1 \times m_1)$
       where: $n_1$ = #tweets, $m_1$ = #features and $f_1$ = #factors
       [Figure: (Tweets × Words) = (Tweets × Latent Factors) x (Latent Factors × Words)]
    2. Mapping the movie data $X_2$ into the latent space defined by $B_1$ to compute the movie coefficient matrix $A_2$:
       $X_2\,(n_2 \times m_1) = A_2\,(n_2 \times f_1) \times B_1\,(f_1 \times m_1)$
       where: $n_2$ = #movies
    3. Computing the similarity using the dot product of the corresponding rows of $A_1$ and $A_2$:
       $\mathrm{sim}(X_1, X_2) = A_1 \times A_2^{T}$
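The three steps above can be sketched in plain Python as follows. This is an illustrative toy, assuming the standard multiplicative-update rules for NMF under squared error; the slides do not say which NMF solver, initialization, or iteration count was used, and all function names here are hypothetical.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def update_A(X, A, B, eps=1e-9):
    """One multiplicative update of A for X ~ A B: A <- A * (X B^T) / (A B B^T)."""
    Bt = transpose(B)
    num, den = matmul(X, Bt), matmul(matmul(A, B), Bt)
    return [[a * n / (d + eps) for a, n, d in zip(ra, rn, rd)]
            for ra, rn, rd in zip(A, num, den)]

def update_B(X, A, B, eps=1e-9):
    """One multiplicative update of B for X ~ A B: B <- B * (A^T X) / (A^T A B)."""
    At = transpose(A)
    num, den = matmul(At, X), matmul(At, matmul(A, B))
    return [[b * n / (d + eps) for b, n, d in zip(rb, rn, rd)]
            for rb, rn, rd in zip(B, num, den)]

def random_matrix(rows, cols, rng):
    # Strictly positive entries so multiplicative updates can move them.
    return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

def tweet_movie_similarity(X1, X2, n_factors, n_iter=500, seed=0):
    rng = random.Random(seed)
    A1 = random_matrix(len(X1), n_factors, rng)
    B1 = random_matrix(n_factors, len(X1[0]), rng)
    for _ in range(n_iter):            # step 1: factor the tweet matrix X1
        A1 = update_A(X1, A1, B1)
        B1 = update_B(X1, A1, B1)
    A2 = random_matrix(len(X2), n_factors, rng)
    for _ in range(n_iter):            # step 2: map movies with B1 held fixed
        A2 = update_A(X2, A2, B1)
    return matmul(A1, transpose(A2))   # step 3: A1 x A2^T
```

Entry `[i][j]` of the result is the similarity between tweet `i` and movie `j`; both input matrices must share the same feature (word) columns.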
  • 12. Two step ranking model
    1. Cost sensitive classifier
      o Classify tweets into zero engagement and multiple non-zero classes
      o Imbalanced data → used cost sensitive classification
    2. Ranker
      o List-wise, point-wise, pair-wise approaches
      o Predict the relevance values in Information Retrieval tasks
      o Engagement values considered as grades/labels
        • $y = 1, 2, \dots, l$
  • 13. Cost sensitive classifier
    • More than 95% of the data have zero engagement value →
      o classical classification methods tend to misclassify the minority class (non-zero engagements) as the majority class (zero engagements)
    • Cost sensitive framework
      o assign different weights to errors of type I and type II
      o assign higher weights to tweets classified in the zero-engagement class
      o use a cost matrix C: for a sample x and a candidate class i, the following loss function is minimized to find the prediction:
        $L(x, i) = \sum_{j} P(j \mid x)\, C(i, j)$
        where C(i,j) = the cost of predicting class i when the true class is j

                              actual negative   actual positive
        predicted negative    C(0,0)            C(0,1)
        predicted positive    C(1,0)            C(1,1)

    • Tweets classified as non-zero engagement are then passed to the ranker
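As a small worked example of the loss L(x, i) = Σ_j P(j | x) C(i, j), the sketch below picks the class with the lowest expected cost. The cost values used in the usage note are invented for illustration; the slides do not report the actual cost matrix.

```python
def min_expected_cost_class(probs, cost):
    """Pick the class i minimizing L(x, i) = sum_j P(j | x) * C(i, j).

    probs[j] is P(j | x); cost[i][j] is the cost of predicting class i
    when the true class is j.
    """
    losses = [sum(p * c for p, c in zip(probs, row)) for row in cost]
    best = min(range(len(losses)), key=losses.__getitem__)
    return best, losses
```

For instance, with `cost = [[0, 10], [1, 0]]` (missing a non-zero tweet costs ten times more than a false alarm) and `probs = [0.85, 0.15]`, the expected losses are 1.5 for predicting zero and 0.85 for predicting non-zero, so the tweet is labeled non-zero even though zero is the more probable class.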
  • 14. Ranker
    • Let the grade/label set be $y = 1, 2, \dots, l$
    • Let the Twitter user ids be $u_1, u_2, \dots, u_m$, and let $T_i = \{t_{i,1}, t_{i,2}, \dots, t_{i,n_i}\}$ be the set of tweets associated with user $u_i$; then $y_i = \{y_{i,1}, y_{i,2}, \dots, y_{i,n_i}\}$ is the set of grades associated with user $u_i$ ($n_i$ is the size of $T_i$ and $y_i$)
    • The classified training set is denoted as $S = \{(u_i, T_i), y_i\}_{i=1}^{m}$
    • A feature vector $x_{i,j} = F(u_i, t_{i,j})$ is generated from each group-tweet pair $(u_i, t_{i,j})$
    • Our goal: to train ranking models which can assign a score to a given pair $(u_i, t_{i,j})$
    • Used Random Forest (RF)
      o uses independent subsets
      o parallelizable, robust to noisy data, capable of learning disjunctive expressions
  • 15. Neighborhood-based Repairing for Tweets with Predicted Zero Engagement
    [Figure: tweets labeled "non-zero engagement" and "zero engagement"]
  • 16. Neighborhood-based Repairing for Tweets with Predicted Zero Engagement
    • Goal: to correct predictions of non-zero tweets misclassified as zero-engagement tweets
    • Added a neighborhood-based approach after step 1:
      1. Compute similarity between training and test tweets in the common latent space (computed using NMF)
      2. Find the NT nearest tweets in the training set for each test tweet classified as zero-engagement
      3. Reassign the predicted engagement
  • 17. Neighborhood-based Repairing for Tweets with Predicted Zero Engagement
    • Varied neighborhood size: NT = 5, 10 or 20
    • Used two options to reassign predicted engagements:
      1. Predict non-zero engagement if any of the neighbors' engagements is non-zero
      2. Predict engagement = the most frequent engagement value among neighbors
    • Select a margin based on cosine similarity to decide which zero-predicted-engagement test tweets become candidates to be repaired
      o → move to non-zero predicted engagement
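The repair rule can be sketched as below, under several assumptions the slides leave open: the margin is read as a minimum cosine similarity that a neighbor must reach to count, option 1 returns the placeholder value 1 to mean "non-zero", and latent vectors are plain Python lists. The function and parameter names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def repair(test_vec, train_vecs, train_engagements,
           n_neighbors=10, threshold=0.9, option="any"):
    """Return a repaired engagement for a test tweet that was predicted as zero."""
    # Find the n_neighbors most similar training tweets in the latent space.
    scored = sorted(((cosine(test_vec, v), e)
                     for v, e in zip(train_vecs, train_engagements)),
                    key=lambda pair: pair[0], reverse=True)[:n_neighbors]
    # Assumption: only neighbors inside the similarity margin may trigger a repair.
    close = [e for s, e in scored if s >= threshold]
    if not close:
        return 0
    if option == "any":
        # Option 1: flag as non-zero (placeholder value 1) if any close neighbor is non-zero.
        return 1 if any(e > 0 for e in close) else 0
    # Option 2: the most frequent engagement value among close neighbors.
    return Counter(close).most_common(1)[0][0]
```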
  • 18. Steps in Process
    1. classification
    2. refine classification
    3. ranking the non-zero tweets
    4. merge the zeros with the ranked non-zero tweets
  • 19. Results
    • Step 1 - Classifier
      o Used Weka with a cost sensitive framework
      o The engagement was discretized into 6 classes
        • class 1: only 0s
        • class 2: only 1s
        • class 3: 2-10
        • class 4: 11-20
        • class 5: 21-50
        • class 6: values > 50
      o AdaBoost gave the best result
        • 99% classification accuracy
        • true positive rate = 0.74 in the minority class (non-zero engagements)
        • false positive rate = 0.01
    • Step 2 - Ranker
      o Built global ranking model
      o Used RankLib implemented in Java
      o Engagement is considered the target value to be ranked
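The six-class discretization listed above maps directly onto a lookup over the class upper bounds. A minimal sketch, with `engagement_class` as a hypothetical helper name and non-negative integer engagement counts assumed:

```python
from bisect import bisect_left

# Inclusive upper bounds of classes 1..5; anything above 50 falls in class 6.
UPPER_BOUNDS = [0, 1, 10, 20, 50]

def engagement_class(engagement):
    """Map a non-negative engagement count to its class label (1..6)."""
    return bisect_left(UPPER_BOUNDS, engagement) + 1
```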
  • 20. Results
    • nDCG@10 for the test data.
    • Ranker applied indiscriminately on both zero and non-zero class tweets.
    • This result is only shown to appreciate the impact of the classifier in Step 1.

      Ranking Algorithm   All features   Excluding IMDb features   Excluding graph-propagated features
      Random Forest       0.553          0.503                     0.485
      LambdaMART          0.466          0.422                     0.411
      RankBoost           0.432          0.417                     0.406
  • 21. Results
    • Merging the zero and non-zero predicted engagement to calculate nDCG@10: the zero-class tweets are appended at the end of the non-zero tweets for each user, and the tweets are then sorted again based on the user id

      Ranking Algorithm   All features   Excluding IMDb features   Excluding graph-propagated features
      Random Forest       0.805          0.503                     0.485
      LambdaMART          0.466          0.422                     0.411
      RankBoost           0.432          0.417                     0.406
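The merge step described on this slide can be sketched as follows. The tuple layouts for the two inputs and the name `merge_rankings` are assumptions made for the example.

```python
from collections import defaultdict

def merge_rankings(ranked_nonzero, predicted_zero):
    """Merge ranker output with zero-class tweets into one per-user ranking.

    ranked_nonzero: list of (user_id, tweet_id, score) from the ranker.
    predicted_zero: list of (user_id, tweet_id) from the classifier.
    """
    per_user = defaultdict(list)
    # Non-zero tweets first, in descending score order within each user.
    for user, tweet, _score in sorted(ranked_nonzero, key=lambda r: -r[2]):
        per_user[user].append(tweet)
    # Zero-class tweets are appended at the end of each user's list.
    for user, tweet in predicted_zero:
        per_user[user].append(tweet)
    # Final output is sorted by user id, keeping each user's internal order.
    return [(user, tweet) for user in sorted(per_user) for tweet in per_user[user]]
```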
  • 22. Results
    • Effect of repairing the predicted zero engagement tweets on nDCG@10 values for varying margin sizes, defined by different similarity thresholds and neighborhood sizes
    • Varied the neighborhood size, NT = 5, 10, 15
    • With NT = 10 nearest neighbors and a similarity threshold of 0.9 to confine the margin where tweets are repaired → nDCG@10 increased to 0.817
  • 23. Lessons Learned
    • What helped
      o Adding additional sources of data (IMDb).
      o Dedicated janitorial work on the data
      o Feature Engineering: selection, extraction, and construction of relevant features.
      o Mapping data (tweets and movies) to a Latent Factor Space using NMF
      o Binary classification of zero and non-zero engagement tweets prior to ranking.
      o Cost-sensitive classification in Step 1
      o Using Learning to Rank (LTR) methods in Step 2.
      o Repairing misclassified non-zero engagement tweets within a limited margin, and then re-ranking them.
  • 24. Lessons Learned
    • What hurt
      o Over- or under-sampling to handle class imbalance
      o Spending too much time on filling missing values for many features, which gave no return on investment
      o Not having the luxury of better computational power
      o Building separate models based on whether or not the users and movies were common between the training and test sets
  • 25. Conclusion
    • If more time or computational power were within our reach, we would further explore several directions:
      o Exploring other LTR options in addition to the pointwise approach, including pairwise and listwise LTR.
      o Domain-informed transductive learning
      o Exploring additional extracted or constructed features that may affect engagement.
      o Exploring high performance computing to compute many embarrassingly parallelizable tasks.