This document summarizes Cataldo Musto's presentation on enhancing vector space models for content-based recommender systems. The presentation evaluated random indexing and semantic vectors models for building user profiles and recommending items. Experimental results on a movie dataset showed that the weighted models improved accuracy over the basic models, and that incorporating negative preferences through semantic vectors further improved accuracy over naive Bayes recommendations.
Enhanced Vector Space Models for Content-based Recommender Systems
1. ACM Recommender Systems 2010
Barcelona, Spain
Enhanced Vector Space
Models for Content-based
Recommender Systems
Cataldo Musto - cataldomusto@di.uniba.it
University of Bari “Aldo Moro” (Italy), SWAP Research Group
ACM Recsys 2010 Doctoral Symposium
26.09.10
2. outline
• Motivations
• Goals
• Analysis of Vector Space Models
• Enhanced Vector Space Models
• Random Indexing-based model
• Semantic Vectors-based model
• Experimental Evaluation
• Open Issues
• Future Works
Cataldo Musto, Enhanced Vector Space Models for Content-based Recommender Systems - ACM RecSys 2010 Doctoral Symposium - Barcelona, Spain - 26.09.10
3. vector space model
[Figure: item 1, item 2, ..., item n shown as points in the vector space]
4. vector space model
• Introduced by Salton in 1975
• Given a set of documents and given N features describing the
documents the VSM builds an N-dimensional Vector Space
• Each item is represented as a point in the Vector Space
• Application: Information Retrieval
• Query: point in the Vector Space
• Assumption: the nearest documents in the Vector Space
are the most relevant ones
• Cosine Similarity to compute the similarity between
query and documents
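As a minimal illustration of the retrieval step above, the following Python sketch ranks documents by their cosine similarity to a query. The toy vectors and the helper names `cosine` and `rank` are illustrative assumptions, not part of the presentation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return document ids sorted from most to least similar to the query."""
    return sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)

# Toy 3-feature vector space (hypothetical data).
docs = {"d1": [1, 0, 0], "d2": [1, 1, 0], "d3": [0, 0, 1]}
query = [1, 1, 0]
ranking = rank(query, docs)
```

Here `d2` points in the same direction as the query, so it is ranked first, while `d3` shares no feature with the query and is ranked last.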
5. idea
• To investigate the impact of Vector Space Models in the
area of Information Filtering
• “Information Filtering & Information Retrieval: two sides of the same
coin?”, Belkin & Croft, 1992
• Strong Analogies
• Documents to be retrieved vs. Items to be filtered
• Query vs. User Profiles
• Both IF and IR can share the same weighting
techniques (TF/IDF) and similarity measures
(Cosine similarity)
6. vsm analysis
• Strong Points
• State-of-the-art model for the IR
community
• Clean and Solid formalism
• Simplicity of calculations between
objects in a VSM
7. vsm analysis (2)
• Weak Points
• High Dimensionality
• NLP operations (stopword removal, stemming and so on)
• Not incremental
• The whole Vector Space has to be generated from scratch
whenever a new item is added to the repository
• Does not manage the latent semantics of documents
• Any permutation of the terms in a document has the same
VSM representation!
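The last weak point can be seen in a few lines of Python. This is only a sketch: it assumes a plain term-frequency (bag-of-words) representation with whitespace tokenization, and the two sentences are made-up examples.

```python
from collections import Counter

def bag_of_words(text):
    """Map a document to its term-frequency representation (word order is discarded)."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

# The two sentences mean different things but receive the same representation.
same_representation = (a == b)
```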
8. goals
• To introduce tools and techniques able
to overcome these drawbacks
• Random Indexing
• Dimensionality reduction technique
• Sahlgren, 2005
• Semantic Vectors
• Java open-source package
• Widdows, 2007
9. random indexing
• Random Indexing (RI) is an incremental and
effective technique for dimensionality reduction
• Introduced by Sahlgren in 2005
• Based on the so-called “Distributional
Hypothesis”
• “Words that occur in the same context tend to
have similar meanings”
• “Meaning is its use” (Wittgenstein)
10. how it works?
• Random Indexing reduces
the m-dimensional term/doc
matrix to a new
k-dimensional matrix
• How?
• By multiplying the original matrix
with a random one, built in an
incremental way
• formally: $A_{n,m} R_{m,k} = B_{n,k}$
• $k \ll m$
• After projection, the distance
between points in the vector space
is approximately preserved (Johnson-Lindenstrauss lemma)
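The projection above can be sketched in plain Python. This is not the incremental construction used by Random Indexing: the random matrix is built in one shot, and the entry distribution (+1 or -1 with probability 1/6 each, 0 otherwise) and the scaling factor sqrt(3/k) are assumptions chosen so that squared distances are preserved in expectation. The sizes n, m, k and all function names are illustrative.

```python
import math
import random

def random_matrix(m, k, rng):
    """m x k matrix with entries +1/-1 (prob. 1/6 each) and 0 (prob. 2/3),
    scaled by sqrt(3/k) so that squared lengths are preserved in expectation."""
    scale = math.sqrt(3.0 / k)
    return [[scale * rng.choice((-1, 0, 0, 0, 0, 1)) for _ in range(k)]
            for _ in range(m)]

def project(A, R):
    """Compute B = A R for matrices stored as lists of rows."""
    k = len(R[0])
    return [[sum(a * r[j] for a, r in zip(row, R)) for j in range(k)]
            for row in A]

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

rng = random.Random(0)
n, m, k = 4, 2000, 200          # k << m
A = [[rng.random() for _ in range(m)] for _ in range(n)]
R = random_matrix(m, k, rng)
B = project(A, R)

# Ratio between projected and original distance for every pair of rows.
ratios = [distance(B[i], B[j]) / distance(A[i], A[j])
          for i in range(n) for j in range(i + 1, n)]
```

Each value in `ratios` should be close to 1, which is what "the distance is preserved" means in practice: the preservation is approximate and improves as k grows.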
11. random matrix
• How is the random matrix built?
• The whole process is based on the concept of
“context”
• Given a term, its “context” is the set of other
words it co-occurs with
• The matrix is built in an iterative and incremental way
• The vector representing each document depends on the
terms that occur in it
• The vector representing each term depends on its context
(the other terms it co-occurs with)
12. item representation
• A context vector is assigned to each term. This
vector has a fixed dimension (k) and can contain only
values in {-1, 0, 1}. Values are distributed at random,
and the number of non-zero elements is much smaller than k.
• The Vector Space representation of a term is obtained
by summing the context vectors of the terms it co-
occurs with.
• The Vector Space representation of a document
(item) is obtained by summing the context vectors of
the terms that occur in it
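The three steps above can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation used in the presentation: the "context" of a term is taken to be every other token of the same document, and the values of `k`, `nonzero`, the function names and the two example documents are all hypothetical.

```python
import random

def random_context_vector(k, nonzero, rng):
    """Sparse ternary vector of length k with `nonzero` entries set to +1 or -1."""
    vec = [0] * k
    for pos in rng.sample(range(k), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec

def add_into(target, source):
    """Add `source` to `target` element by element, in place."""
    for i, value in enumerate(source):
        target[i] += value

def build_vectors(docs, k=50, nonzero=4, seed=0):
    """Return (context vectors, term vectors, document vectors) for tokenized docs."""
    rng = random.Random(seed)
    vocab = sorted({t for tokens in docs.values() for t in tokens})
    context = {t: random_context_vector(k, nonzero, rng) for t in vocab}
    term_vecs = {t: [0] * k for t in vocab}
    doc_vecs = {}
    for doc_id, tokens in docs.items():
        doc_vec = [0] * k
        for i, term in enumerate(tokens):
            # Document vector: sum of the context vectors of its terms.
            add_into(doc_vec, context[term])
            # Term vector: sum of the context vectors of co-occurring terms
            # (assumption: the whole document is the co-occurrence window).
            for j, other in enumerate(tokens):
                if i != j:
                    add_into(term_vecs[term], context[other])
        doc_vecs[doc_id] = doc_vec
    return context, term_vecs, doc_vecs

docs = {"m1": ["space", "alien", "ship"], "m2": ["alien", "ship", "war"]}
context, term_vecs, doc_vecs = build_vectors(docs)
```

Because every vector is a running sum, a new document only adds to the existing vectors, which is why the approach is incremental.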
13. ...summing up
• Random Indexing
• Dimensionality reduction technique
• Similar to LSA
• Incremental
• Tremendous saving of computational resources
• Manages the semantics of documents
• The position of a document (item) in the vector space
depends on the position of the terms that occur in the
document
• The position of a term depends on the position of the
other terms it co-occurs with!
14. recommendation models
• We developed two different
recommendation models
• Both based on vector space built
through Random Indexing
• Random Indexing-based
model (RI)
• Semantic Vectors-based
model (SV)
15. profile representation
• What about the user profiles?
• Assumption
• The information coming from documents (items) that
the user liked in the past could be a reliable source of
information for building user profiles
• The Vector Space representation of a user profile is obtained
by combining the context vectors of all the documents that the
user liked in the past.
• Definition of RI-based and SV-based models
• The difference lies in the way they exploit the vector space to
build user profiles
16. RI-based approach
[Formula, annotated with the labels Documents, Rate and Threshold: VSM representation of RI-based profile for user u]
17. RI-based approach
• The simplest user profile
• Combines the information coming from
previously liked documents in a uniform
way
• Different ratings are not managed!
• Definition of a weighted
counterpart, called W-RI
• Weighted Random Indexing
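The difference between the uniform and the weighted profile can be sketched as below. The exact formulas are not recoverable from this transcript, so everything here is an assumption: a document counts as "liked" when its rating is at least `threshold`, the RI profile is the plain sum of the liked document vectors, and the weighted variant scales each vector by `rating / max_rating`. Names and data are hypothetical.

```python
def build_profile(ratings, doc_vecs, threshold, weight):
    """Sum the vectors of the liked documents, each scaled by weight(rating)."""
    k = len(next(iter(doc_vecs.values())))
    profile = [0.0] * k
    for doc_id, rating in ratings.items():
        if rating >= threshold:          # assumption: "liked" means rating >= threshold
            w = weight(rating)
            for i, value in enumerate(doc_vecs[doc_id]):
                profile[i] += w * value
    return profile

def ri_profile(ratings, doc_vecs, threshold):
    """Uniform combination: every liked document contributes equally."""
    return build_profile(ratings, doc_vecs, threshold, lambda rating: 1.0)

def wri_profile(ratings, doc_vecs, threshold, max_rating=5):
    """Weighted combination: higher ratings contribute more (assumed weighting)."""
    return build_profile(ratings, doc_vecs, threshold,
                         lambda rating: rating / max_rating)

doc_vecs = {"a": [1, 0], "b": [0, 1], "c": [1, 1]}
ratings = {"a": 5, "b": 4, "c": 2}

uniform = ri_profile(ratings, doc_vecs, threshold=4)
weighted = wri_profile(ratings, doc_vecs, threshold=4)
```

In this toy case document `c` is ignored by both profiles, while the weighted profile leans toward `a` (rated 5) rather than `b` (rated 4).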
18. wRI-based approach
[Formula, annotated with the labels Documents, Rate and Threshold: VSM representation of wRI-based profile for user u]
19. wRI-based approach
• Both models inherit a classical problem
of VSM
• User profiles modeled only according
to positive preferences
• In classical text classifiers (Naive Bayes, SVM,
etc.) both positive and negative preferences
are modeled
• Definition of Semantic Vectors (SV)
based model to tackle this problem
20. semantic vectors
• Open-source package written in Java
• Implements a Random Indexing-based approach
for document indexing
• Integrates a negation operator based on
quantum mechanics
• Queries such as “A NOT B” are allowed!
• Projection of vector A onto the subspace orthogonal to
the one generated by vector B
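For a single vector B, this projection has a simple closed form: subtract from A its component along B. The sketch below shows only that arithmetic in plain Python; it is not the API of the Semantic Vectors package, and the function name `negate` and the example vectors are assumptions.

```python
def negate(a, b):
    """Return "a NOT b": the projection of a onto the subspace orthogonal to b."""
    bb = sum(y * y for y in b)
    if bb == 0:
        return list(a)                  # nothing to remove if b is the zero vector
    coeff = sum(x * y for x, y in zip(a, b)) / bb
    return [x - coeff * y for x, y in zip(a, b)]

a = [2.0, 1.0, 0.0]
b = [1.0, 0.0, 0.0]
a_not_b = negate(a, b)

# The result has no component left along b.
overlap = sum(x * y for x, y in zip(a_not_b, b))
```

The resulting vector keeps what A does not share with B, so items similar to it are similar to A but unrelated to B.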
21. SV-based approach
[Formulas: Positive User Profile Vector and Negative User Profile Vector, the VSM representation of SV-based profile for user u]
22. wSV-based approach
[Formulas: Positive User Profile Vector and Negative User Profile Vector, the VSM representation of wSV-based profile for user u]
23. recommendation step
• Given a user profile u and a set of items, we can suppose that the most
relevant items for u are the nearest ones in the vector space
• RI and wRI: submission of a query based on the user profile vector
• SV and wSV: submission of a query based on p+ and p- (of the form “p+ NOT p-”)
• Returns the items with as many features as possible from p+ and as
few features as possible from p-
• Cosine Similarity to rank the items
• Items whose similarity is under a certain threshold are labeled as non-relevant
and filtered
• Recommendation of the items with the highest similarity w.r.t.
the user profile
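The ranking and filtering steps can be sketched as follows. The query vector stands for whatever the model submits (the profile for RI and wRI, or a combination of p+ and p- for SV and wSV); the threshold value, the `top_n` cutoff, the function names and the toy data are assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(query, items, threshold, top_n):
    """Score items against the query, drop those under the threshold,
    and return the ids of the top_n most similar remaining items."""
    scored = [(item_id, cosine(query, vec)) for item_id, vec in items.items()]
    relevant = [(item_id, score) for item_id, score in scored if score >= threshold]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in relevant[:top_n]]

items = {"i1": [1, 0], "i2": [1, 1], "i3": [0, 1], "i4": [-1, 0]}
profile = [1, 0.2]
recommended = recommend(profile, items, threshold=0.5, top_n=2)
```

With these values `i3` and `i4` fall under the threshold and are filtered out, leaving `i1` and `i2` in that order.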
24. experimental evaluation
• MovieLens 100k dataset
• Content-based information crawled from
Wikipedia
• Movies without a Wikipedia entry were deleted
• 613 users, 520 items, 40k ratings
• 5-fold cross validation
• Average Precision @1, @3, @5, @7, @10
• NLP processing: stopword removal
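The slides do not spell out how Average Precision @n is computed. One common reading, assumed here, is precision over the top-n recommended items, averaged over users. The sketch below follows that assumption; the function names and the toy rankings are hypothetical.

```python
def precision_at_k(ranked_items, relevant_items, k):
    """Fraction of the top-k ranked items that are relevant."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / k

def mean_precision_at_k(rankings, relevant, k):
    """Average of precision@k over all users in `rankings`."""
    users = list(rankings)
    return sum(precision_at_k(rankings[u], relevant[u], k) for u in users) / len(users)

# Toy data: per-user ranked recommendations and the items each user actually liked.
rankings = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
relevant = {"u1": {"a", "c"}, "u2": {"e"}}

p_at_1 = mean_precision_at_k(rankings, relevant, 1)
p_at_3 = mean_precision_at_k(rankings, relevant, 3)
```

In a cross-validation setting, the same computation would be repeated on each fold and the per-fold values averaged.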
25. experimental design
• Experiment 1
• Does the weighting scheme
improve the predictive accuracy of the
recommendation models?
• Experiment 2
• Does the introduction of a negation
operator improve the predictive
accuracy of the recommendation models?
26. results - experiment 1
[Charts: Average Precision (AVP@1, AVP@5, AVP@10) for RI vs. W-RI and for SV vs. W-SV]
• Our weighting model (even in this naive form) improves the
predictive accuracy of both RI-based and SV-based models
27. results - experiment 2
[Charts: Average Precision (AVP@1, AVP@5, AVP@10) for RI vs. SV and for W-RI vs. W-SV]
• The integration of a negation operator based on quantum mechanics
improves the predictive accuracy of both the unweighted (RI vs. SV) and the weighted (W-RI vs. W-SV) models
28. results
                   RI      W-RI    SV      W-SV    Bayes
Av-Precision@1     85.93   86.33   85.97   86.78   86.39
Av-Precision@3     85.78   85.97   86.19   86.33   85.97
Av-Precision@5     85.75   86.10   85.99   86.16   85.83
Av-Precision@7     85.61   85.92   85.88   85.95   85.77
Av-Precision@10    85.45   85.76   85.76   85.83   85.75
• W-SV improves the Average Precision
with respect to the Naive Bayes approach
(currently implemented in our
recommender system) at every cutoff;
SV and W-RI match or exceed it from @3
onward, while plain RI stays below it
29. conclusions
• Investigation of the impact of enhanced VSM in
the area of content-based recommender systems
• Use of Random Indexing for dimensionality
reduction
• Definition of RI and SV-based models
• Encouraging experimental results
• First results improve the predictive accuracy
obtained by classical content-based filtering
techniques (e.g. Bayes)
30. open issues & future works
• Work-in-progress
• Experimental Evaluation on a classical TF/IDF-based VSM
• Open Issues
• Looking for a state-of-the-art dataset for the evaluation of content-
based recommendation models
• Future Work
• Comparison of the predictive accuracy with different NLP steps
(stemming, entity recognition, POS-tagging and so on)
• Integration of Social Media (Facebook, Twitter, LinkedIn) for building
accurate user profiles by skipping the training step
• Integration of Linked Data-based representation (by exploiting
DBpedia data) to exploit explicit relationships between concepts
31. http://www.di.uniba.it/~swap/
discussion
Cataldo Musto - cataldomusto@di.uniba.it
University of Bari (Italy), SWAP Research Group
ACM Recsys 2010 Doctoral Symposium