Enhanced Vector Space Models for Content-based Recommender Systems

ACM Recommender Systems 2010
Barcelona, Spain

Cataldo Musto - cataldomusto@di.uniba.it

University of Bari “Aldo Moro” (Italy), SWAP Research Group
ACM Recsys 2010 Doctoral Symposium
26.09.10

outline 2/30

• Motivations
• Goals
• Analysis of Vector Space Models

• Enhanced Vector Space Models
• Random Indexing-based model
• Semantic Vectors-based model

• Experimental Evaluation
• Open Issues
• Future Work

Cataldo Musto, Enhanced Vector Space Models for Content-based Recommender Systems - ACM RecSys 2010 Doctoral Symposium - Barcelona, Spain - 26.09.10

vector space model 3/30

[Figure: item 1, item 2, ..., item n in the vector space]


vector space model 4/30

• Introduced by Salton in 1975
• Given a set of documents and given N features describing the
documents the VSM builds an N-dimensional Vector Space
• Each item is represented as a point in the Vector Space
• Application: Information Retrieval
• Query: point in the Vector Space
• Assumption: the nearest documents in the Vector Space
are the most relevant ones
• Cosine Similarity to compute the similarity between
query and documents
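
The retrieval step above can be sketched in a few lines of Python. This is only an illustration: the three-feature space, the document ids and the `rank` helper are hypothetical and do not come from the slides.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, documents):
    """Return document ids sorted from most to least similar to the query."""
    return sorted(documents, key=lambda doc_id: cosine(query, documents[doc_id]), reverse=True)

# Hypothetical 3-feature space: (space, robot, romance)
documents = {
    "doc_a": [3, 1, 0],
    "doc_b": [0, 0, 4],
    "doc_c": [1, 1, 1],
}
query = [1, 1, 0]
ranking = rank(query, documents)  # nearest documents first
```

Because cosine similarity depends only on the angle between vectors, a long document is not favoured over a short one with the same term proportions.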


idea 5/30

• To investigate the impact of Vector Space Models in the
area of Information Filtering
• “Information Filtering & Information Retrieval: two sides of the same
coin?”, Belkin & Croft, 1992

• Strong Analogies
• Documents to be retrieved vs. Items to be filtered
• Query vs. User Profiles
• Both IF and IR can share the same weighting
techniques (TF/IDF) and similarity measures
(Cosine similarity)
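
As a reminder of the shared weighting technique, here is a minimal TF/IDF sketch. It assumes the plain variant (raw term frequency times log(N / document frequency)); the slides do not say which variant is used, and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Map each document id to a {term: weight} dict using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

# Hypothetical tokenised item descriptions
docs = {
    "d1": ["alien", "ship", "alien"],
    "d2": ["ship", "war"],
}
weights = tf_idf(docs)
```

A term that appears in every document ("ship" here) gets weight 0, so it contributes nothing to the similarity between a profile and an item.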


vsm analysis 6/30

• Strong Points
• State-of-the-art model for the IR
community
• Clean and Solid formalism
• Simplicity of computations between
objects in a VSM


vsm analysis (2) 7/30

• Weak Points
• High Dimensionality
• NLP operations (stopwords elimination, stemming and so on)
• Not incremental
• The whole Vector Space has to be generated from scratch
whenever a new item is added to the repository
• Does not capture the latent semantics of documents
• Word order is ignored: any permutation of the terms in a
document has the same VSM representation!
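
The last weak point is easy to check. The sketch below uses a hypothetical three-word vocabulary and a simple whitespace tokeniser; two sentences with opposite meanings end up with the same vector.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs in the text, ignoring word order."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocabulary = ["dog", "bites", "man"]  # hypothetical feature set
v1 = bag_of_words("dog bites man", vocabulary)
v2 = bag_of_words("man bites dog", vocabulary)
same_representation = (v1 == v2)
```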


goals 8/30

• To introduce tools and techniques able
to overcome these drawbacks
• Random Indexing
• Dimensionality reduction technique
• Sahlgren, 2005
• Semantic Vectors
• Java open-source package
• Widdows, 2007

random indexing 9/30

• Random Indexing (RI) is an incremental and
effective technique for dimensionality reduction
• Introduced by Sahlgren in 2005

• Based on the so-called “Distributional
Hypothesis”
• “Words that occur in the same context tend to
have similar meanings”
• “Meaning is its use” (Wittgenstein)


how does it work? 10/30

• Random Indexing reduces
the m-dimensional term/doc
matrix to a new
k-dimensional matrix

• How?
• By multiplying the original matrix
with a random one, built in an
incremental way
• formally: $A_{n,m} R_{m,k} = B_{n,k}$
• $k \ll m$
• After projection, the distances between points in the vector
space are approximately preserved
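
The projection can be tried out with the standard library alone. The sketch assumes a dense Gaussian random matrix scaled by 1/sqrt(k), which is a common textbook choice for random projection and not necessarily what Random Indexing implementations use; the sizes n, m, k and the seed are arbitrary.

```python
import math
import random

def random_matrix(m, k, rng):
    """m x k matrix with Gaussian entries scaled by 1/sqrt(k) (an assumed choice)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(m)]

def project(A, R):
    """Compute B = A R for matrices stored as lists of rows."""
    k = len(R[0])
    return [[sum(a * R[j][c] for j, a in enumerate(row)) for c in range(k)] for row in A]

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(0)   # fixed seed so the run is repeatable
n, m, k = 4, 1000, 200   # k is much smaller than m
A = [[rng.random() for _ in range(m)] for _ in range(n)]
R = random_matrix(m, k, rng)
B = project(A, R)

# Ratio between the distance after and before projection; expected to be close to 1
ratio = distance(B[0], B[1]) / distance(A[0], A[1])
```

The ratio is close to 1 but not exactly 1: distances are preserved only approximately, and the approximation gets tighter as k grows.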


random matrix 11/30

• How is the random matrix built?
• The whole process is based on the concept of
“context”
• Given a term, its “context” is the set of other
words it co-occurs with

• The matrix is built in an iterative and incremental way
• The vector representing each document depends on the
terms that occur in it
• The vector representing each term depends on its context
(the other terms it co-occurs with)


item representation 12/30

• A context vector is assigned to each term. This
vector has a fixed dimension (k) and can contain only
values in {-1, 0, 1}. Values are distributed randomly,
but the number of non-zero elements is much smaller than the number of zeros.
• The Vector Space representation of a term is obtained
by summing the context vectors of the terms it co-occurs with.
• The Vector Space representation of a document
(item) is obtained by summing the context vectors of
the terms that occur in it
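
The three steps above can be sketched as follows. Several choices are assumptions made for the example: the "context" of a term is taken to be the whole document, k = 50 with 4 non-zero entries, and the function and variable names are hypothetical.

```python
import random

def random_ternary_vector(k, nonzero, rng):
    """Sparse vector of size k: `nonzero` random positions hold +1 or -1, the rest 0."""
    vec = [0] * k
    for pos in rng.sample(range(k), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec

def add_into(target, source):
    """Add `source` to `target` element by element, in place."""
    for i, value in enumerate(source):
        target[i] += value

def build_vectors(docs, k=50, nonzero=4, seed=0):
    """Return (context vectors, term representations, document representations)."""
    rng = random.Random(seed)
    vocab = sorted({t for tokens in docs.values() for t in tokens})
    context_vecs = {t: random_ternary_vector(k, nonzero, rng) for t in vocab}
    term_vecs = {t: [0] * k for t in vocab}
    doc_vecs = {}
    for doc_id, tokens in docs.items():
        doc_vec = [0] * k
        for term in tokens:
            # Document: sum of the context vectors of the terms that occur in it
            add_into(doc_vec, context_vecs[term])
            # Term: sum of the context vectors of the other terms it co-occurs with
            for other in tokens:
                if other != term:
                    add_into(term_vecs[term], context_vecs[other])
        doc_vecs[doc_id] = doc_vec
    return context_vecs, term_vecs, doc_vecs

# Hypothetical tokenised item descriptions
docs = {
    "d1": ["space", "alien", "ship"],
    "d2": ["alien", "ship", "war"],
}
context_vecs, term_vecs, doc_vecs = build_vectors(docs)
```

Adding a new document only requires summing into the existing vectors, which is what makes the approach incremental.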


...summing up 13/30

• Random Indexing
• Dimensionality reduction technique
• Similar to LSA
• Incremental
• Tremendous saving of computational resources
• Manages the semantics of documents
• The position of a document (item) in the vector space
depends on the position of the terms that occur in the
document
• The position of a term depends on the position of the
other terms it co-occurs with!


recommendation models 14/30

• We developed two different
recommendation models
• Both based on vector space built
through Random Indexing
• Random Indexing-based
model (RI)
• Semantic Vectors-based
model (SV)


profile representation 15/30

• What about the user profiles?
• Assumption
• The information coming from documents (items) that
the user liked in the past could be a reliable source of
information for building user profiles
• The Vector Space representation of a user profile is obtained
by combining the context vectors of all the documents that the
user liked in the past.

• Definition of RI-based and SV-based models
• The difference lies in the way they exploit the vector space to
build user profiles


RI-based approach 16/30

[Formula: VSM representation of the RI-based profile for user u, defined in terms of the documents, their ratings and a rating threshold]

RI-based approach 17/30

• The simplest user profile
• Combines the information coming from
previously liked documents in a uniform
way
• Different ratings are not taken into account!
• Definition of a weighted
counterpart, called W-RI
• Weighted Random Indexing

wRI-based approach 18/30

[Formula: VSM representation of the wRI-based profile for user u, defined in terms of the documents, their ratings and a rating threshold]
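
One possible reading of the two profiles is sketched below. The exact formulas are not recoverable from the slides, so two things are assumed: a document counts as "liked" when its rating is greater than or equal to the threshold, and the weighted variant multiplies each document vector by the raw rating. The function names and toy data are hypothetical.

```python
def ri_profile(doc_vecs, ratings, threshold):
    """Unweighted profile: sum of the vectors of documents rated >= threshold (assumed rule)."""
    k = len(next(iter(doc_vecs.values())))
    profile = [0.0] * k
    for doc_id, rating in ratings.items():
        if rating >= threshold:
            for i, value in enumerate(doc_vecs[doc_id]):
                profile[i] += value
    return profile

def wri_profile(doc_vecs, ratings, threshold):
    """Weighted profile: same documents, each scaled by its rating (assumed weighting)."""
    k = len(next(iter(doc_vecs.values())))
    profile = [0.0] * k
    for doc_id, rating in ratings.items():
        if rating >= threshold:
            for i, value in enumerate(doc_vecs[doc_id]):
                profile[i] += rating * value
    return profile

# Hypothetical 2-dimensional document vectors and ratings on a 1-5 scale
doc_vecs = {"m1": [1, 0], "m2": [0, 1], "m3": [1, 1]}
ratings = {"m1": 5, "m2": 4, "m3": 2}
p_ri = ri_profile(doc_vecs, ratings, threshold=4)
p_wri = wri_profile(doc_vecs, ratings, threshold=4)
```

In the unweighted profile m1 and m2 count equally; in the weighted one the 5-star item pulls the profile further in its own direction.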

wRI-based approach 19/30

• Both models inherit a classical problem
of VSM
• User profiles modeled only according
to positive preferences
• In classical text classifiers (Naive Bayes, SVM,
etc.) both positive and negative preferences
are modeled

• Definition of Semantic Vectors (SV)
based model to tackle this problem


semantic vectors 20/30

• Open-source package written in Java
• Implements a Random Indexing-based approach
for document indexing

• Integrates a negation operator based on
quantum mechanics
• Queries such as “A not B” are allowed!
• Projection of vector A onto the subspace orthogonal to
the one generated by vector B
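
For a single negated vector, the projection described above reduces to subtracting from A its component along B. The sketch below shows that computation directly; it does not use the Semantic Vectors package, and the helper names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def negate(a, b):
    """Return 'a NOT b': the component of a orthogonal to b."""
    bb = dot(b, b)
    if bb == 0:
        return list(a)  # nothing to remove when b is the zero vector
    scale = dot(a, b) / bb
    return [x - scale * y for x, y in zip(a, b)]

a = [2.0, 1.0, 0.0]
b = [1.0, 0.0, 0.0]
a_not_b = negate(a, b)             # the part of a with no component along b
orthogonality = dot(a_not_b, b)    # zero by construction
```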


SV-based approach 21/30

[Formula: VSM representation of the SV-based profile for user u, consisting of a positive user profile vector and a negative user profile vector]
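
A minimal sketch of how the two profile vectors could be built, assuming that documents rated at or above a threshold feed the positive vector and all other rated documents feed the negative one. The split rule, the function name and the toy data are assumptions, not taken from the slides.

```python
def sv_profiles(doc_vecs, ratings, threshold):
    """Return (positive, negative) profile vectors from rated documents (assumed split rule)."""
    k = len(next(iter(doc_vecs.values())))
    positive = [0.0] * k
    negative = [0.0] * k
    for doc_id, rating in ratings.items():
        target = positive if rating >= threshold else negative
        for i, value in enumerate(doc_vecs[doc_id]):
            target[i] += value
    return positive, negative

# Hypothetical 3-dimensional document vectors and ratings on a 1-5 scale
doc_vecs = {"m1": [1, 0, 0], "m2": [0, 1, 0], "m3": [0, 0, 1]}
ratings = {"m1": 5, "m2": 4, "m3": 1}
p_plus, p_minus = sv_profiles(doc_vecs, ratings, threshold=4)
```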


wSV-based approach 22/30

[Formula: VSM representation of the wSV-based profile for user u, consisting of a positive user profile vector and a negative user profile vector]


recommendation step 23/30

• Given a user profile u and a set of items, we can assume that the most
relevant items for u are the nearest ones in the vector space
• RI and wRI: Submission of a query based on the user profile vector
• SV and wSV: Submission of a query based on p+ NOT p-
• Returns the items with as many features as possible from p+ and as
few features as possible from p-

• Cosine Similarity to rank the items
• Items whose similarity is under a certain threshold are labeled as non-relevant
and filtered out
• Recommendation of the items with the highest similarity w.r.t.
the user profile
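
The recommendation step for the SV case can be sketched end to end: build the query as the positive profile with the negative profile projected out, score every item by cosine similarity, drop items below a threshold and return the best ones. The threshold value, the `recommend` helper and the toy vectors are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity (0.0 if either vector is all zeros)."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def negate(a, b):
    """Return 'a NOT b': the component of a orthogonal to b."""
    bb = dot(b, b)
    if bb == 0:
        return list(a)
    scale = dot(a, b) / bb
    return [x - scale * y for x, y in zip(a, b)]

def recommend(query, item_vecs, threshold, top_n):
    """Rank items by cosine similarity to the query, filtering those below threshold."""
    scored = [(cosine(query, vec), item_id) for item_id, vec in item_vecs.items()]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by id
    return [item_id for _, item_id in kept[:top_n]]

# Hypothetical profile vectors and item vectors
p_plus = [1.0, 1.0, 0.0]
p_minus = [0.0, 1.0, 0.0]
items = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [1, 0, 1]}

query = negate(p_plus, p_minus)   # SV / wSV style query
recommended = recommend(query, items, threshold=0.5, top_n=2)
```

For the RI and wRI models the same `recommend` call applies, with the single profile vector passed as the query.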


experimental evaluation 24/30

• 100k Movielens Dataset
• Content-based information crawled from
Wikipedia
• Movies without a Wikipedia entry were deleted
• 613 users, 520 items, 40k ratings
• 5-fold cross validation
• Average Precision @1, @3, @5, @7, @10
• NLP processing: stopwords elimination
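
The slides do not spell out how "Average Precision @k" is computed. The sketch below assumes it means precision at cutoff k averaged over users, which is one common reading; the function names and toy rankings are hypothetical.

```python
def precision_at_k(ranked_items, relevant_items, k):
    """Fraction of the top-k recommended items that are relevant to the user."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / k

def mean_precision_at_k(rankings, relevant, k):
    """Average precision@k over all users (assumed meaning of 'Average Precision @k')."""
    values = [precision_at_k(rankings[user], relevant[user], k) for user in rankings]
    return sum(values) / len(values)

# Hypothetical rankings and relevance judgements for two users
rankings = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
relevant = {"u1": {"a", "c"}, "u2": {"e"}}
avp_at_1 = mean_precision_at_k(rankings, relevant, 1)
avp_at_3 = mean_precision_at_k(rankings, relevant, 3)
```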


experimental design 25/30

• Experiment 1
• Does the weighting schema
improve the predictive accuracy of the
recommendation models?
• Experiment 2
• Does the introduction of a negation
operator improve the predictive
accuracy of the recommendation models?


results - experiment 1 26/30

[Charts: AVP@1, AVP@5 and AVP@10, comparing RI with W-RI and SV with W-SV]

• Our weighting model (even in this naive form) improves the
predictive accuracy of both RI-based and SV-based models

results - experiment 2 27/30

[Charts: AVP@1, AVP@5 and AVP@10, comparing RI with SV and W-RI with W-SV]

• The integration of a negation operator based on quantum mechanics
improves the predictive accuracy of both the unweighted (RI vs. SV) and the weighted (W-RI vs. W-SV) models

results 28/30
Metric             RI      W-RI    SV      W-SV    Bayes
Av-Precision@1     85.93   86.33   85.97   86.78   86.39
Av-Precision@3     85.78   85.97   86.19   86.33   85.97
Av-Precision@5     85.75   86.10   85.99   86.16   85.83
Av-Precision@7     85.61   85.92   85.88   85.95   85.77
Av-Precision@10    85.45   85.76   85.76   85.83   85.75

• W-SV improves the Average Precision over the Naive Bayes approach
(currently implemented in our recommender system) at every cutoff
• SV and W-RI match or exceed Naive Bayes from @3 onwards,
while plain RI stays slightly below it

conclusions 29/30

• Investigation of the impact of enhanced VSM in
the area of content-based recommender systems
• Use of Random Indexing for dimensionality
reduction
• Definition of RI and SV-based models
• Encouraging experimental results
• First results improve the predictive accuracy
obtained by classical content-based filtering
techniques (e.g. Bayes)


open issues & future work 30/30

• Work-in-progress
• Experimental Evaluation on a classical TF/IDF-based VSM
• Open Issues
• Looking for a state-of-the-art dataset for the evaluation of content-
based recommendation models
• Future Work
• Comparison of the predictive accuracy with different NLP steps
(stemming, entity recognition, POS-tagging and so on)
• Integration of Social Media (Facebook, Twitter, LinkedIn) for building
accurate user profiles by skipping the training step
• Integration of Linked Data-based representation (by exploiting
DBPedia data) to exploit explicit relationships between concepts


http://www.di.uniba.it/~swap/

discussion

Cataldo Musto - cataldomusto@di.uniba.it

University of Bari (Italy), SWAP Research Group
ACM Recsys 2010 Doctoral Symposium
