Linked Open Data to Support Content-based Recommender Systems
Tommaso Di Noia, Roberto Mirizzi, Vito Claudio Ostuni, Davide Romito, Markus Zanker
I-SEMANTICS 2012, 8th Int. Conf. on Semantic Systems, Sept. 5-7, 2012, Graz, Austria.
The World Wide Web is moving from a Web of hyper-linked Documents to a Web of linked Data. Thanks to the Semantic Web spread and to the more recent Linked Open Data (LOD) initiative, a vast amount of RDF data have been published in freely accessible datasets. These datasets are connected with each other to form the so called Linked Open Data cloud. As of today, there are tons of RDF data available in the Web of Data, but only few applications really exploit their potential power. In this paper we show how these data can successfully be used to develop a recommender system (RS) that relies exclusively on the information encoded in the Web of Data. We implemented a content-based RS that leverages the data available within Linked Open Data datasets (in particular DBpedia, Freebase and LinkedMDB) in order to recommend movies to the end users. We extensively evaluated the approach and validated the effectiveness of the algorithms by experimentally measuring their accuracy with precision and recall metrics.
Designing IA for AI - Information Architecture Conference 2024
Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC…
1. LINKED OPEN DATA TO SUPPORT
CONTENT-BASED RECOMMENDER SYSTEMS
Tommaso Di Noia1, Roberto Mirizzi2, Vito Claudio Ostuni1, Davide Romito1, Markus Zanker3
t.dinoia@poliba.it, roberto.mirizzi@hp.com, ostuni@deemail.poliba.it, romito@deemail.poliba.it, markus.zanker@uni-klu.ac.at
1Politecnico di Bari 2HP Labs 3Alpen-Adria-Universität Klagenfurt
Via Orabona, 4 1501 Page Mill Road Universitätsstraße 65 -67
70125 Bari (ITALY) Palo Alto, CA (US) 94304 9020 Klagenfurt, Austria
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
2. Outline
What are (Content-based) Recommender Systems?
The main drawback: limited content analysis
Vector Space Model for Linked Open Data (LOD)
Vector Space Model adapted to RDF graphs
A Semantic Content-based Recommender System
A Memory-based algorithm which uses a LOD-based item similarity measure
Evaluation
Precision and Recall experiments with MovieLens
Conclusion
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
3. Recommender Systems
A definition
Recommender Systems (RSs) are software tools and techniques
providing suggestions for items to be of use to a user.
[F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors. Recommender Systems Handbook. Springer, 2011.]
Input Data:
A set of users U={u1, …, uN}
A set of items I={i1, …, iM}
The rating matrix R=[ru,i]
Problem Definition:
Given user u and target item i
Predict the rating ru,i
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
4. Content-based Recommender Systems
CB-RSs recommend items to a user based on their description and on the profile
of the user’s interests *
Item1, 5
Item2, 1
Item5, 4
Item10, 5
…. Top-N Recommendations
User profile
Item7
Recommender Item15
System Item11
…
Items
Item1 Item2
….
Item’s Item100
descriptions
(*) Pazzani, M. J., & Billsus, D. Content-Based Recommendation Systems. The Adaptive Web. Lecture Notes in Computer Science vol. 4321, 325-341, 2007
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
5. Main CB RS Drawback: Limited Content Analysis
No suggestion is available if the analyzed content does not contain enough
information to discriminate items the user might like from items the user
might not like.*
The quality of CB recommendations are correlated with the quality of
the features that are explicitly associated with the items.
Need of domain knowledge!
We need rich descriptions of the items!
(*) P. Lops, M. de Gemmis, G. Semeraro. Content-based Recommender Systems: State of the Art and Trends. In: P. Kantor, F. Ricci, L. Rokach and B. Shapira,
editors, Recommender Systems Handbook: A Complete Guide for Research Scientists & Practitioners
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
6. A Linked Data based Solution
Use Linked Data to mitigate
the limited content analysis
issue
Plenty of structured data
available No Content
Analyzer required
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
7. LINKED DATA as
structured information source for item’s descriptions
Let’s use all this ontological knowledge to build smarter CB RSs
Rich items descriptions
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
8. Computing similarity in LOD datasets
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
9. Vector Space Model for LOD (i)
Quick recap on Vector Space Model
Vector Space Model is an algebraic model
for representing both text documents
and queries as vectors of index terms wt,d
that are positive and non-binary.
T
vd w1,d , w2,d ,..., wN ,d
wt ,d tft ,d idft
nt ,d D
tft ,d idft log
k
nk ,d d D t d
' '
[http://en.wikipedia.org/wiki/File:Vector_space_model.jpg]
N
d j dq wi , j wi ,q
sim(d j , q) i 1
N N
dj q
i 1
w2 i , j i 1
w2 i , q
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
10. Vector Space Model for LOD(ii)
Righteous Kill
Heat
Righteous Kill
Al Pacino
Robert De Niro
Brian Dennehy
Heat
Robert De Niro starring
Al Pacino
Brian Dennehy
John Avnet
Serial killer films
Heist films
Crime films genre
subject/broader
Drama director
starring
Crime films
Heat
Brian Dennehy
Drama
John Avnet
Righteous Kill
Heist films
Robert De Niro
Al Pacino
Serial killer films
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
11. Vector Space Model for LOD(iii)
Robert Brian
Al Pacino
STARRING De Niro Dennehy
(a1)
(a2) (a3)
Righteous
Kill (m1)
Heat (m2)
Righteous Kill Heat
wactorx ,moviey tf actorx ,moviey idf actorx
Righteous Kill (m1) wa1,m1 wa2,m1 wa3,m1
Heat (m2) wa1,m2 wa2,m2 0
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
13. Semantic Content-based Recommender
Given a user profile, defined as:
profile(u) m j , v j v j =1 if u likes m j , v j =-1 otherwise
We predict the rating using a Nearest Neighbor Classifier wherein the similarity
measure is a linear combination of local property similarities
p sim p (m j , mi )
m j profile ( u )
vj p
P
r (u , mi )
profile(u )
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
14. Training the system(i)
In order to identify the best possible values for the coefficients p
(i.e., the weights associated to the properties), we train the system
via a genetic algorithm.
Fitness function: Minimize the number of misclassification errors ei on the
training data (user profile)
Optimization
training data user u optimal values
Item1, 1 Min ei (p1 p2 p3 ….)
Item2, -1
Item5, 1
….
| profile ( u )|
User profile
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
15. Training the system(ii)
In some cases (e.g. new user problem) the user could have not rated any item yet.
The user-profile is empty. We cannot learn the αp coefficients!
Look at Amazon.com
Use Amazon’s collaborative results to capture movie similarities
We collected a set of 1000 movies from Amazon.
For each one of these movies we look at the correspondent recommendation list.
Righteous Kill Heat
Increment the weights αp associated to
First suggestion
the common properties between the
two movies.
e.g. They have same actors in common
and no directors. Hence we can increase
the weight of the property starring.
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
16. Experiment settings(i)
MovieLens 1M dataset
One-One mapping between MovieLens and DBpedia
Using SPARQL queries and Levensthein Distance
3,654 matched movies on 3,952
Binarization of the 1-5 rating scale
profile(u) m j , v j v j =1 if r(u,m j ) ru , v j =-1 otherwise
Evaluation goal : Top-N recommendations
Metrics: Precision@n + Recall@n
Rec @ N TestSet Rec @ N TestSet
P@ N R@ N N 1, 2...20
N TestSet
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
17. Experiment settings(ii)
Properties
dcterms:subject + skos:broader + DBpedia Ontology +
Freebase + LinkedMDB genres
Extracted Graph
53,840 actors, 18,149 directors, 29,352 distinct writers and
27,035 categories from DBpedia
667 genres from Freebase
26 genres from LinkedMDB
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
18. Alpha-coefficients evaluation
The α-coefficents obtained
with the genetic algorithm
give us the best
performance.
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
19. Property subset evaluation
The subject+broader
solution is better than only
subject or subject+more
broaders.
Too many broaders
introduce noise.
The best solution is
achieved with
subject+broader+
genres.
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
20. Evaluation against other approaches
Our solution outperforms
a Linked Data approach
(LDSD) and others
content-based which do
not leverage LOD.
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
21. Conclusion & Future directions
The huge amount of data available on Linked Data datasets can be successfully
exploited to overcome limited content analysis.
We have presented a semantic version of the classical vector space model to
compute item similarities.
Evaluation against historical datasets and high values of precision and recall prove
the validity of our approach.
We are currently working on:
Testing the approach with different domains
Improving the recommendation with a hybrid approach (content-based and collaborative filtering)
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria
22. Q&A
We acknowledge partial support of HP IRP 2011. Grant CW267313.
I-SEMANTICS 2012 – 8th Int. Conference on Semantic Systems
September 5-7, 2012 Graz, Austria