In this paper we describe the KTIML team's approach to the RuleML 2015 Rule-based Recommender Systems for the Web of Data Challenge Track. The task is to estimate the top 5 movies for each user separately in a semantically enriched MovieLens 1M dataset. We have three results. The best is a domain-specific method like "recommend for all users the same set of movies from Spielberg". Our contributions are domain-independent data mining methods tailored for top-k, which combine second order logic data aggregations and transformations of metadata (especially the 5003 open data attributes) with general GAP rules mining methods.
2. Content
• Data
• Task
• Mining – heuristics, domain specific, …
• Some results
• Mining – transferable methods, data aggregations
• Some results
• Oracle DB Data Miner
• Second order logic GAP rules
• Conclusions
RuleML-2015 Challenge Rule-based RS
for the web of data
Transformation and aggregation preprocessing for top-k
recommendation GAP rules induction
4. Task
• Run the Python script on the train data – the intermediate join result is big and redundant (for each UserID, MovieID pair the 5003 movie attributes repeat)
• For each user find 5 movies that best match a user profile top5(u)
• Submit in CSV format: userId, movieId, score
• Observations
• The score does not affect the system response; only (unordered) sets are compared
• P, R, F@5 are computed between top5(u) and a target of varying size (the estimated average target size is 9.4 or 8, depending on assumptions)
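The set-based comparison described above can be sketched as follows. This is a minimal illustration, not the challenge's evaluation script: it assumes precision is hits divided by k = 5, recall is hits divided by the target size, and F is their harmonic mean, averaged over users. The function names `prf_at_k` and `mean_f_at_k` are hypothetical.

```python
def prf_at_k(recommended, target, k=5):
    """Set-based precision, recall and F for one user.

    Order and scores of the recommendations are ignored; only the
    overlap between the top-k set and the target set counts.
    """
    top_k = set(list(recommended)[:k])
    target = set(target)
    hits = len(top_k & target)
    precision = hits / k  # assumption: always divide by k
    recall = hits / len(target) if target else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f


def mean_f_at_k(top5, targets, k=5):
    """Average F@k over all users that have a target set (assumed macro-average)."""
    scores = [prf_at_k(top5.get(user, []), target, k)[2]
              for user, target in targets.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

With a target of size 8 and two hits among five recommendations, this gives P = 0.4 and R = 0.25, which shows why a target larger than 5 caps the achievable recall.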
5. Mining – heuristics, domain specific, …
• 5003 DBpedia attributes – most frequent, clusters of properties; tried mining, no relevant results (acquaintance with data)
• per attribute:
• relative frequency in ratings, NLP extraction:
MAKEUP, VISUAL, SMIX, SEDIT, SPIELBERG, NY, CALIF, NOVELS, CAMERON, LA, ARIZONA, WILLIAMS
• KSI Pure first order logic with weighted average F = 0.05262 (our third)
• 0-1 order agreement with ratings (good properties)
• 100*Movies.Spielberg + 50*Movies.Original + Movies.BayesAVG
• SCS_CUNI “Spielberg” F = 0.10681 (our best)
• The table Xratings downloaded by the script (DB Ratings) gave a surprise
• Disqualified (did not use only the training/test set): F = 0.6987
• Precision: 0.9994 * 5000 = 4997 – three users have a target set of size 4
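The SCS_CUNI scoring line above can be read as ranking all movies by one weighted sum and recommending the same five to every user. A minimal sketch, assuming Spielberg and Original are 0/1 flags and BayesAVG is a precomputed average rating. The toy rows and the helper names are hypothetical, not the challenge data or the authors' code, and the sketch does not filter out movies a user has already rated.

```python
def scs_cuni_score(movie):
    """Weighted sum from the slide: 100*Spielberg + 50*Original + BayesAVG."""
    return 100 * movie["Spielberg"] + 50 * movie["Original"] + movie["BayesAVG"]


def top_k_for_all_users(movies, user_ids, k=5):
    """Give every user the same k best-scoring movies (domain-specific heuristic)."""
    ranked = sorted(movies, key=scs_cuni_score, reverse=True)[:k]
    return [(user, m["MovieID"], scs_cuni_score(m))
            for user in user_ids for m in ranked]


# Toy data (hypothetical values, for illustration only).
toy_movies = [
    {"MovieID": 1, "Spielberg": 1, "Original": 0, "BayesAVG": 3.9},
    {"MovieID": 2, "Spielberg": 0, "Original": 1, "BayesAVG": 4.5},
    {"MovieID": 3, "Spielberg": 0, "Original": 0, "BayesAVG": 4.8},
    {"MovieID": 4, "Spielberg": 1, "Original": 1, "BayesAVG": 3.0},
    {"MovieID": 5, "Spielberg": 0, "Original": 0, "BayesAVG": 2.0},
    {"MovieID": 6, "Spielberg": 0, "Original": 1, "BayesAVG": 1.0},
]
toy_rows = top_k_for_all_users(toy_movies, ["u1", "u2"])
```

Each row has the shape (userId, movieId, score) used by the submission format. The large weights make the Spielberg flag dominate, then Original, with BayesAVG acting only as a tie-breaker.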
6. Transferable methods, data aggregations
• GenreMatch (genres in the user's ratings versus movie genres) and a decision tree with drastic pruning
• KTIML: data mining combined with first order logic, F = 0.10085 (our second)
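The slides do not define GenreMatch precisely, so the following is only one plausible reading, stated as an assumption: build a per-user profile with the relative frequency of each genre among the movies the user rated, then score a candidate movie by the mean profile value over its genres. The names `genre_profile` and `genre_match` are hypothetical.

```python
from collections import Counter


def genre_profile(rated_movie_genres):
    """Relative frequency of each genre among the movies a user rated.

    `rated_movie_genres` is a list with one genre list per rated movie.
    """
    n = len(rated_movie_genres)
    if n == 0:
        return {}
    counts = Counter(g for genres in rated_movie_genres for g in set(genres))
    return {genre: count / n for genre, count in counts.items()}


def genre_match(profile, movie_genres):
    """Mean profile weight over the candidate movie's genres (assumed definition)."""
    genres = set(movie_genres)
    if not genres:
        return 0.0
    return sum(profile.get(g, 0.0) for g in genres) / len(genres)
```

An aggregate of this kind yields one numeric column per (user, movie) pair, which is a form a decision tree learner can consume directly.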
Rule preference  Rule
0.11             R1: GoodProperty=1
0.25             R2: 113.5 < CNT < 400
0.29             R3: R1 and R2
0.58             R4: GoodProperty=0 & CNT > 399
0.57             R5: GoodProperty=1 & CNT > 399
7. Oracle DB Data Miner
8. Second order logic GAP rules
• DB aggregations – second order logic
• “Simple” queries can be transformed to rules, e.g.
SELECT UserID, MovieID, 5 FROM Ordered_Prediction WHERE OrdNr <= 5; …
… 100*Movies.Spielberg + 50*Movies.Original + Movies.BayesAVG
• corresponds to the GAP rule
• SCS_CUNI_Movie(u,m) : 100*x1 + 50*x2 + x3 ←
• SPIELBERG(m) : x1 & ORIGINAL(m) : x2 & BAYESAVG(m) : x3
• Semantics so far:
• 2GAP – facts extended by atomic predicates corresponding to tables resulting from database aggregations, e.g. SPIELBERG(m), ORIGINAL(m), BAYESAVG(m)
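The correspondence between the query and the GAP rule can be illustrated with a small evaluator. This is a simplified sketch under stated assumptions: annotated facts are stored as a dictionary from predicate name to per-movie annotation, a rule fires only when every body atom has an annotation, and the head annotation is the rule's function applied to the body annotations. The fact values and the name `eval_gap_rule` are hypothetical, and full GAP semantics (annotation ordering, fixpoint computation) is not modelled.

```python
def eval_gap_rule(facts, movie, body, combine):
    """Return the head annotation for `movie`, or None if a body atom is missing."""
    annotations = []
    for predicate in body:
        value = facts.get(predicate, {}).get(movie)
        if value is None:
            return None  # body not satisfied, the rule does not fire
        annotations.append(value)
    return combine(*annotations)


# Annotated facts as they might come out of database aggregations (toy values).
facts = {
    "SPIELBERG": {10: 1, 20: 0, 30: 1},
    "ORIGINAL": {10: 0, 20: 1, 30: 1},
    "BAYESAVG": {10: 4.0, 20: 3.5, 30: 2.5},
}
body = ("SPIELBERG", "ORIGINAL", "BAYESAVG")


def combine(x1, x2, x3):
    """Annotation function of the head: 100*x1 + 50*x2 + x3."""
    return 100 * x1 + 50 * x2 + x3


# Head annotations SCS_CUNI_Movie(u, m); the rule body does not depend on u.
head = {m: eval_gap_rule(facts, m, body, combine) for m in facts["BAYESAVG"]}

# Counterpart of "WHERE OrdNr <= 5": keep the five best-ordered movies.
ordered = sorted(head, key=head.get, reverse=True)
top5 = [(movie, ord_nr) for ord_nr, movie in enumerate(ordered, start=1) if ord_nr <= 5]
```

Because the body mentions only m, every user u receives the same ordering, which matches the "same set for all users" character of this method.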
9. Conclusions
• Data too big for rule induction tools – all processing in a relational DB
• Transformation via NLP extraction. Clustering and importance of
attributes
• Database aggregation – CNT, AVG, …
• “simple” rules (in a second order logic GAP)
• Rules give explanations that are intuitive for humans
• Precision – in the ideal case we gave 75% of users at least one correct recommendation
• Future work – distribution of learning quality across users (not only AVG)