In this talk at the GDG DataFest event, I gave a hands-on introduction to the main recommender system techniques, including recent Deep Learning-based architectures. Examples were presented using Python, TensorFlow and Google ML Engine, and datasets were provided so we could exercise an article and news recommendation scenario.
6. What can a Recommender System do?
1 - Recommendation: given a user, produce an ordered list matching the user's needs.
2 - Prediction: given an item, what is its relevance for each user?
7. Recommender System Methods
Recommender System
● Content-based filtering
● Collaborative filtering (the most popular)
○ Memory-based filtering: user-based, item-based
○ Model-based filtering, ML-based: Clustering, Association Rules, Matrix Factorization, Neural Networks
● Hybrid filtering: content-based + collaborative
12. Collaborative Filtering
Advantages
● Works for any kind of item (attributes are ignored)
Drawbacks
● Tends to recommend the more popular items
● Cold-start
○ Cannot recommend items not already rated/consumed
○ Needs a minimum number of users to match similar users
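As an illustration of the "match similar users" idea above (not code from the talk), here is a minimal user-based collaborative filtering sketch; the ratings matrix and all names are made up for the example:

```python
from math import sqrt

# Toy user-item rating matrix (rows: users, columns: items); 0 = not rated.
ratings = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
]

def cosine(a, b):
    """Cosine similarity between two users' rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def recommend(user, ratings):
    """Rank the items the user has not rated by similarity-weighted ratings."""
    sims = [cosine(ratings[user], r) if u != user else 0.0
            for u, r in enumerate(ratings)]
    n_items = len(ratings[0])
    scores = {
        i: sum(s * r[i] for s, r in zip(sims, ratings))
        for i in range(n_items) if ratings[user][i] == 0  # unrated items only
    }
    return sorted(scores, key=scores.get, reverse=True)

# User 0 never rated item 2; the most similar user (user 1) did.
recommendations = recommend(0, ratings)
```

Note that nothing about the items themselves is used — only the rating overlap between users, which is exactly why the technique works for any item kind but cannot handle cold-start items.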
17. Content-Based Filtering
Advantages
● Does not depend upon other users
● May recommend new and unpopular items
● Recommendations can be easily explained
Drawbacks
● Overspecialization
● It may be difficult to extract attributes from audio, movies or images
18. Hybrid Recommender Systems
Some approaches...
Composite
Iterates through a chain of algorithms, aggregating their recommendations.
Weighted
Each algorithm has a weight, and the final recommendations are defined by weighted averages.
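The "Weighted" strategy above can be sketched in a few lines; the scores and weights below are invented for the example:

```python
def weighted_hybrid(score_lists, weights):
    """Blend per-item scores from several recommenders by a weighted average
    and return the items ranked by their blended score."""
    blended = {}
    for scores, w in zip(score_lists, weights):
        for item, s in scores.items():
            blended[item] = blended.get(item, 0.0) + w * s
    return sorted(blended, key=blended.get, reverse=True)

content_scores = {"a": 0.9, "b": 0.2, "c": 0.5}  # content-based filtering
collab_scores  = {"a": 0.1, "b": 0.8, "c": 0.6}  # collaborative filtering

# 0.3 / 0.7 is an arbitrary weighting; in practice the weights are tuned.
ranking = weighted_hybrid([content_scores, collab_scores], weights=[0.3, 0.7])
```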
19. CI&T Deskdrop dataset on Kaggle!
https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop
● 12 months of logs (Mar. 2016 - Feb. 2017)
● ~73k logged user interactions
● ~3k public articles shared on the platform
20. Recommender Systems in Python 101
https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101
26. Why does Deep Learning have potential for RecSys?
1. Feature extraction directly from the content (e.g., image, text, audio)
● Images: CNN
● Text: 1D CNN, RNNs, weighted word embeddings
● Audio/Music: CNN, RNN
27. Why does Deep Learning have potential for RecSys?
2. Heterogeneous data handled easily
3. Dynamic behaviour modeling with RNNs
4. More accurate representation learning of users and items
○ Natural extensions of CF
5. RecSys is a complex domain
○ Deep learning has worked well in other complex domains
31. News Recommender Systems (1/2)
News RS Challenges
1. Sparse user profiling (LI et al., 2011) (LIN et al., 2014) (PELÁEZ et al., 2016)
2. Users' preference shifts (PELÁEZ et al., 2016) (EPURE et al., 2017)
3. Fast-growing number of items (PELÁEZ et al., 2016) (MOHALLICK; ÖZGÖBEK, 2017)
32. News Recommender Systems (2/2)
News RS Challenges
4. Accelerated decay of an item's value (DAS et al., 2007)
Distribution of article age at click time (G1 dataset):
  10%            4 hours
  25%            5 hours
  50% (median)   8 hours
  75%           14 hours
  90%           26 hours
33. CHAMELEON: A Deep Learning Meta-Architecture for News Recommendation
34. A conceptual model of news relevance factors
[Diagram] Factors influencing news relevance:
● News article
○ Static properties: topics, entities, publisher
○ Dynamic properties: recency, popularity
● User
○ Current context: time, location, device, referrer
○ Interests: long-term, short-term
● Global factors
○ Seasonality
○ Breaking events
○ Popular topics
35. CHAMELEON meta-architecture
1. Deals with the item cold-start scenario
2. Learns item representations from article text and metadata
3. Leverages user context (location, time, device) and article context (popularity, recency)
4. Provides session-based recommendations based on users' short-term preferences
5. Supports streaming user clicks (online learning), without the need to retrain on the whole historical dataset
6. Provides a modular structure for news recommendation, allowing its modules to be instantiated by different advanced neural network architectures and methods
36. The CHAMELEON Meta-Architecture for News RS
[Diagram] Two modules:
● Article Content Representation (ACR) — runs when a news article is published. From the content word embeddings of the article text (e.g., "New York is a multicultural city, ...") and the metadata attributes (publisher, category, tags, entities), its Textual Features Representation (TFR) and Metadata Prediction (MP) sub-modules produce the Article Content Embedding.
● Next-Article Recommendation (NAR) — runs when a user reads a news article. The Contextual Article Representation (CAR) sub-module combines the Article Content Embedding with the article context (popularity, recency) and the user context (time, location, device); the Session Representation (SR) sub-module models the active user session; and the Recommendations Ranking (RR) sub-module matches the Predicted Next-Article Embedding against candidate next articles (positive and negative samples drawn from users' past and active sessions) to produce the recommended articles.
37. The CHAMELEON Meta-Architecture for News RS
Next-Article Recommendation (NAR) module
● Provides news article recommendations for each interaction (I) in active user sessions.
● For each recommendation request, the NAR module generates a ranked list of the articles the user is most likely to read next in the given session.
[Diagram] A sessions mini-batch groups interactions from several active sessions (e.g., I1,1 ... I1,5; I2,1, I2,2; I3,1 ... I3,3), each enriched with the user context (time, location, device), the article context (popularity, recency) and the Article Content Embedding.
38. The CHAMELEON Meta-Architecture for News RS
Trains an RNN to predict the next-clicked items (represented by their User-Personalized Contextual Article Embeddings) for each active user session:
RNN input (clicked items):                 I1, I2, I3, I4
Expected RNN output (next-clicked items):  I2, I3, I4, I5
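The input/target pairing above is the standard shift-by-one construction for next-item prediction; as a sketch (not the talk's code):

```python
def session_to_training_pair(clicks):
    """Split a session of clicked articles into RNN input and expected output:
    at each step, the target is the next-clicked item in the same session."""
    return clicks[:-1], clicks[1:]

session = ["I1", "I2", "I3", "I4", "I5"]
rnn_input, rnn_target = session_to_training_pair(session)
# rnn_input  -> ["I1", "I2", "I3", "I4"]
# rnn_target -> ["I2", "I3", "I4", "I5"]
```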
39. The CHAMELEON Meta-Architecture for News RS
Negative sampling strategy — candidate next articles (samples):
● Positive sample: the next article actually read by the user in their session
● Negative samples, drawn from:
○ Unique articles read by users within the last N hours/clicks buffer
○ Unique articles read in other user sessions of the mini-batch
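A minimal sketch of this sampling step (names, buffer contents and the fixed seed are illustrative, not from the talk):

```python
import random

def sample_negatives(positive, session_items, recent_buffer, n_neg=50, seed=42):
    """Draw negative samples from the recently clicked articles buffer,
    excluding the positive (next-clicked) article and the articles already
    seen in the current session."""
    candidates = [a for a in set(recent_buffer)
                  if a != positive and a not in session_items]
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return rng.sample(candidates, min(n_neg, len(candidates)))

recent_buffer = [f"art{i}" for i in range(100)]  # last-N-clicks buffer
negatives = sample_negatives("art7", {"art1", "art2"}, recent_buffer, n_neg=5)
```

Sampling negatives from recently clicked articles (rather than the whole catalog) keeps the candidates plausible — they are fresh, popular items the model must learn to rank below the true next click.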
40. The CHAMELEON Meta-Architecture for News RS
Recommendations Ranking (RR) sub-module
● Eq. 4 - Relevance score of an item for a user session
● Eq. 5 - Cosine similarity
● Eq. 6 - Softmax over the relevance score (HUANG et al., 2013)
● Eq. 7 - Loss function (HUANG et al., 2013)
41. CHAMELEON - NAR module loss function
Recommendation loss function implemented in TensorFlow (code shown on the slide).
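The equations themselves are not reproduced in this transcript. As a dependency-free sketch of the DSSM-style formulation of (HUANG et al., 2013) — relevance as a smoothed cosine similarity, softmax over the positive and negative candidates, loss as the negative log-likelihood of the positive — with the smoothing factor `gamma` being an assumption of this sketch, not a value from the talk:

```python
from math import exp, log, sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors (Eq. 5)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def nar_loss(predicted, positive, negatives, gamma=10.0):
    """Relevance score = gamma * cosine(predicted, candidate) (Eq. 4);
    softmax over the positive + negative candidates (Eq. 6);
    loss = -log P(positive | candidates) (Eq. 7)."""
    scores = [gamma * cosine(predicted, c) for c in [positive] + negatives]
    denom = sum(exp(s) for s in scores)
    return -log(exp(scores[0]) / denom)

predicted = [0.9, 0.1]  # toy Predicted Next-Article Embedding
loss_good = nar_loss(predicted, positive=[0.9, 0.1], negatives=[[0.1, 0.9]])
loss_bad  = nar_loss(predicted, positive=[0.1, 0.9], negatives=[[0.9, 0.1]])
# A prediction close to the positive embedding yields a lower loss.
```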
43. An architecture instantiation of CHAMELEON (1D CNN and LSTM)
[Diagram] Instantiation of the meta-architecture:
● ACR module: the Textual Features Representation sub-module is instantiated as a Convolutional Neural Network over the content word embeddings, with three parallel branches — conv-3 (128), conv-4 (128) and conv-5 (128), each followed by max-pooling — merged by fully connected layers into the Article Content Embedding; the Metadata Prediction sub-module predicts the target article metadata attribute (category).
● NAR module: the Session Representation sub-module is instantiated as an LSTM over the active user session; fully connected layers produce the Predicted Next-Article Embedding, which the Recommendations Ranking sub-module matches against the candidate next articles (positive and negative). The user context here includes platform and device type, and the article context includes popularity and recency.
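The conv-k + max-pooling-over-time pattern above is the classic 1D CNN for text. A dependency-free sketch with toy sizes (the real instantiation uses TensorFlow and 128 filters per branch; the sizes and random filters here are illustrative only):

```python
import random

def conv_branch(embeddings, width, filters):
    """One conv-k + max-pooling-over-time branch: slide a window of `width`
    word vectors over the text, apply each filter to the flattened window,
    and keep the maximum activation per filter."""
    pooled = []
    for f in filters:
        acts = [sum(w * x
                    for w, x in zip(f, [v for vec in embeddings[s:s + width]
                                        for v in vec]))
                for s in range(len(embeddings) - width + 1)]
        pooled.append(max(acts))
    return pooled

rng = random.Random(0)
dim, n_filters = 4, 8  # toy sizes; the slide uses 128 filters per branch
words = [[rng.random() for _ in range(dim)] for _ in range(10)]  # 10 words

# Parallel branches of widths 3, 4 and 5, concatenated into one text vector.
text_vec = []
for width in (3, 4, 5):
    filters = [[rng.gauss(0, 1) for _ in range(width * dim)]
               for _ in range(n_filters)]
    text_vec += conv_branch(words, width, filters)
# len(text_vec) == 3 * n_filters: one pooled activation per filter per branch
```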
44. CHAMELEON Instantiation - Implementation
● This CHAMELEON architecture instantiation was implemented in TensorFlow (available at https://github.com/gabrielspmoreira/chameleon_recsys)
● Training and evaluation were performed on Google Cloud Platform ML Engine
46. Preliminary experiments - Dataset
● Provided by Globo.com (G1), the most popular news portal in Brazil
● Sample from Oct. 1 to 16, 2017, with over 3M clicks distributed in 1.2M sessions from 330K users, who read over 50K unique news articles
https://www.kaggle.com/gspmoreira/news-portal-user-interactions-by-globocom
47. ACR module training
Trained on a dataset with 364K articles from 461 categories to generate the Article Content Embeddings (vectors with 250 dimensions).
[Figures] t-SNE visualization of the trained Article Content Embeddings (from the top 15 categories); distribution of articles by the top 200 categories.
48. NAR module evaluation
Temporal offline evaluation method:
1. Train the NAR module with sessions within the active hour
2. Evaluate the NAR module with sessions within the next hour
Task: for each item within a session, predict the next-clicked item from a set composed of the positive sample (the correct article) and 50 negative samples.
Metrics:
● Recall@5 - Checks whether the positive item is among the top-5 ranked items
● MRR@5 - Ranking metric that assigns higher scores to hits at top ranks.
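The two metrics can be sketched directly from their definitions (the ranking below is a made-up example):

```python
def recall_at_k(ranked, positive, k=5):
    """1 if the positive item appears among the top-k ranked items, else 0."""
    return int(positive in ranked[:k])

def mrr_at_k(ranked, positive, k=5):
    """Reciprocal rank of the positive item if it is in the top-k, else 0;
    a hit at rank 1 scores 1.0, at rank 2 scores 0.5, and so on."""
    if positive in ranked[:k]:
        return 1.0 / (ranked.index(positive) + 1)
    return 0.0

# Toy ranking of the positive article among negative samples.
ranked = ["neg_a", "neg_b", "pos", "neg_c", "neg_d", "neg_e"]
# recall_at_k(ranked, "pos") -> 1 (rank 3 is within the top 5)
# mrr_at_k(ranked, "pos")    -> 1/3
```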
49. NAR module evaluation
Benchmark methods for session-based recommendations:
Neural network methods
1. GRU4Rec - Seminal neural architecture using RNNs for session-based recommendations (Hidasi, 2016), with the improvements of (Hidasi, 2017) (v2).
Frequent pattern methods
2. Co-occurrence - Recommends articles commonly viewed together with the last read article in other user sessions (a simplified version of the association rules technique, with a maximum rule size of two) (Jugovac, 2018) (Ludewig, 2018)
3. Sequential Rules (SR) - A more sophisticated version of association rules that considers the sequence of clicked items within the session. A rule is created when an item q appeared after an item p in a session, even when other items were viewed between p and q. The rules are weighted by the distance x (number of steps) between p and q in the session, with a linear weighting function (Ludewig, 2018)
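The Sequential Rules baseline can be sketched as follows; the decay w(x) = 1/x used here is one possible choice (the cited paper evaluates several weighting functions), and the sessions are made up:

```python
from collections import defaultdict

def build_sequential_rules(sessions):
    """For every pair (p, q) where q was clicked after p in a session, add a
    rule p -> q weighted by the number of steps between them, so closer pairs
    contribute more. w(x) = 1/x is one choice of decay."""
    rules = defaultdict(lambda: defaultdict(float))
    for session in sessions:
        for i, p in enumerate(session):
            for dist, q in enumerate(session[i + 1:], start=1):
                rules[p][q] += 1.0 / dist
    return rules

sessions = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
rules = build_sequential_rules(sessions)
# For last click "a": c scores 1/2 (from a,b,c) + 1 (from a,c) = 1.5; b scores 1.
next_items = sorted(rules["a"], key=rules["a"].get, reverse=True)
```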
50. NAR module evaluation
Benchmark methods for session-based recommendations:
kNN methods
4. Item-kNN - Returns the items most similar to the last read article, in terms of the cosine similarity between their session vectors, i.e. the number of co-occurrences of two items in sessions divided by the square root of the product of the numbers of sessions in which the individual items occur.
5. Vector Multiplication Session-Based kNN (V-SkNN) - Compares the entire active session with past sessions to find items to recommend. The comparison emphasizes the items clicked more recently within the session when computing the similarities with past sessions (Jannach, 2017) (Jugovac, 2018) (Ludewig, 2018)
Other baselines
6. Recently Popular - Recommends the most viewed articles from the last N clicks buffer
7. Content-Based - For each article read by the user, recommends similar articles from the last N clicks buffer, based on the cosine similarity of their Article Content Embeddings.
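The Item-kNN similarity defined in item 4 — co-occurrences over the geometric mean of session counts — is short enough to write out directly (the sessions below are invented for the example):

```python
from math import sqrt

def item_knn_sim(sessions, a, b):
    """Cosine similarity over binary item-session vectors: the number of
    sessions containing both a and b, divided by the square root of the
    product of the numbers of sessions containing each item."""
    count_a = sum(a in s for s in sessions)
    count_b = sum(b in s for s in sessions)
    both = sum(a in s and b in s for s in sessions)
    return both / sqrt(count_a * count_b) if count_a and count_b else 0.0

sessions = [{"x", "y"}, {"x", "y", "z"}, {"x", "z"}]
# x and y co-occur in 2 sessions; x appears in 3, y in 2 -> 2 / sqrt(6)
```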
51. NAR module evaluation
Experiment #1
Continuous training and evaluation over 15 days (Oct. 1-15, 2017)
[Chart] Average MRR@5 by hour (evaluation every 5 hours), for the 15-day period
52. NAR module evaluation
Experiment #1
Continuous training and evaluation every five hours, over 15 days (Oct. 1-15, 2017)
13% relative improvement on MRR@5
[Chart] Distribution of average MRR@5 by hour (sampled for evaluation), for the 15-day period
53. NAR module evaluation
Experiment #2
Continuous training and evaluation every hour, on the subsequent day (Oct. 16, 2017)
[Chart] Average MRR@5 by hour, for Oct. 16, 2017
55. ACM RecSys
RecSys 2018 - Vancouver, Canada
● 12th edition
● 6 days of tutorials, main conference and workshops
● Over 800 attendees
○ 73% from industry
○ Top companies: Amazon, Google, Spotify
● 28% paper acceptance rate
https://recsys.acm.org