1. Recommender systems to help people move forward
RecSysNL meetup
Oct 17, 2018
Martijn Willemsen
M.C.Willemsen@tue.nl
PI of the recommender LAB
http://www.martijnwillemsen.nl/
@mcwillemsen
Decision Making, Process Tracing, Cognition, Recommender Systems, Online Behavior, e-Coaching, Data Science
2. My lab has a strong user-centric RecSys focus… why?
• Because I failed as a (real) engineer?
[Career timeline: Electrical Engineering (MSc in EE); 2nd BSc and MSc in Technology and Society; PhD in Decision Making; PostDoc in Process Tracing; Recommender Systems]
3. My lab has a strong user-centric RecSys focus… why?
• Because it is easier to get papers into the RecSys conference?
4. Surely not….
Because just optimizing accuracy is not enough…
REVEAL Workshop, RecSys 2018 (Joe Konstan):
“…if the recommender systems we are building are trained to predict the very items that users found by themselves without recommendation (yes, I’m looking at you Precision_at_k), then the usefulness of the recommender becomes very debatable.”
https://medium.com/@olivier.koch/recsys-2018-recommender-systems-that-care-16389e43114c
5. And optimizing for engagement/behavior is tricky…
Netflix trades off popularity, diversity and accuracy
AB tests of ranking between and within rows
Source: RecSys 2016, 18 Sept: Talk by Xavier Amatriain
http://www.slideshare.net/xamat/past-present-and-future-of-recommender-systems-and-industry-perspective
6. We don’t need the user: Let’s do AB Testing!
Netflix used 5-star rating scales to get input from users (apart from log data)
Netflix reported an AB test of thumbs up/down versus star ratings
Yellin (Netflix VP of product): “The result was that thumbs got 200% more ratings than the traditional star-rating feature.”
So is the 5-star rating wrong? Or just different information? Should we only trust the behavior?
However, over time, Netflix realized that explicit star ratings were less relevant than other signals. Users would rate documentaries with 5 stars, and silly movies with just 3 stars, but still watch silly movies more often than those high-rated documentaries.
http://variety.com/2017/digital/news/netflix-thumbs-vs-stars-1202010492/
7. Behavior versus Experience
Looking at behavior…
• Testing a recommender against a random videoclip system, the number of clicked clips and total viewing time went down!
Looking at user experience…
• Users found what they liked faster, with fewer ineffective clicks…
Behaviorism is not enough! (Ekstrand & Willemsen, RecSys 2016)
We need to measure user experience and relate it to user behavior…
We need to understand user goals and develop recommender systems that help users attain these goals!
8. User-Centric Research can help us understand…
• How users perceive recommender algorithms (Ekstrand et al. 2014)
• WHY users are satisfied… even when we reduce accuracy by diversification (Willemsen et al. 2016)
• What inaction (non-behavior) means… (Zhao et al. 2018)
Or even help build systems that help people move forward:
• Energy saving recommendations (Starke et al. 2017)
• Music Genre Explorer (Liang and Willemsen, 2018)
12. User-Centric Framework
Our framework adds the intermediate construct of perception, which explains why behavior and experience change due to our manipulations
13. User-Centric Framework
• And adds personal and situational characteristics
• Relations modeled using factor analysis and SEM (see the sketch below)
Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., & Newell, C. (2012). Explaining the User Experience of Recommender Systems. User Modeling and User-Adapted Interaction (UMUAI), vol. 22, pp. 441-504. http://bit.ly/umuai
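To illustrate how such relations can be modeled, here is a minimal sketch of a measurement model plus structural paths, assuming the third-party semopy package and hypothetical questionnaire columns (q_div1…q_sat3, condition); it is not the paper's actual model:

```python
import pandas as pd
from semopy import Model  # third-party SEM package (assumption)

MODEL_DESC = """
# measurement model: latent constructs from questionnaire items
PerceivedDiversity =~ q_div1 + q_div2 + q_div3
Satisfaction       =~ q_sat1 + q_sat2 + q_sat3
# structural model: perception mediates the effect of the manipulation
PerceivedDiversity ~ condition
Satisfaction ~ PerceivedDiversity + condition
"""

data = pd.read_csv("survey_responses.csv")  # hypothetical data file
model = Model(MODEL_DESC)
model.fit(data)
print(model.inspect())  # path estimates, standard errors, p-values
```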
14. User Perceptions of Differences in Recommender Algorithms
Joint work with GroupLens: Michael Ekstrand, Max Harper and Joseph Konstan
Ekstrand, M.D., Harper, F.M., Willemsen, M.C., & Konstan, J.A. (2014). User Perception of Differences in Recommender Algorithms. In Proceedings of the 8th ACM Conference on Recommender Systems (pp. 161-168). New York, NY, USA: ACM.
15. Going beyond accuracy…
McNee et al. (2006): Accuracy is not enough
“study recommenders from a user-centric perspective to make them not only accurate and helpful, but also a pleasure to use”
But wait! We don’t even know how the standard algorithms are perceived… and what differences there are…
Compare 3 classic algorithms (Item-Item, User-User and SVD) side by side (joint evaluation) in terms of preference and perceptions
16. The task provided to the user
[Task flow: first impression → perceived diversity & novelty and satisfaction → choice of algorithm]
17. First look at the measurement model
• Only the measurement model relating the concepts (no conditions)
• All concepts are relative comparisons
– e.g. if they think list A is more diverse than B, they are also more satisfied with list A than with B
[Path diagram with constructs labeled SSA (subjective system aspects) and EXP (experience), connected by interactions (INT)]
18. What algorithms do users prefer?
528 users completed the questionnaire
Joint evaluation, 3 pairs comparing A with B
User-User CF significantly loses to the other two; Item-Item and SVD are on par
Why?
– User-user more novel than either SVD or item-item
– User-user more diverse than SVD
– Item-item slightly more diverse than SVD (but diversity didn't affect satisfaction)
[Bar chart: preference shares (0-100%) in the three pairwise comparisons: I-I v. U-U, I-I v. SVD, SVD v. U-U]
19. Objective measures
No accuracy differences, but consistent with subjective data
RQ2: User-user more novel, SVD somewhat less diverse
20. Aligning objective with subjective measures
Objective and subjective metrics correlate consistently
But their effects on choice are mediated by the subjective perceptions!
(Objective) obscurity only influences satisfaction if it increases perceived novelty (i.e. if it is registered by the user)
21. Conclusions
Novelty is not always good: complex, largely negative effect
Diversity is important for satisfaction
Diversity/accuracy tradeoff does not seem to hold…
Subjective perceptions and experience mediate the effect of objective measures on choice / preference for algorithm
Brings the ‘WHY’: e.g. User-user is less satisfactory and less often chosen because of its obscure items (which are perceived as novel)
22. Choice difficulty and satisfaction in RecSys
Applying latent feature diversification
Willemsen, M.C., Graus, M.P., & Knijnenburg, B.P. (2016). Understanding the role of latent feature diversification on choice difficulty and satisfaction. User Modeling and User-Adapted Interaction (UMUAI), vol. 26 (4), pp. 347-389. doi:10.1007/s11257-016-9178-6
23. Seminal example of choice overload
Satisfaction decreases with larger sets as increased attractiveness is counteracted by choice difficulty
Can we reduce difficulty while controlling attractiveness?
[Jam study: the large assortment is more attractive but yields 3% sales; the small assortment is less attractive but yields 30% sales and higher purchase satisfaction]
From Iyengar and Lepper (2000)
24. Dimensionality reduction
Users and items are represented as vectors on a set of latent features
Rating is the dot product of these vectors (overall utility!)
Gus will like Dumb and Dumber but hate Color Purple
Use the properties of Matrix Factorization algorithms!
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. IEEE Computer, 42(8), 30-37.
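To make the "rating = dot product" idea concrete, here is a minimal runnable sketch with toy data (plain NumPy; purely illustrative, not Netflix's or the paper's implementation):

```python
import numpy as np

# Toy user x item rating matrix (illustrative data only).
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [1.0, 2.0, 4.0, 5.0],
])

# Truncated SVD: keep k latent features.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_vecs = U[:, :k] * np.sqrt(s[:k])       # user positions on latent features
item_vecs = Vt[:k, :].T * np.sqrt(s[:k])    # item positions on latent features

# Predicted rating = dot product of a user vector and an item vector.
pred = user_vecs @ item_vecs.T
print(np.round(pred, 2))  # approximates R using only k latent dimensions
```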
26. Latent Feature Diversification
[Flow: psychology-informed diversity manipulation → increased perceived diversity & attractiveness → reduced difficulty & increased satisfaction]
Results: fewer hovers, more choices for lower-ranked items:
Diversification   Rank of chosen
None (top 5)      3.6
Medium            14.5
High              77.6
[Bar chart: standardized choice satisfaction score per diversification level (none, medium, high)]
Higher satisfaction for high diversification, despite choice of lower predicted/ranked items
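A minimal sketch of one way to diversify on latent features: greedy re-ranking that trades predicted score against spread in the latent space. The selection criterion and weight are assumptions for illustration, not the exact method of Willemsen et al. (2016):

```python
import numpy as np

def diversify(item_vecs, scores, n=5, weight=1.0):
    """Greedy re-ranking: trade predicted score against distance
    to already selected items in latent feature space."""
    selected = [int(np.argmax(scores))]  # start from the top-scored item
    candidates = set(range(len(scores))) - set(selected)
    while len(selected) < n and candidates:
        best, best_val = None, -np.inf
        for c in candidates:
            # smallest latent-space distance to the current selection
            dist = min(np.linalg.norm(item_vecs[c] - item_vecs[s])
                       for s in selected)
            val = scores[c] + weight * dist
            if val > best_val:
                best, best_val = c, val
        selected.append(best)
        candidates.remove(best)
    return selected

# Example with random latent vectors and scores (hypothetical data).
rng = np.random.default_rng(0)
vecs = rng.normal(size=(50, 2))      # 50 items, 2 latent features
scores = rng.uniform(2, 5, size=50)  # predicted ratings
print(diversify(vecs, scores, n=5, weight=0.5))
```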
27. Interpreting User Inaction in Recommender Systems
Zhao, Q., Willemsen, M.C., Adomavicius, G., Harper, F.M., & Konstan, J.A. (2018). Interpreting user inaction in recommender systems. In Proceedings of the 12th ACM Conference on Recommender Systems (pp. 40-48). New York: ACM. DOI: 10.1145/3240323.3240366
28. Action and Inaction in MovieLens.org
[Screenshot of MovieLens actions on a recommended item: rating, adding it to a wishlist, marking it “not interested”, clicking to see details]
30. 7 Categories of User Inaction
• Did not notice it (38.6%)
• Noticed but watched it before (14.6%)
• Noticed and have not watched it yet (46.8%):
– Would not enjoy it (5.8%)
– Others are better (9.5%)
– Okay but not now (18.2%)
– Plan to explore it soon (6.9%)
– Have decided to watch (5.8%)
31. Should MovieLens keep recommending this item to you?
[Bar chart: per inaction category, the share of users who want the item recommended again, ordered from most preferred to least preferred]
32. Based on this survey data we:
Built an inaction classification model (a rough sketch follows below)
• Predictors: item attributes, position, user actions, predicted rating and action probabilities (clicking, rating, adding to a wishlist)
• Best accuracy: 48.5% (majority class is NotNoticed: 39.9%)
Try to improve the recommender system: what to do with inaction items?
• Utilize the inferred 7-class probabilities
• Estimate the (in)action or adjust the recommendation timing…
• Hide an item or delay showing an item!
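A rough sketch of such an inaction classifier, assuming scikit-learn and hypothetical feature/label column names; the paper's exact features and model are not reproduced:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical survey-derived dataset: one row per (user, inaction item).
df = pd.read_csv("inaction_survey.csv")  # assumed file
features = ["position", "predicted_rating", "popularity",
            "p_click", "p_rate", "p_wishlist"]  # assumed columns
X, y = df[features], df["inaction_category"]  # 7-class label

clf = RandomForestClassifier(n_estimators=200, random_state=0)
# Mean cross-validated accuracy; compare to the 39.9% majority baseline.
print(cross_val_score(clf, X, y, cv=5).mean())

# clf.predict_proba(X) would give the 7-class probabilities the slide
# mentions, which a system could use to hide or delay re-showing an item.
```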
33. How recommenders can help users achieve their goals
Research with Alain Starke (PhD student), RecSys 2017
34. Recommending for Behavioral change
• Behavioral change is hard…
– Exercising more, eating healthily, reducing alcohol consumption (or reducing binge watching on Netflix)
– Needs awareness, motivation and commitment
COMBI model: Klein, Mogles & van Wissen, Journal of Biomedical Informatics, 2014
35. What can recommenders do?
• Persuasive Technology focuses on how to help people change their behavior:
– personalize the message…
• Recommender systems can help with what to change and when to act:
– personalize what to do next…
• This requires different models/algorithms
– our past behavior/liking is not what we want to do now!
36. Central question
Can we design a recommender interface which effectively supports a user's energy-saving goals?
37. Regular RecSys approaches, e.g. collaborative filtering, are prone to reinforcing current behavior
• If we want consumers and users to achieve (energy-saving) goals, we should not only focus on past behavior but ‘move forward’ (cf. Ekstrand & Willemsen, 2016)
• We need a model which considers future goals
38. Energy-saving measures can be ordered as increasingly difficult behavioral steps towards attaining the goal of saving energy (Kaiser et al., 2010; Urban & Scasny, 2014)
39. These steps reflect willingness & capacity to save energy: a person's energy-saving ability (Kaiser et al., 2010; Urban & Scasny, 2014)
40. We infer behavioral difficulties based on engagement frequencies
[Input: persons indicate which measures they perform; measures performed by many are easy/popular, measures performed by few are difficult/obscure]
41. In a similar vein, we infer energy-saving abilities
[Same input: persons who perform few measures have low ability, persons who perform many have high ability]
42. One's energy-saving ability is a good starting point to look for appropriate measures
A person has a 50% probability of performing a measure with a difficulty equal to his/her ability
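This 50% crossover is the signature of a Rasch-style item response model. A minimal sketch under that assumption, with ability and difficulty on a shared logit scale (the exact estimation procedure of Kaiser et al. is not reproduced):

```python
import numpy as np

def p_perform(ability, difficulty):
    """Rasch model: probability of performing a measure given
    person ability and measure difficulty (both in logits)."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

def difficulty_from_frequency(freq):
    """Rough difficulty estimate from engagement frequency:
    a measure performed by few people is harder (inverse logistic)."""
    return -np.log(freq / (1.0 - freq))

print(p_perform(0.0, 0.0))    # 0.50: difficulty equals ability
print(p_perform(0.0, -1.0))   # ~0.73: one logit below ability ("easy")
print(p_perform(0.0, +1.0))   # ~0.27: one logit above ability ("difficult")
print(round(difficulty_from_frequency(0.8), 2))  # popular -> low difficulty
```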
44. Study 2: How should advice be tailored to support energy-efficient choices? (And can fit scores help to persuade users to pick more challenging measures?)
45. Web shop interface with three lists (tabs): ‘Base’, ‘Recommended’ and ‘Challenging’
• ‘Recommended’ contains the 15 best-matching measures, with fit scores ranging from 100% to 60%
• ‘Base’ measures are easier, ‘Challenging’ ones more difficult
• Lists are matched on the user's ability: −1, 0 or +1 logit (75%, 50% or 25% likelihood)
• A fit score is shown (or not)
46. 3x2 between-subjects research design
• 3 levels of difficulty, determining the contents of the ‘Recommended’ list:
– Easy / below ability (~75% probability)
– Ability-tailored (~50% probability)
– Difficult / above ability (~25% probability)
• 2 levels of fit score: either shown or not
– The 100% score was consistent with the difficulty condition
– E.g. in the easy condition, measures below a user's ability (75%) had a 100% match score
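Building on the Rasch sketch above, here is one possible way the three tabs and fit scores could be constructed; the probability-to-fit-score mapping is an assumption for illustration, not the study's actual formula:

```python
import numpy as np

def p_perform(ability, difficulty):
    """Rasch probability of performing a measure (logits)."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

def build_list(ability, difficulties, offset=0.0, n=15):
    """Rank measures by closeness to a target difficulty (ability + offset);
    offset = -1 / 0 / +1 logit yields the easy / tailored / difficult
    conditions of the study."""
    target = ability + offset
    recommended = np.argsort(np.abs(difficulties - target))[:n]
    # Illustrative fit score: 100% at the target difficulty, falling off
    # with distance, floored at 60% (assumed mapping, not the paper's).
    fit = np.clip(100 - 40 * np.abs(difficulties[recommended] - target), 60, 100)
    return recommended, fit

rng = np.random.default_rng(1)
diffs = rng.normal(0.0, 1.5, size=60)  # hypothetical measure difficulties
items, fits = build_list(ability=0.3, difficulties=diffs, offset=-1.0)  # "easy"
print(np.round(fits[:5]))                              # fit scores near 100%
print(np.round(p_perform(0.3, diffs[items[:5]]), 2))   # probabilities near 0.73
```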
48. Easy recommendations were perceived as feasible and, in turn, supportive & satisfactory
[Path model: Rec difficulty lowers Perceived Feasibility (−.469***); Perceived Feasibility in turn raises Perceived Support and Choice Satisfaction (positive paths: .506***, .221***, .234***)]
SEM statistics: χ²(140) = 198.693, p < 0.001, CFI = 0.992, TLI = 0.990
*** p < 0.001, ** p < 0.01, * p < 0.05.
49. • Users who felt supported selected more measures
• Satisfied users showed a higher % of follow-up
[Extended path model adding behavioral outcomes: number of chosen items (positive path from Perceived Support, .385***), % of executed items (positive path from Choice Satisfaction) and difficulty of chosen items; −.113** marks a negative path]
*** p < 0.001, ** p < 0.01, * p < 0.05.
50. Users chose slightly more measures when presented with easier ones
(Showing fit scores did not really matter)
51. Fit scores boosted satisfaction levels for easy measures, but backfired for difficult ones
52. Lessons learned
• A satisfactory user interface can lead to the adoption of more energy-saving measures (within the system + after 4 weeks)
• Easy tailored measures seem to be attractive, as they were perceived as feasible and chosen more often
• Fit scores were merely self-reinforcing, not persuasive enough to push users towards ‘more difficult’ goals
53. General conclusions
• Recommender systems are all about good UX
• Taking a psychological, user-oriented approach, we can better account for how users perceive recommendations and reach their goals
• Behaviorism is not enough: an integrated user-centric approach offers many insights/benefits!
• Recommenders should take user goals into account and adapt their algorithms to them…