Injustice - Developers Among Us (SciFiDevCon 2024)
Discovery
1. The Guide to Predictive Analytics
A FINDERBOTS.COM
PRODUCTION
DISCOVERY
2. FINDERBOTS.COM
• Independent Consulting Service
• Specialize in Big-data Predictive Analytics
• Recommenders
• Personalized discovery
• Search optimization and personalization
• Committer to open source machine learning projects
(Apache Mahout, Finderbots Solr-recommender)
Pat Ferrel
pat@finderbots.com
3. DISCOVERY:
• Browse
• editorial categories
• user generated content—tags, hashtags, comments, likes, shares
• realtime predictive analytics driven “concepts”
• Search
• keywords are not enough
• inferred keywords (from usage data)
• personalized search (from collaborative filtering data, just like Google)
• Recommendations
• profile based, content based, usage based
• entire catalog can be skewed by predictive analytics
• required
• why?
4. DISCOVERY:
• Browse
• editorial categories
• user generated content—tags, hashtags, comments, likes, shares
• realtime predictive analytics driven “concepts”
Netflix: 80% of views
• Search
• keywords are not enough
• inferred keywords (from usage data)
Amazon: 60% of sales
• personalized search (from collaborative filtering data, just like Google)
• Recommendations
Yahoo News: 40% increase in TOS
• profile based, content based, usage based
• entire catalog can be skewed by predictive analytics
• required
• why?
Better Discovery = Better Engagement
6. RECOMMENDATIONS CAN DO
WHAT SEARCH CANNOT
• Search for “leather laptop bag”
• Hmm, some are ok but not quite right
• Put some in “wishlist”
• Look at recommendations
• Add and remove as you like…
…things improve!
• Never knew I wanted a
“Messenger bag with a leather strap”
• Didn’t know what one was
so would never have searched for it
7. SEARCH THAT KNOWS WHAT
THE USER MEANS
• Search for “leather laptop bag”
• Buy “leather messenger bag with leather strap”
• With the right usage data we can infer “messenger bag” =
“laptop bag”
• Now the words I know will get me the object I want, even though I didn’t know how to ask for it
8. THE CUTTING EDGE IN
PREDICTIVE ANALYTICS
• Uses any number of user actions—entire user clickstream
• Uses metadata—from user profile or item
• Uses context—on-site, time, location
• Uses content—unstructured text or semi-structured
• Personalizes recommendations even when content-based
• Mixes any number of “indicators” to increase quality or tune to
specific context
• Solves the “cold-start” problem—items with too short a lifespan
• Can recommend to new users in realtime
• Improves Search
• Personalizes Search
9. THE GOOD NEWS
• 90% of these features come from 3
technologies
• Search engine (Solr, Elasticsearch)
• Mahout
• Spark
• 90% of the flexibility comes at runtime
via query—not from new analytical
models.
11. ARCHITECTURE
[Figure: architecture diagram. Labels: application; action logging to HDFS; action logs; Spark running Mahout 1.0 spark-itemsimilarity (cooccurrence indicators) and Mahout 1.0 spark-rowsimilarity (content or metadata = intrinsic indicators); scalable store (HDFS or DB); catalog creation and editing; indicators indexed in the search engine; query; realtime and background paths.]
12. ANATOMY OF A
RECOMMENDATION
r = hp[PtP]
r = recommendations
hp = a user’s history of some primary action (purchase for instance)
P = the history of all users’ primary action; rows are users, columns are items
[PtP] = compares column to column using log-likelihood based cooccurrence
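The formula r = hp[PtP] can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item names and the matrix P are made-up toy data, and raw cooccurrence counts stand in for the log-likelihood based scores the slide describes.

```python
# Toy data (hypothetical): rows are users, columns are items, 1 = purchased.
ITEMS = ["laptop bag", "messenger bag", "leather strap", "phone case"]
P = [
    [1, 1, 0, 0],  # user 0
    [1, 1, 1, 0],  # user 1
    [0, 1, 1, 0],  # user 2
    [0, 0, 0, 1],  # user 3
]


def cooccurrence(a, b):
    """Return A^T B: entry [i][j] counts users with column i of A and column j of B."""
    return [
        [sum(ra[i] * rb[j] for ra, rb in zip(a, b)) for j in range(len(b[0]))]
        for i in range(len(a[0]))
    ]


def recommend(history, indicator):
    """Return the row vector history * indicator, one score per item."""
    return [
        sum(history[i] * indicator[i][j] for i in range(len(history)))
        for j in range(len(indicator[0]))
    ]


PtP = cooccurrence(P, P)
for i in range(len(PtP)):
    PtP[i][i] = 0  # assumption: an item should not recommend itself

hp = [1, 0, 0, 0]  # a user who bought only the laptop bag
r = recommend(hp, PtP)
ranked = sorted(zip(ITEMS, r), key=lambda pair: pair[1], reverse=True)
```

With this toy data the messenger bag ranks first, because two of the users who bought the laptop bag also bought it.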
13. THE UNIVERSAL
RECOMMENDER
• Virtually all collaborative filtering type
recommenders can use only one indicator of
preference—one action
r = hp[PtP]
• But the theory doesn’t stop there
r = hp[PtP] + hv[VtP] + hc[CtP] + …
• Virtually all user actions can be used to improve
recommendations—purchase, view, category
view…
14. A COOCCURRENCE
INDICATOR
• [PtP] is an indicator matrix for some primary action
like purchase
• Rows = users, columns = items, boolean data
• Compares cooccurring interactions using the log-likelihood ratio (LLR), a column-wise similarity
• LLR finds important cooccurrences and filters out
the rest
• Comparing the history of the primary action to
other actions finds the secondary actions that lead
to the primary—the effect is to scrub secondary
actions of non-meaningful ones
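A minimal sketch of the log-likelihood ratio test for one pair of items, written in the entropy-based form of Dunning's test. The function names and the argument naming (k11 = users with both interactions, k12 and k21 = users with only one of them, k22 = users with neither) are assumptions made for this illustration.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))
```

A score near zero means the two interactions cooccur about as often as chance would predict, so that pair can be filtered out; a large score marks an important cooccurrence. The cutoff between the two is a tuning choice that the slides do not specify.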
15. CROSS-COOCCURRENCE
INDICATORS
hi = a user’s history of an action
P, V, C = the history of all users for some action (purchase, view, category view)
[XtP] = the pairwise comparison of column to column; the comparison may be across two actions (X = V or C) but is always anchored by the primary action P
r = hp[PtP] + hv[VtP] + hc[CtP] + …
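The sum above can be sketched as follows, assuming toy boolean matrices P (purchases) and V (views) over the same three items. All data and function names here are hypothetical, and raw cooccurrence counts again stand in for LLR-filtered indicators.

```python
def transpose_times(x, p):
    """Return X^T P: entry [i][j] counts users with action column i in X and purchase column j in P."""
    return [
        [sum(rx[i] * rp[j] for rx, rp in zip(x, p)) for j in range(len(p[0]))]
        for i in range(len(x[0]))
    ]


def row_times(h, m):
    """Return the row vector h * M."""
    return [sum(h[i] * m[i][j] for i in range(len(h))) for j in range(len(m[0]))]


def recommend(histories_and_actions, primary):
    """Sum h_x [X^T P] over every (user history, action matrix) pair."""
    scores = [0] * len(primary[0])
    for history, action in histories_and_actions:
        part = row_times(history, transpose_times(action, primary))
        scores = [s + q for s, q in zip(scores, part)]
    return scores


# Rows are users, columns are items.
P = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]  # purchases
V = [[1, 1, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1]]  # views

hp = [0, 0, 0]  # this user has bought nothing yet
hv = [0, 1, 0]  # but has viewed item 1
r = recommend([(hp, P), (hv, V)], P)
```

Here the purchase history alone yields no scores, while the view history still produces a ranking, which is the practical payoff of cross-cooccurrence.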
16. CROSS-COOCCURRENCE
SO WHAT?
• The user’s entire clickstream can be used
• Items clicked
• Terms searched
• Categories viewed
• Items shared
• People followed
• Items liked or disliked
• Video watched
• Virtually any action the user can take makes it easier to predict what they will like in the future.
17. FROM INDICATOR TO
RECOMMENDATION
r = hp[PtP]
• This actually means to take the user’s history hp and
compare it to rows of the indicator matrix [PtP]
• TF-IDF weighting of indicators would be nice to mitigate
popular items
• Query the indicator with user history
• Sort these by similarity strength and keep only the highest
—you have recommendations
• Sound familiar?
• That is exactly what a search engine does
—except for calculating indicators
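The search-engine analogy can be made concrete with a small sketch: each item is a "document" whose terms are the ids of its indicator items, and the user's history is the query. The indicator data, the function names, and the particular IDF formula are assumptions for illustration; a real search engine uses its own scoring function.

```python
import math

# Hypothetical indicator "documents": item -> ids of items that cooccur with it.
INDICATORS = {
    "messenger bag": ["laptop bag", "leather strap"],
    "laptop sleeve": ["laptop bag"],
    "phone case": ["phone", "laptop bag"],
    "leather strap": ["messenger bag"],
}


def idf(term, docs):
    """Down-weight terms that appear in many documents (popular items)."""
    df = sum(1 for terms in docs.values() if term in terms)
    return math.log(1 + len(docs) / df) if df else 0.0


def query(history, docs, top_n=2):
    """Score each document by the IDF-weighted overlap with the user's history."""
    scores = {}
    for item, terms in docs.items():
        if item in history:
            continue  # do not recommend what the user already has
        score = sum(idf(t, docs) for t in history if t in terms)
        if score > 0:
            scores[item] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


recs = query(["laptop bag", "leather strap"], INDICATORS)
```

The messenger bag wins because it matches both history items, and the rarer "leather strap" term carries more weight than the popular "laptop bag" term.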
18. INDICATOR TYPES
• Cooccurrence and cross-cooccurrence
• Calculated from user actions as discussed
• Create with Mahout 1.0 spark-itemsimilarity
• Content or metadata
• Tags, categories, description text, anything describing an item
• Create with Mahout 1.0 spark-rowsimilarity
• Intrinsic
• Tags, genres, categories, popularity rank, geo-location, anything
describing an item
• Some may be derived from usage data like popularity rank, or hotness
• Is a known or specially calculated property of the item
19. CONTENT INDICATORS
• Finds similar items based on their content—not which users preferred them
• Examples: text descriptions, tags, categories, genres
r = ht[TTt]
r = recommended items, based on tags
ht = a user’s history of an action on items with
tags
[TTt] = item similarity based on similar tags—a content indicator
• This personalizes even content based recommendations
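A minimal sketch of r = ht[TTt], assuming boolean tags so that an entry of [TTt] is simply the number of tags two items share. The catalog, tags, and function names are hypothetical, and the raw overlap count is a stand-in for whatever similarity measure the indicator job actually computes.

```python
# Hypothetical catalog: item -> set of tags.
ITEM_TAGS = {
    "messenger bag": {"leather", "bag", "strap"},
    "laptop bag": {"leather", "bag"},
    "wallet": {"leather"},
    "tent": {"outdoor"},
}


def tag_similarity(a, b):
    """Entry of T T^t for boolean tags: the number of tags shared by items a and b."""
    return len(ITEM_TAGS[a] & ITEM_TAGS[b])


def recommend_by_tags(history):
    """Score every unseen item by its tag overlap with the items in the user's history."""
    scores = {}
    for candidate in ITEM_TAGS:
        if candidate in history:
            continue
        score = sum(tag_similarity(seen, candidate) for seen in history)
        if score:
            scores[candidate] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


recs = recommend_by_tags(["messenger bag"])
```

The result depends on the user's own history, which is why even a content-based indicator gives personalized recommendations.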
20. INTRINSIC INDICATORS
• Attributes of items
• Genre, subject, category, tags
• Specially calculated based on business rules
• Popularity, hotness
• Based on demographics
• Preferred by people using mobile access
• Preferred by city dwellers
• Preferred by people in warmer climes
• Query by value—not user history
r = v*I
21. THE UNIVERSAL
RECOMMENDER
“Unified” means one query on all indicators at once
r = hp[PtP] + hv[VtP] + hc[CtP] +
ht[TTt] + l*L …
Unified query:
query: users-history-of-purchases; field: purchase
query: users-history-of-views; field: view
query: users-history-of-categories-viewed; field: category
query: users-history-of-purchases; field: tags
query: users-location; field: geo-location-preferred
…
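One way to picture the unified query is as a single request body with one clause per indicator field. The sketch below builds such a body in the shape of an Elasticsearch bool query with `should` clauses. The field names come from the list above; the function name, its parameters, and the choice of `terms` clauses are assumptions, and the right clause type depends on how the indicator fields are indexed and scored.

```python
def unified_query(purchases, views, categories, location=None, size=10):
    """Build one search request that queries every indicator field at once."""
    clauses = [
        ("purchase", purchases),
        ("view", views),
        ("category", categories),
        ("tags", purchases),  # content indicator, queried with purchase history
        ("geo-location-preferred", [location] if location else []),
    ]
    # Skip indicators for which the user has no history or value yet.
    should = [{"terms": {field: values}} for field, values in clauses if values]
    return {"size": size, "query": {"bool": {"should": should}}}


body = unified_query(
    purchases=["item-17"],   # hypothetical item ids
    views=[],
    categories=["bags"],
    location="portland",
)
```

Because empty histories are dropped, the same function serves a new user with only a location and a long-time user with a full clickstream.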
22. ONE OR MANY
• One query—one trip to one scalable search
engine
• Many flavors—customize in the query
• Customize for content context
• Customize for user context
• Profile, location, time, …
• Customize for special indicators
• Trending, hot, new, popular
• All personalized
23. POLISH THE APPLE
• Auto-optimize via explore-exploit (important):
Randomize some returned recs; if they are acted upon, they become part of the
new training data and are more likely to be recommended in the future
• Visibility control:
• Don’t show duplicates, or show duplicates at some rate
• Filter items the user has already seen
• Generate some intrinsic indicators like hotness, popularity—helps
solve the “cold-start” problem
• Asymmetric train vs query management—for instance query with
most recent actions, train on all ingested
• On-demand cross-validation scoring for tuning purposes
• A/B testing integration with explore-exploit
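The explore-exploit idea can be sketched as a post-processing step on the ranked list. This is a simple randomized swap, not necessarily the scheme the author uses; the function name, the `explore_rate` parameter, and the sample catalog are assumptions for illustration.

```python
import random


def explore_exploit(ranked, candidates, explore_rate=0.1, rng=None):
    """Replace each slot, with probability explore_rate, by a random item not already ranked."""
    rng = rng or random.Random()
    pool = [c for c in candidates if c not in ranked]
    result = []
    for item in ranked:
        if pool and rng.random() < explore_rate:
            # Explore: show an item the model would not have picked.
            result.append(pool.pop(rng.randrange(len(pool))))
        else:
            # Exploit: keep the model's recommendation.
            result.append(item)
    return result


ranked = ["a", "b", "c"]
catalog = ["a", "b", "c", "d", "e", "f"]
shown = explore_exploit(ranked, catalog, explore_rate=0.2, rng=random.Random(42))
```

If a user acts on one of the randomly inserted items, that action is logged like any other and feeds the next round of indicator training.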