SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Downloaden Sie, um offline zu lesen
Modern recommender system in
large content website
Cyrus Chiu @ PyConTW 2020
1
Youtube - Landing Page
2
Cyrus Chiu
About
www.linkedin.com/in/iamcyruschiu
3
a ML engineer at a financial (technology) company
Use case in PIXNET
User2Article (personal)
Article2Tag (non-personal)
Article2Article (both)
Tag2Tag (non-personal)
Tag2Article (personal)
4
How we building a recommender system
Basic
• Data infrastructure

• Data collection

• User tracking

• Some rules or model 

• Position on the webpage
Vehicle
Say: segments
5
How we building a recommender system
Advanced
• Data lifecycle

• Scalability

• User profiling

• Item profiling

• Item retrieving 

• Item ranking
Tech
Vehicle
User / Item
Information
6
Retrieving
Ranking
Recommendation
Data Warehouse
What is 2–phase recommender system
Vehicle Food Pet
Down scaling millions
of item to hundreds for
each user
Solving the diversity
problem by using hybrid
retrieval strategy.
• Large scale

• Less features

• Simple model

• Speed
• Small scale

• More features

• Sophisticated model

• Performance
Millions
Hundreds
Tens
7
Personalized
Item Retrieving
…
Phase 1: Retrieving Reading Sequence
Recommender
Ordered list
of items
8
Define the similarity between item A and B
According to the content
According to the user behavior
9
Strategy Search Engine
Word
Embedding
Item
Embedding
Frequently
Occurring
Together
Hot
Features
• Easy to use
• Multiple query
type
• Another cost to
maintain your
Elasticsearch
cluster
• Unexplainable
• Vector space
• High
computational
cost
• Cold start
problem
• Data sparsity
issue
• Vector space
• High
computational
cost
• Cold start
problem
• Data sparsity
issue
• Explainable
• Easy to
maintain
• Explainable
• Easy to use
• Suitable for
any situations
Recommended
Some Retrieving Methods
10
Phase 2: Ranking
RecommenderRecommender
f ( user, item, context )
Candidate of items
Ordered list
of itemsUser
Information
Item
Information
Context
Information
11
Model Logsitc Regression GBDT + LR Factorization Machine DNN
Features
• Fast
• Highly explainable
• Manually crafting
features
• Limited effective
• Fast
• Highly explainable
• Capability to
generate feature
interaction
• Slow
• Explainable
• Good at learning
feature interactions
• Exhausting
resources
• More slowly
• Unexplainable
• Learning
sophisticated feature
interactions
automatically
• Exhausting more
resources
• Wide & Deep, Deep &
Cross, DeepFM,
xDeepFM, …
Evolution of the prediction model
12
RecSys on DMP
13
Example: Find the pair-wise distance
?
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(X)
Solution 1
Solution 2
from sklearn.metrics.pairwise import pairwise_distances
pairwise_distances(X, metric="cosine")
14
15
import pyspark (1)
X_bc = sc.broadcast(X)
n_part = 100
batch_size = 64
data_size = X.shape[0]
batch_vectors_rdd = sc.parallelize(
[X[i: i+batch_size]
for i in range(0, data_size, batch_size)
],
n_part)
batch_vectors_rdd.mapPartitions(
lambda x: cosine_similarity(x, X_bc.value))
Solution: RDD
X_bc
batch
16
import pyspark (2)
@F.udf(returnType=DoubleType())
def get_distance(i, j):
sim = cosine_similarity(i, j)
return sim
vector_sdf.alias("i").join(
vector_sdf.alias(“j"),

col("i.id") < col("j.id")).select(
F.col("i.id").alias("i"),
F.col("j.id").alias("j"),
get_distance("i.vector", "j.vector").alias(
"cosine_similarity")
)
Solution: DataFrame
I
J = I CROSS JOIN
CosineSimilarity (
Column I.vector,
Column J.vector
)
17
1K 10K 100K
sklearn 0.1 1.5 OOM
pyspark 2 5.1 70
Faiss 0.05 0.6 21
Comparison
Calculating distances between coordinates from a 32*1k/10k/100k matrix

(32 core, 32 GB memory)
https://github.com/facebookresearch/faiss18
Workflows of the data project
Data Management Programming Project Management
• ETL pipeline design

• Data infra design

• Metalayer design

• Data toolkit
• SQL

• Framework design

• Library development

• Parallel computing

• Machine learning
• Git flow

• Review
19
Using Apache Airflow to schedule and monitor workflows
20
Online A/B Testing
21
Youtube - Landing Page
22
Youtube - Video Page
23
Thank You
24

Weitere ähnliche Inhalte

Ähnlich wie Modern recommender system in large content website

MongoDB as a Data Warehouse: Time Series and Device History Data (Medtronic)
MongoDB as a Data Warehouse: Time Series and Device History Data (Medtronic)MongoDB as a Data Warehouse: Time Series and Device History Data (Medtronic)
MongoDB as a Data Warehouse: Time Series and Device History Data (Medtronic)
MongoDB
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create
PyData
 

Ähnlich wie Modern recommender system in large content website (20)

Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
kdd2015
kdd2015kdd2015
kdd2015
 
Big Data at DYNO
Big Data at DYNOBig Data at DYNO
Big Data at DYNO
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
MongoDB as a Data Warehouse: Time Series and Device History Data (Medtronic)
MongoDB as a Data Warehouse: Time Series and Device History Data (Medtronic)MongoDB as a Data Warehouse: Time Series and Device History Data (Medtronic)
MongoDB as a Data Warehouse: Time Series and Device History Data (Medtronic)
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationNeo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
 
Machine Learning with Data Science Online Course | Learn and Build
 Machine Learning with Data Science Online Course | Learn and Build  Machine Learning with Data Science Online Course | Learn and Build
Machine Learning with Data Science Online Course | Learn and Build
 
Paolo Kreth - Persistence layers for microservices – the converged database a...
Paolo Kreth - Persistence layers for microservices – the converged database a...Paolo Kreth - Persistence layers for microservices – the converged database a...
Paolo Kreth - Persistence layers for microservices – the converged database a...
 
Fast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA HardwareFast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA Hardware
 
Neo4j: What's Under the Hood & How Knowing This Can Help You
Neo4j: What's Under the Hood & How Knowing This Can Help You Neo4j: What's Under the Hood & How Knowing This Can Help You
Neo4j: What's Under the Hood & How Knowing This Can Help You
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create
 
AI and Deep Learning
AI and Deep Learning AI and Deep Learning
AI and Deep Learning
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systems
 
Deep feature synthesis
Deep feature synthesisDeep feature synthesis
Deep feature synthesis
 

Kürzlich hochgeladen

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 

Kürzlich hochgeladen (20)

VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodology
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 

Modern recommender system in large content website

  • 1. Modern recommender system in large content website Cyrus Chiu @ PyConTW 2020 1
  • 3. Cyrus Chiu About www.linkedin.com/in/iamcyruschiu 3 a ML engineer at a financial (technology) company
  • 4. Use case in PIXNET User2Article (personal) Article2Tag (non-personal) Article2Article (both) Tag2Tag (non-personal) Tag2Article (personal) 4
  • 5. How we building a recommender system Basic • Data infrastructure • Data collection • User tracking • Some rules or model • Position on the webpage Vehicle Say: segments 5
  • 6. How we building a recommender system Advanced • Data lifecycle • Scalability • User profiling • Item profiling • Item retrieving • Item ranking Tech Vehicle User / Item Information 6
  • 7. Retrieving Ranking Recommendation Data Warehouse What is 2–phase recommender system Vehicle Food Pet Down scaling millions of item to hundreds for each user Solving the diversity problem by using hybrid retrieval strategy. • Large scale • Less features • Simple model • Speed • Small scale • More features • Sophisticated model • Performance Millions Hundreds Tens 7
  • 8. Personalized Item Retrieving … Phase 1: Retrieving Reading Sequence Recommender Ordered list of items 8
  • 9. Define the similarity between item A and B According to the content According to the user behavior 9
  • 10. Strategy Search Engine Word Embedding Item Embedding Frequently Occurring Together Hot Features • Easy to use • Multiple query type • Another cost to maintain your Elasticsearch cluster • Unexplainable • Vector space • High computational cost • Cold start problem • Data sparsity issue • Vector space • High computational cost • Cold start problem • Data sparsity issue • Explainable • Easy to maintain • Explainable • Easy to use • Suitable for any situations Recommended Some Retrieving Methods 10
  • 11. Phase 2: Ranking RecommenderRecommender f ( user, item, context ) Candidate of items Ordered list of itemsUser Information Item Information Context Information 11
  • 12. Model Logsitc Regression GBDT + LR Factorization Machine DNN Features • Fast • Highly explainable • Manually crafting features • Limited effective • Fast • Highly explainable • Capability to generate feature interaction • Slow • Explainable • Good at learning feature interactions • Exhausting resources • More slowly • Unexplainable • Learning sophisticated feature interactions automatically • Exhausting more resources • Wide & Deep, Deep & Cross, DeepFM, xDeepFM, … Evolution of the prediction model 12
  • 14. Example: Find the pair-wise distance ? from sklearn.metrics.pairwise import cosine_similarity cosine_similarity(X) Solution 1 Solution 2 from sklearn.metrics.pairwise import pairwise_distances pairwise_distances(X, metric="cosine") 14
  • 15. 15
  • 16. import pyspark (1) X_bc = sc.broadcast(X) n_part = 100 batch_size = 64 data_size = X.shape[0] batch_vectors_rdd = sc.parallelize( [X[i: i+batch_size] for i in range(0, data_size, batch_size) ], n_part) batch_vectors_rdd.mapPartitions( lambda x: cosine_similarity(x, X_bc.value)) Solution: RDD X_bc batch 16
  • 17. import pyspark (2) @F.udf(returnType=DoubleType()) def get_distance(i, j): sim = cosine_similarity(i, j) return sim vector_sdf.alias("i").join( vector_sdf.alias(“j"),
 col("i.id") < col("j.id")).select( F.col("i.id").alias("i"), F.col("j.id").alias("j"), get_distance("i.vector", "j.vector").alias( "cosine_similarity") ) Solution: DataFrame I J = I CROSS JOIN CosineSimilarity ( Column I.vector, Column J.vector ) 17
  • 18. 1K 10K 100K sklearn 0.1 1.5 OOM pyspark 2 5.1 70 Faiss 0.05 0.6 21 Comparison Calculating distances between coordinates from a 32*1k/10k/100k matrix
 (32 core, 32 GB memory) https://github.com/facebookresearch/faiss18
  • 19. Workflows of the data project Data Management Programming Project Management • ETL pipeline design • Data infra design • Metalayer design • Data toolkit • SQL • Framework design • Library development • Parallel computing • Machine learning • Git flow • Review 19
  • 20. Using Apache Airflow to schedule and monitor workflows 20
  • 22. Youtube - Landing Page 22
  • 23. Youtube - Video Page 23