
Recommender System with Distributed Representation




In recent years, Word2Vec and its extensions (Doc2Vec, Paragraph2Vec, etc.) have received a lot of attention in the NLP field.

In this slide deck, we introduce our approach for applying Doc2Vec to an item recommender system, and report the results of a performance evaluation of the Doc2Vec-based recommender using Rakuten Singapore EC data.



  1. 分散表現を用いた商品レコメンダーシステムの構築と評価
     (Recommender System with Distributed Representation)
     Thuy Phi Van (1,2), Chen Liu (2) and Yu Hirate (2)
     1. Computational Linguistics Laboratory, NAIST
     2. Rakuten Institute of Technology, Rakuten, Inc.
     {ar-thuy.phivan, chen.liu, yu.hirate}@rakuten.com
  2. Part 1: Distributed Representations for Words, Docs and Categories
  3. Distributed Representations for Words
     • Similar words are projected to similar vectors.
     • Relationships between words can be expressed as simple vector arithmetic. [T. Mikolov et al., NIPS 2013]
     • Analogy: v("woman") − v("man") + v("king") ≈ v("queen")
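The analogy on this slide is plain vector arithmetic followed by a nearest-neighbor lookup under cosine similarity. A minimal sketch with hand-made 2-D toy vectors (these are illustrative stand-ins, not learned embeddings, which typically have hundreds of dimensions):

```python
import numpy as np

# Toy 2-D "embeddings" chosen so the analogy works out exactly.
vocab = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([5.0, 0.0]),
    "queen": np.array([5.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via v(b) - v(a) + v(c)."""
    target = vocab[b] - vocab[a] + vocab[c]
    # Exclude the query words themselves, as word2vec tools do.
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "woman", "king"))  # → queen
```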
  4. Two Models in word2vec
     • CBoW: given the context words, predict the probability of the target word.
     • Skip-gram: given the target word, predict the probability of the context words.
     (Both diagrams: input → projection → output, relating the context vectors v(t−2), v(t−1), v(t+1), v(t+2) to the target v(t).)
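Both models start from the same enumeration of (target, context) pairs inside a sliding window. A small sketch of how skip-gram training pairs are generated (CBoW would instead group all context words per target):

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate (target, context) training pairs as in skip-gram:
    for each position t, every word within `window` of t is a context."""
    pairs = []
    for t, target in enumerate(tokens):
        for c in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if c != t:
                pairs.append((target, tokens[c]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat"], window=1))
# → [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```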
  5. Sample Results of word2vec Trained on Wikipedia Data
     query: nagoya → osaka 0.799002, chiba 0.762829, fukuoka 0.755166, sendai 0.731760, yokohama 0.729205, kobe 0.726732, shiga 0.705707, niigata 0.699777, aichi 0.692371, hyogo 0.687128, saitama 0.685672, tokyo 0.671428, sapporo 0.670466, kumamoto 0.660786, japan 0.658769, kitakyushu 0.654265, wakayama 0.652783, shizuoka 0.624380
     query: coffee → cocoa 0.603515, robusta 0.565269, beans 0.565232, bananas 0.565207, cinnamon 0.556771, citrus 0.547495, espresso 0.542120, caff 0.542082, infusions 0.538069, tea 0.532565, cassava 0.524657, pineapples 0.523557, coffea 0.512420, tapioca 0.510727, sugarcane 0.508203, yams 0.507347, avocados 0.507072, arabica 0.506231
  6. Doc2Vec (Paragraph2Vec) [Q. Le et al., ICML 2014]
     • Assigns a "document vector" to each document.
     • The document vector can be used as a feature of the document and to compute similarity between documents.
     • Two variants: PV-DM (the doc vector joins the context word vectors to predict the target word) and PV-DBoW (the doc vector alone predicts the words).
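A rough sketch of the PV-DM forward step, assuming randomly initialized toy weight matrices (all sizes and names here are illustrative): the document vector is averaged with the context word vectors, and a softmax over the vocabulary predicts the target word.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, n_docs = 6, 4, 3

W_words = rng.normal(size=(vocab_size, dim))   # input word vectors
W_docs  = rng.normal(size=(n_docs, dim))       # one vector per document
W_out   = rng.normal(size=(dim, vocab_size))   # output (softmax) weights

def pv_dm_predict(doc_id, context_ids):
    """PV-DM: average the doc vector with the context word vectors,
    then softmax over the vocabulary to predict the target word."""
    h = (W_docs[doc_id] + W_words[context_ids].sum(axis=0)) / (1 + len(context_ids))
    scores = h @ W_out
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

p = pv_dm_predict(0, [1, 2, 4])  # probability distribution over the vocabulary
```

Training would backpropagate through these weights; in practice negative sampling replaces the full softmax.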
  7. Category2Vec [Marui et al., NLP 2015] https://github.com/rakuten-nlp/category2vec
     • Assigns a "category vector" to each category.
     • Each document has its own category information.
     • Two variants: CV-DM and CV-DBoW, which extend PV-DM/PV-DBoW with a category vector v(cat) alongside v(doc).
  8. Part 2: Applying Doc2Vec to an Item Recommender
  9. Recommender Systems in an EC Service
     • Item2Item recommender: given an item, show items relevant to that item.
     • User2Item recommender: given a user, show items relevant to that user.
 10. Distributed Representations for Users and Items
     • A document is a sequence of words with context; a user is a sequence of item views with the user's intention.
     • Text domain: set of documents → vectors for words and documents → sim{word, word}, sim{doc, word}, sim{doc, doc}
     • EC domain: set of user behaviors → vectors for items and users → sim{item, item}, sim{user, item}, sim{user, user}
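Under this analogy, feeding user behavior into a Doc2Vec implementation amounts to treating each user's item sequence as a "document" tagged with the user ID. A minimal sketch of that mapping (user and item names are toy data; the tuple shape mirrors gensim's TaggedDocument(words=..., tags=[...])):

```python
# Each user's item-view sequence plays the role of a document,
# with the user ID as the document tag.
user_sessions = {
    "user_1": ["item_a", "item_b", "item_c"],
    "user_2": ["item_b", "item_d"],
}

# (words, tags) pairs ready to hand to a Doc2Vec-style trainer.
corpus = [(items, [user]) for user, items in user_sessions.items()]
print(corpus)
```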
 11. Dataset Preparation
     • Service: Rakuten Singapore (www.rakuten.com.sg), Rakuten's EC service in Singapore, started in 2014.
     • Data sources: purchase history data, click-through data.
     • Term: Jan. 2015 – Oct. 2015.
 12. Dataset Preparation (Purchase History Data)
     • A set of items purchased by the same user.

       User ID  | Set of purchased items
       user #1  | {item_1.1, item_1.2}
       user #2  | {item_2.1, item_2.2, item_2.3}
       ...      | ...
       user #N  | {item_N.1}
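The per-user item sets above come from grouping a flat purchase log by user. A minimal sketch with toy rows (user and item names are illustrative):

```python
from collections import defaultdict

purchase_log = [  # (user, item) rows, toy data
    ("user_1", "item_a"), ("user_2", "item_b"),
    ("user_1", "item_c"), ("user_2", "item_d"), ("user_2", "item_e"),
]

# Group purchases into one item sequence per user.
baskets = defaultdict(list)
for user, item in purchase_log:
    baskets[user].append(item)

print(dict(baskets))
# → {'user_1': ['item_a', 'item_c'], 'user_2': ['item_b', 'item_d', 'item_e']}
```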
 13. Dataset Preparation (Click-Through Data)
     • A set of users' sessions.
     • Session: a sequence of page views with the same cookie, split wherever the time interval between views exceeds 2 hours.

       User ID  | Set of sessions
       user #1  | {{item_1.1.1, item_1.1.2, ..., item_1.1.n}, {item_1.2.1, ...}}
       user #2  | {{item_2.1.1, item_2.1.2}}
       ...      | ...
       user #N  | {{item_N.1.1, item_N.1.2, ..., item_N.1.n}, {item_N.2.1, ...}}
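The 2-hour-gap rule can be sketched as a single pass over a user's time-ordered click stream (timestamps and item names below are toy data):

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(hours=2)

def split_sessions(page_views):
    """Split one user's time-ordered (timestamp, item) stream into sessions,
    starting a new session whenever the gap exceeds 2 hours."""
    sessions = []
    last_ts = None
    for ts, item in page_views:
        if last_ts is None or ts - last_ts > SESSION_GAP:
            sessions.append([])       # open a new session
        sessions[-1].append(item)
        last_ts = ts
    return sessions

views = [
    (datetime(2015, 1, 1, 9, 0), "item_a"),
    (datetime(2015, 1, 1, 9, 30), "item_b"),
    (datetime(2015, 1, 1, 13, 0), "item_c"),   # > 2h gap → new session
]
print(split_sessions(views))  # → [['item_a', 'item_b'], ['item_c']]
```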
 14. Dataset Properties
     • More than 60% of sessions end after a single page request.
     • More than X% of users visited rakuten.com.sg only once.
     (Figures: distribution of session length; distribution of session count.)
 15. Item2Item Recommender (Example)
     (Examples shown for click-through data and purchase history data.)
 16. Part 3: Evaluation
 17. Evaluation Metrics
     • Training data: 2015/01/01 – 2015/08/31; test data: 2015/09/01 – 2015/10/31.
     • N: total number of users common to the training and test data.
     • For each user, the recommender system (RS) predicts the top-20 items.
     • "Hit": any of the 20 recommended items for a user appears in that user's test data.
     • hit-rate = (number of hits) / N
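The hit-rate definition above can be written down directly. A minimal sketch with toy users and items (only the set intersections matter; in the real evaluation each recommendation list has 20 items):

```python
def hit_rate(recommendations, test_purchases):
    """Hit-rate over users present in both training and test data.
    recommendations: user -> list of top-k recommended items (from training data)
    test_purchases:  user -> set of items that user interacted with in the test period"""
    common = set(recommendations) & set(test_purchases)
    hits = sum(
        1 for u in common
        if set(recommendations[u]) & test_purchases[u]  # at least one overlap = hit
    )
    return hits / len(common)

recs = {"u1": ["a", "b"], "u2": ["c"], "u3": ["d"]}
test = {"u1": {"b", "x"}, "u2": {"y"}, "u3": {"d"}}
print(hit_rate(recs, test))  # → 0.6666666666666666  (2 hits out of 3 users)
```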
 18. Evaluations
     1. Parameter optimization: find an optimal parameter set and identify the parameters that matter most for building a good model.
     2. Performance comparison with conventional recommender algorithms: item similarity and matrix factorization.
 19. Parameter Optimization

       Parameter | Values                                          | Explanation
       Size      | 50, 100, 200, 300, 400, 500                     | Dimensionality of the vectors
       Window    | 1, 3, 5, 8, 10, 15                              | Maximum number of context items the training algorithm takes into account
       Negative  | 0, 5, 10, 15, 20, 25                            | Number of "noise words" to draw (usually 5–20)
       Sample    | 0, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8     | Sub-sampling threshold for frequent items
       Min-count | 1, ..., 20                                      | Items appearing fewer than min-count times are ignored
       Iteration | 10, 15, 20, 25, 30                              | Number of iterations for building the model

     • Best setting: Size 300, Window 8, Negative 10, Sample 1e-5, min_count 3, Iteration 20 → hit-rate 0.1821
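A search over a grid like the one above is a product over the value lists, scoring each combination. A hedged sketch (the grid is a subset of the slide's values, and `evaluate` is a placeholder for training a model and measuring its hit-rate):

```python
from itertools import product

# Subset of the parameter grid from the slide, for brevity.
grid = {
    "size":     [50, 100, 200, 300],
    "window":   [1, 3, 5, 8],
    "negative": [0, 5, 10],
}

def evaluate(params):
    """Placeholder: in the real pipeline this trains a Doc2Vec model
    with `params` and returns its hit-rate on the test period."""
    return 0.0  # stand-in score

# Try every combination and keep the best-scoring parameter set.
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]
best = max(candidates, key=evaluate)
```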
 20. Parameter Optimization (hit-rate vs. each parameter)
     (Six charts of hit-rate (%) against one parameter each: Size peaks at 300 (18.2%); Window peaks at 8 (18.2%); Negative peaks at 10 (18.2%); Sample peaks at 1e-5 (18.2%) and collapses below 1e-6; Min_count plateaus around 18.8–19.0% for values ≥ 5; Iteration converges to 18.2% from 20 onward.)
 21. Performance Comparison with Conventional Recommender Algorithms
     • Item similarity: Jaccard similarity of the user sets of two items.
     • Matrix factorization: U × I user-item matrix, dim = 32, max iterations = 25.
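The item-similarity baseline scores an item pair by the Jaccard similarity of the sets of users who interacted with each item. A minimal sketch with toy data (user and item names are illustrative):

```python
def jaccard(users_a, users_b):
    """Jaccard similarity between the user sets of two items:
    |intersection| / |union|, with the empty-sets edge case handled."""
    if not users_a and not users_b:
        return 0.0
    return len(users_a & users_b) / len(users_a | users_b)

# item -> set of users who purchased it (toy data)
buyers = {
    "item_a": {"u1", "u2", "u3"},
    "item_b": {"u2", "u3", "u4"},
}
print(jaccard(buyers["item_a"], buyers["item_b"]))  # → 0.5
```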
 22. Performance Comparison with Conventional Algorithms
     (Bar chart of hit-rate (%) for Item Similarity, Matrix Factorization and Doc2Vec.)
     • The Doc2Vec-based algorithm performed the best.
 23. Conclusion and Future Work
     • Conclusion
       • Developed a distributed-representation-based RS.
       • Applied it to a dataset generated from Rakuten Singapore click-through data.
       • Confirmed that the distributed-representation-based RS performed better than conventional RS algorithms.
     • Future work
       • Apply distributed-representation-based RS to other datasets: Rakuten Singapore product data, Rakuten (Japan) Ichiba click-through data.
       • Hybrid model (content-based RS × user-behavior-based RS).
       • Testing in the real service.
 24. Thank you
