Query Log Mining
          Yandex Challenge 2011
Nikita Spirin, Shih-Wen Huang, Shuo Yang, Anirudh Ravula
Search logs are used to improve search
• Learn a ranking function
  – Users click on meaningful results
• Personalize search based on users' history
  – Previous user searches reveal users' interests
• Identify spammers
  – Bots click on suspicious websites more often
• Tune contextual advertising models
• Recommend and disambiguate queries
  – See also “java programming” vs. “java coffee”
Yandex QLM Challenge 2011 goals
• Learn a ranking function
  – For a given query provide a list of ordered URLs using
    the information from the log
• Plan for today
  –   Task description
  –   General framework: learning to rank (L2R)
  –   Features for L2R
  –   Preferences extraction for L2R
  –   Ranking algorithms
  –   Collaborative Filtering and graph-based approaches
  –   Experiments
  –   Future plans for improvement
Task description: Input to the challenge
• Query log
  – Query action
  SessionID TimePassed QUERY QueryID RegionID ListOfURLs
  – Click action
  SessionID TimePassed CLICK URLID
• Training relevance labels from {0,1} set
  QueryID RegionID URLID RelevanceLabel
• Testing query/region pairs
  – The goal is to provide relevant URLs for these new
    query/region pairs
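The two record layouts above can be read with a small parser. This is a minimal sketch, assuming the fields are tab-separated and that all IDs and TimePassed are integers (neither is stated on the slide); the names `QueryAction`, `ClickAction` and `parse_record` are hypothetical.

```python
from typing import List, NamedTuple, Union


class QueryAction(NamedTuple):
    session_id: int
    time_passed: int
    query_id: int
    region_id: int
    url_ids: List[int]


class ClickAction(NamedTuple):
    session_id: int
    time_passed: int
    url_id: int


def parse_record(line: str, sep: str = "\t") -> Union[QueryAction, ClickAction]:
    """Parse one log line into a QueryAction or a ClickAction.

    Assumption: fields are separated by `sep` and every ID is an integer.
    """
    fields = line.rstrip("\n").split(sep)
    session_id, time_passed, action = int(fields[0]), int(fields[1]), fields[2]
    if action == "QUERY":
        # SessionID TimePassed QUERY QueryID RegionID ListOfURLs
        return QueryAction(session_id, time_passed, int(fields[3]),
                           int(fields[4]), [int(u) for u in fields[5:]])
    if action == "CLICK":
        # SessionID TimePassed CLICK URLID
        return ClickAction(session_id, time_passed, int(fields[3]))
    raise ValueError(f"unknown action type: {action!r}")
```

Keeping the two actions as separate types makes later steps (attributing clicks to the preceding query) explicit rather than relying on field positions.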
Some real input data
• Snapshot of the real Yandex query log
  SessionID Time Action QueryId RegionId URL URL URL
• Training relevance labels from {0,1} set
  QueryId RegionId URL Relevance
Some statistics about the query log

• Unique queries: 30,717,251
• Unique URLs: 117,093,258
• Sessions: 43,977,859
• Total records in the log: 340,796,067
• Assessed query-region-url triples for the total
  query set (training + test): 71,930
• Log size: 17 GB (doesn't fit into memory)
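Because the log does not fit into memory, click statistics have to be collected in a single streaming pass. The sketch below is one way to do that, assuming tab-separated fields and assuming that all records of a session are contiguous in the file (so only the current session's last query needs to be remembered). The function name `stream_click_counts` is hypothetical.

```python
from collections import Counter


def stream_click_counts(lines, sep="\t"):
    """Count impressions and clicks per (query, region, url) in one pass.

    `lines` can be any iterable of log lines, e.g. an open file object,
    so the whole log is never held in memory. Each click is attributed
    to the most recent QUERY of the same session (an assumption).
    """
    shows, clicks = Counter(), Counter()
    current_session, current_query = None, None
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 4:
            continue  # skip malformed lines
        session_id, action = fields[0], fields[2]
        if session_id != current_session:
            current_session, current_query = session_id, None
        if action == "QUERY" and len(fields) >= 5:
            current_query = (fields[3], fields[4])
            for url in fields[5:]:
                shows[current_query + (url,)] += 1
        elif action == "CLICK" and current_query is not None:
            clicks[current_query + (fields[3],)] += 1
    return shows, clicks
```

Typical use would be `with open(path) as f: shows, clicks = stream_click_counts(f)`. The counters themselves can still grow large for a log of this size, so in practice they might be restricted to the assessed query/region pairs.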
General Framework: Learning to Rank (L2R)

• Training formalization:
  – Given an ordered set of ranks Y = {0, 1} (0 < 1)
  – Given a set of queries Q = {q_1, ..., q_n}
  – A list of documents is associated with each query:
    D_q = {d_q,1, ..., d_q,n(q)}
  – Factor ranking model:
     • X_qd = (f_1(q, d), ..., f_m(q, d)), feature vector for the q-d pair
• Goal of L2R:
  – Learn a ranker: X → Y
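To make the formalization concrete, the sketch below builds the feature vector X_qd from a list of feature functions and orders the documents of a query by a scoring function. The linear scorer, the two feature functions and the toy click statistics are illustrative assumptions and are not taken from the slides.

```python
def feature_vector(query, doc, feature_fns):
    """X_qd = (f_1(q, d), ..., f_m(q, d))."""
    return [f(query, doc) for f in feature_fns]


def rank(query, docs, feature_fns, weights):
    """Order docs by a linear score over X_qd (an assumed ranker form)."""
    def score(doc):
        x = feature_vector(query, doc, feature_fns)
        return sum(w * v for w, v in zip(weights, x))
    return sorted(docs, key=score, reverse=True)


# Toy click statistics (hypothetical data, for illustration only).
CLICKS = {("q1", "u1"): 8, ("q1", "u2"): 1}
SHOWS = {("q1", "u1"): 10, ("q1", "u2"): 10, ("q1", "u3"): 10}


def click_through_rate(q, d):
    shows = SHOWS.get((q, d), 0)
    return CLICKS.get((q, d), 0) / shows if shows else 0.0


def was_ever_clicked(q, d):
    return 1.0 if CLICKS.get((q, d), 0) > 0 else 0.0


ranking = rank("q1", ["u3", "u2", "u1"],
               [click_through_rate, was_ever_clicked], [1.0, 0.5])
```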
Subtasks of L2R from query logs

• Extract preferences (absolute, pairwise)
  from a query log using click-through
  statistics
• Generate features (factors) to make the
  problem structured
• Learn a ranking algorithm
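The first subtask can be sketched as follows. The slides do not say which heuristics were used, so this example swaps in two common ones as assumptions: click-through rate as an absolute preference, and the "a clicked result is preferred over unclicked results shown above it" rule as a pairwise preference. Function names are hypothetical.

```python
def absolute_preference(n_clicks, n_shows):
    """Click-through rate as an absolute relevance estimate in [0, 1]."""
    return n_clicks / n_shows if n_shows else 0.0


def pairwise_preferences(shown_urls, clicked_urls):
    """Return (preferred, other) pairs for one result list.

    A clicked URL is taken to be preferred over every unclicked URL
    that was ranked above it in the list shown to the user.
    """
    clicked = set(clicked_urls)
    prefs = []
    for pos, url in enumerate(shown_urls):
        if url in clicked:
            for above in shown_urls[:pos]:
                if above not in clicked:
                    prefs.append((url, above))
    return prefs
```

A click on the top result yields no pairs under this rule, which reflects that such a click says little about the results below it.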
SVM for L2R = RankSVM
• Extract preferences from a query log based on
  some heuristics
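Once preference pairs are available, a RankSVM-style model can be trained on them. The sketch below is a simplified linear variant: it minimizes a hinge loss on the difference of the two feature vectors with plain stochastic subgradient descent. The hyperparameters and the function name `train_rank_svm` are assumptions, and this is not necessarily the solver the authors used.

```python
import random


def train_rank_svm(pairs, n_features, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Learn weights w such that w . x_preferred > w . x_other.

    pairs: list of (x_preferred, x_other) feature vectors.
    Loss per pair: max(0, 1 - w . (x_preferred - x_other)) plus an L2 penalty.
    """
    rng = random.Random(seed)
    pairs = list(pairs)
    w = [0.0] * n_features
    for _ in range(epochs):
        rng.shuffle(pairs)
        for x_pos, x_neg in pairs:
            diff = [a - b for a, b in zip(x_pos, x_neg)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Shrink step from the L2 regularizer.
            w = [wi * (1.0 - lr * reg) for wi in w]
            # Hinge subgradient step when the pair violates the margin.
            if margin < 1.0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def svm_score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))
```

Documents for a new query are then sorted by `svm_score` in descending order.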
Boosting for L2R = RankBoost
• Uses each feature as a decision stump
• Builds a linear weighted ensemble model
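The two bullets above can be illustrated with a simplified RankBoost sketch over preference pairs. Each weak ranker is a decision stump on a single feature, and the final score is a weighted sum of stumps. The exhaustive threshold search, the clamping constant and the function names are assumptions made for this example.

```python
import math


def stump(feature, threshold):
    """Weak ranker: 1 if the feature exceeds the threshold, else 0."""
    return lambda x: 1.0 if x[feature] > threshold else 0.0


def train_rankboost(pairs, n_features, rounds=10):
    """pairs: list of (x_preferred, x_other). Returns [(alpha, feature, threshold)]."""
    dist = [1.0 / len(pairs)] * len(pairs)  # distribution over pairs
    candidates = []
    for f in range(n_features):
        values = sorted({x[f] for pair in pairs for x in pair})
        candidates += [(f, (lo + hi) / 2.0) for lo, hi in zip(values, values[1:])]
    ensemble = []
    for _ in range(rounds):
        best = None
        for f, t in candidates:
            h = stump(f, t)
            # r measures how well h orders the pairs under the current weights.
            r = sum(d * (h(xp) - h(xn)) for d, (xp, xn) in zip(dist, pairs))
            if best is None or abs(r) > abs(best[0]):
                best = (r, f, t)
        if best is None or best[0] == 0.0:
            break
        r, f, t = best
        r = max(min(r, 1.0 - 1e-6), -1.0 + 1e-6)  # keep alpha finite
        alpha = 0.5 * math.log((1.0 + r) / (1.0 - r))
        h = stump(f, t)
        # Down-weight pairs that h orders correctly, up-weight the rest.
        dist = [d * math.exp(-alpha * (h(xp) - h(xn)))
                for d, (xp, xn) in zip(dist, pairs)]
        z = sum(dist)
        dist = [d / z for d in dist]
        ensemble.append((alpha, f, t))
    return ensemble


def boost_score(ensemble, x):
    """Linear weighted ensemble of stumps."""
    return sum(alpha * stump(f, t)(x) for alpha, f, t in ensemble)
```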
Ensemble Approach
• Generate multiple models by varying…
  – Feature subsets
  – Algorithm parameters
  – Ranking models
  – Model subsets
  – Averaging strategies (weighted, quality-based,
    etc.)
• Finally, average the outputs [similar to CombMNZ]
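A minimal sketch of the averaging step, assuming each model returns a dictionary of URL scores for one query. Scores are min-max normalized per model before a weighted average is taken, since raw scores from different rankers are not comparable. This is plain weighted averaging, not CombMNZ itself (CombMNZ additionally multiplies the summed score by the number of systems that returned the document). Function names are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one model's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {url: 0.0 for url in scores}
    return {url: (s - lo) / (hi - lo) for url, s in scores.items()}


def weighted_average(model_scores, weights=None):
    """Combine per-model score dicts into one ranking (best first).

    URLs missing from a model are treated as having score 0 for it.
    """
    if weights is None:
        weights = [1.0] * len(model_scores)
    total = sum(weights)
    combined = {}
    for w, scores in zip(weights, model_scores):
        for url, s in min_max_normalize(scores).items():
            combined[url] = combined.get(url, 0.0) + w * s / total
    return sorted(combined, key=combined.get, reverse=True)
```

Quality-based weighting would correspond to setting `weights` from each model's validation score.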
Best result so far
• 0.642436
Future work
• Add more models
  – SVMpref (reduction of L2R to classification)
  – Direct optimization of AUC
  – Experiment with more sophisticated ensemble
    models (MonoRank, etc.)
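For reference, AUC with binary labels can be computed as the fraction of (relevant, irrelevant) pairs that the ranker orders correctly, which is why pairwise methods are a natural route to optimizing it. The sketch below is a straightforward quadratic-time version that counts ties as half; the function name is hypothetical.

```python
def auc(labels, scores):
    """Fraction of (relevant, irrelevant) pairs ordered correctly.

    labels: 0/1 relevance labels; scores: model scores, same order.
    Ties in score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one relevant and one irrelevant item")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))
```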

Click Log Mining CS598
