Click Log Mining CS598

Query Log Mining
Yandex Challenge 2011
Nikita Spirin, Shih-Wen Huang, Shuo Yang, Anirudh Ravula

Search logs are used to improve search
• Learn a ranking function
– Users click on meaningful results
• Personalize search based on users' history
– Previous searches reveal a user's interests
• Identify spammers
– Bots click on suspicious websites more often
• Tune contextual advertising models
• Recommend and disambiguate queries
– See also “java programming” vs. “java coffee”

Yandex QLM Challenge 2011 goals
• Learn a ranking function
– For a given query, provide an ordered list of URLs using
the information from the log
• Plan for today
– Task description
– General framework: learning to rank (L2R)
– Features for L2R
– Preferences extraction for L2R
– Ranking algorithms
– Collaborative Filtering and graph-based approaches
– Experiments
– Future plans for improvement

Task description: Input to the challenge
• Query log
– Query action
SessionID TimePassed QUERY QueryID RegionID ListOfURLs
– Click action
SessionID TimePassed CLICK URLID
• Training relevance labels from {0,1} set
QueryID RegionID URLID RelevanceLabel
• Testing query/region pairs
– The goal is to provide relevant URLs for these new
query/region pairs
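
The two record types above can be parsed with a few lines of Python. This is a minimal sketch, assuming whitespace-separated fields, integer IDs, and the literal action tokens QUERY and CLICK as written on the slide; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class QueryAction:
    session_id: int
    time_passed: int
    query_id: int
    region_id: int
    urls: List[int]  # URL IDs in the order they were shown


@dataclass
class ClickAction:
    session_id: int
    time_passed: int
    url_id: int


def parse_log_line(line: str) -> Union[QueryAction, ClickAction]:
    """Parse one log record (field layout as on the slide, separators assumed)."""
    fields = line.split()
    session_id, time_passed, action = int(fields[0]), int(fields[1]), fields[2]
    if action == "QUERY":
        return QueryAction(session_id, time_passed, int(fields[3]),
                           int(fields[4]), [int(u) for u in fields[5:]])
    if action == "CLICK":
        return ClickAction(session_id, time_passed, int(fields[3]))
    raise ValueError(f"unknown action type: {action!r}")
```

If the real log uses different action tokens or separators, only the two string comparisons and the `split()` call need to change.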

Some real input data
• Snapshot of the real Yandex query log
SessionID Time Action QueryId RegionId URL URL URL

• Training relevance labels from {0,1} set
QueryId RegionId URL Relevance

Some statistics about the query log

• Unique queries: 30,717,251
• Unique URLs: 117,093,258
• Sessions: 43,977,859
• Total records in the log: 340,796,067
• Assessed query-region-url triples for the total
query set (training + test): 71,930
• Log size: 17 GB (doesn't fit into memory)
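
Because the log does not fit into memory, one option is to stream it once and keep counters only for the queries that have labels. The sketch below makes two assumptions that the slides do not confirm: all records of a session are contiguous in the file, and a click belongs to the most recent query of its session. The function name and the `wanted_queries` parameter are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def count_clicks(lines: Iterable[str],
                 wanted_queries: Set[str]) -> Tuple[Counter, Counter]:
    """Stream the log once; count shows and clicks per (query, region, url).

    Only queries in `wanted_queries` are tracked, so memory scales with
    the labeled set rather than with the full log.
    """
    shows: Counter = Counter()
    clicks: Counter = Counter()
    current_session = None
    context = None  # (query, region, shown URLs) of the session's latest tracked query
    for line in lines:
        f = line.split()
        session, action = f[0], f[2]
        if session != current_session:  # assumption: sessions are contiguous
            current_session, context = session, None
        if action == "QUERY":
            query, region, urls = f[3], f[4], f[5:]
            if query in wanted_queries:
                context = (query, region, set(urls))
                for url in urls:
                    shows[(query, region, url)] += 1
            else:
                context = None
        elif action == "CLICK":
            # Attribute the click to the session's most recent query.
            if context is not None and f[3] in context[2]:
                clicks[(context[0], context[1], f[3])] += 1
    return shows, clicks
```

Passing an open file handle as `lines` reads one record at a time, so only the two counters stay in memory.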

General Framework: Learning to Rank (L2R)

• Training formalization:
– Given an ordered set of ranks Y = {0,1} (0 < 1)
– Given a set of queries Q = {q_1, ..., q_n}
– A list of documents is associated with each query
D_q = {d_{q,1}, ..., d_{q,n(q)}}
– Factor ranking model:
• X_{qd} = (f_1(q, d), ..., f_m(q, d)), feature vector for a q-d pair
• Goal of L2R:
– Learn a ranker: X → Y
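
One common way to realize such a ranker is a scoring function over the feature vector, with documents sorted by score. The sketch below uses a linear score as an illustration only; the weights, feature values, and function name are hypothetical and not taken from the slides.

```python
from typing import Dict, List, Sequence


def rank_documents(features: Dict[str, Sequence[float]],
                   weights: Sequence[float]) -> List[str]:
    """Order the documents of one query by a linear score w . x_qd, highest first."""
    def score(doc: str) -> float:
        return sum(w * x for w, x in zip(weights, features[doc]))
    return sorted(features, key=score, reverse=True)


# Hypothetical feature vectors (f_1, f_2) for three documents of one query.
x_q = {"d1": (0.10, 0.0), "d2": (0.40, 1.0), "d3": (0.25, 0.0)}
print(rank_documents(x_q, weights=(1.0, 0.5)))  # ['d2', 'd3', 'd1']
```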

Subtasks of L2R from query logs

• Extract preferences (absolute, pairwise)
from a query log using click-through
statistics
• Generate features (factors) to give the
problem structure
• Learn a ranking algorithm

SVM for L2R = RankSVM
• Extract preferences from a query log based on
some heuristics
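
The slide does not say which heuristics were used. As one illustration, the widely cited "click > skip above" rule treats a clicked URL as preferred over every unclicked URL shown above it. A minimal sketch with a hypothetical function name:

```python
from typing import List, Set, Tuple


def skip_above_preferences(shown: List[str],
                           clicked: Set[str]) -> List[Tuple[str, str]]:
    """Return (preferred, less_preferred) URL pairs for one result list.

    Assumed heuristic: a clicked URL is preferred over every unclicked
    URL that was shown above it.
    """
    pairs = []
    for pos, url in enumerate(shown):
        if url in clicked:
            pairs.extend((url, above) for above in shown[:pos]
                         if above not in clicked)
    return pairs


print(skip_above_preferences(["u1", "u2", "u3", "u4"], {"u3"}))
# [('u3', 'u1'), ('u3', 'u2')]
```

A pairwise learner such as RankSVM can then be trained on the difference of the two feature vectors in each pair, with the preferred side as the positive class.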

Boosting for L2R = RankBoost
• Uses each feature as a decision stump
• Builds a linear weighted ensemble model

Ensemble Approach
• Generate multiple models by varying…
– Feature subsets
– Algorithm parameters
– Ranking models
– Model subsets
– Averaging strategies (weighted, quality-based,
etc.)
• Finally average [similar to CombMNZ]
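
For reference, CombMNZ sums a document's normalized scores across models and multiplies the sum by the number of models that returned the document. The sketch below assumes min-max normalization per model and counts every model that scored the document; how closely the authors' averaging follows this is not stated on the slide.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one model's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_mnz(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Fuse per-model scores: (sum of normalized scores) * (number of models)."""
    fused: Dict[str, float] = {}
    hits: Dict[str, int] = {}
    for run in runs:
        if not run:
            continue  # a model that returned nothing contributes nothing
        for doc, s in min_max(run).items():
            fused[doc] = fused.get(doc, 0.0) + s
            hits[doc] = hits.get(doc, 0) + 1
    return {doc: fused[doc] * hits[doc] for doc in fused}


runs = [{"a": 2.0, "b": 1.0, "c": 0.0}, {"a": 0.5, "b": 1.0}]
print(comb_mnz(runs))  # {'a': 2.0, 'b': 3.0, 'c': 0.0}
```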

Best result so far

0.642436

Future work
• Add more models
– SVMpref (reduction of L2R to classification)
– Direct optimization of AUC
– Experiment with more sophisticated ensemble
models (MonoRank, etc.)
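
AUC, named above as a direct optimization target, can be computed for binary labels as the fraction of (relevant, non-relevant) pairs that the model orders correctly, with ties counted as half. A minimal quadratic-time sketch; the function name is hypothetical.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC for labels in {0, 1}; ties between scores count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))  # 0.75
```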
