Mathematics in Internet Information Retrieval

Zhi-Ming Ma, April 24, 2009, Xiamen. Email: mazm@amt.ac.cn http://www.amt.ac.cn/member/mazhiming/index.html

How can Google rank 2,040,000 pages in 0.11 seconds?

A main task of Internet (Web) information retrieval is the design and analysis of search engine (SE) algorithms, which involves a great deal of mathematics.

The Internet is a large-scale complex random network. The Earth is developing an electronic nervous system, a network with diverse nodes and links are…

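Search engine workflow (diagram). Labels: Web; web crawler; cache; page parser; pages; links & anchors; web graph builder; link map; web graph; link analysis; page ranks; index builder; inverted index; page & site database; cached pages; user interface; query; online part; offline part; indexing and ranking.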

Static Rank

Dynamic Rank

Research on Complex Networks and Information Retrieval

Outline

HITS, PageRank (1998). Jon Kleinberg, Cornell University

Nevanlinna Prize (2006): Jon Kleinberg

PageRank, the ranking system used by the Google search engine.

Markov chain describing surfing behavior

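More generally, we may consider a personalized d (a non-uniform jump distribution). PageRank is the unique positive eigenvector of the resulting transition matrix. By the strong ergodic theorem, it equals the long-run fraction of time the surfer spends on each page.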
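The surfing chain above can be illustrated with a small power iteration. This is a minimal sketch, not the formulation used on the slides: the function name `pagerank`, the default damping factor 0.85, and the choice to redistribute the mass of dangling pages according to d are all assumptions made for the example.

```python
def pagerank(out_links, alpha=0.85, d=None, tol=1e-12, max_iter=1000):
    """Power iteration for the stationary distribution of the surfing chain.

    out_links: dict mapping every page to the list of pages it links to
               (every linked page must itself be a key).
    alpha:     damping factor, the probability of following a link.
    d:         personalized jump distribution (dict page -> probability);
               uniform when None.
    """
    pages = sorted(out_links)
    n = len(pages)
    if d is None:
        d = {p: 1.0 / n for p in pages}
    pi = dict(d)
    for _ in range(max_iter):
        nxt = {p: 0.0 for p in pages}
        dangling = 0.0
        for p in pages:
            targets = out_links[p]
            if targets:
                share = alpha * pi[p] / len(targets)
                for q in targets:
                    nxt[q] += share
            else:
                # Assumption: a page without out-links jumps according to d.
                dangling += alpha * pi[p]
        for p in pages:
            nxt[p] += (1.0 - alpha + dangling) * d[p]
        diff = sum(abs(nxt[p] - pi[p]) for p in pages)
        pi = nxt
        if diff < tol:
            break
    return pi


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
    print(pagerank(graph))
```

Passing a non-uniform `d` gives the personalized variant; the iteration itself is unchanged.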

PageRank as a Function of the Damping Factor. Paolo Boldi, Massimo Santini, Sebastiano Vigna, DSI, Università degli Studi di Milano (WWW 2005 paper). Sections discussed: 3 General Behaviour; 3.1 Choosing the damping factor; 3.2 Getting close to 1.

Conjecture 1: the limit of PageRank as the damping factor tends to 1 is the limit distribution of P when the starting distribution is uniform.

Research results by our group

Weak points of PageRank

Letting Web Users Vote for Page Importance (06/09/09, Yuting Liu @ SIGIR'08)

Browsing Process

BrowseRank: User browsing graph. Vertex: web page. Edge: transition. Edge weight w_ij: the number of transitions from page i to page j. Staying time T_i: the time spent on page i. Reset probability: the normalized frequency with which a page appears as the first page of a session.
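The quantities on this slide can be collected from browsing logs in one pass. The sketch below is only an illustration: the session format (a list of `(page, staying_time)` visits) and the function name `build_browsing_graph` are assumptions, not the data layout used in the BrowseRank work.

```python
from collections import Counter, defaultdict


def build_browsing_graph(sessions):
    """Collect edge weights, staying times and reset probabilities.

    sessions: list of sessions; each session is a list of
              (page, staying_time) visits in browsing order (assumed format).
    """
    weights = Counter()        # (i, j) -> number of observed transitions
    stay = defaultdict(list)   # i -> observed staying times on page i
    first = Counter()          # i -> number of sessions that start at page i
    for session in sessions:
        if not session:
            continue
        first[session[0][0]] += 1
        for page, seconds in session:
            stay[page].append(seconds)
        for (src, _), (dst, _) in zip(session, session[1:]):
            weights[(src, dst)] += 1
    total = sum(first.values())
    # Reset probability: normalized frequency as first page of a session.
    reset = {page: count / total for page, count in first.items()}
    return weights, dict(stay), reset


if __name__ == "__main__":
    logs = [
        [("a", 10.0), ("b", 5.0)],
        [("a", 2.0), ("c", 1.0), ("b", 4.0)],
        [("b", 3.0)],
    ]
    w, t, r = build_browsing_graph(logs)
    print(dict(w), t, r)
```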

Mathematical Deduction: maximum likelihood estimation of staying time
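As a minimal sketch of this estimation step, assume the staying times on a page are noise-free samples from an exponential distribution, as in a continuous-time Markov process. The maximum likelihood estimate of the rate is then the number of observations divided by their sum. The function name is hypothetical, and the sketch ignores the observation noise that the deduction goes on to model.

```python
def mle_exponential_rate(staying_times):
    """Maximum likelihood estimate of the rate of an exponential distribution.

    Assumes noise-free, positive staying times; non-positive values are dropped.
    """
    times = [t for t in staying_times if t > 0]
    if not times:
        raise ValueError("need at least one positive staying time")
    return len(times) / sum(times)


if __name__ == "__main__":
    rate = mle_exponential_rate([2.0, 4.0, 6.0])
    print(rate, 1.0 / rate)  # estimated rate and estimated mean staying time
```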


Mathematical Deduction: assume the observation noise follows a chi-square distribution with k degrees of freedom.

Mathematical Deduction: an ideal relation would hold, but due to data sparseness we encounter challenges.

Mathematical Deduction: to tackle this challenge, we turn the estimation into optimization problems.


Experiments

Website-level: Find good

Website-level: Fight spam

BrowseRank: Letting Web Users Vote for Page Importance. Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, and Hang Li. July 23, 2008, Singapore, the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Best Student Paper.

Further Studies

Learning to Rank (diagram; labels: Model, Learning System, Ranking System, min Loss). Wei-Ying Ma, Microsoft Research Asia

Learning to rank in IR is a two-layer statistical learning problem

Document Level vs. Query Level

Microsoft Scholar Fellowship

The two-layer structure of the training data is not artificial but arises from the real world, especially from learning to rank in information retrieval.

Two-Layer Statistical Learning Framework: instances and descriptions of instances. Instances are the objects we are concerned with.

A score (or label) of a document; an order on a pair of documents; a permutation (list) of documents

Training Process: the objects are sampled i.i.d.; for each i, the associated samples follow their own distribution, and together they form the training data.

Empirical object-level loss, loss function, and expected object-level loss
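The two-layer structure changes how an empirical loss is averaged. The sketch below contrasts averaging per object first (for example per query) with pooling all samples together. It assumes the losses of individual samples are already computed, and both function names are hypothetical.

```python
def object_level_empirical_loss(losses_per_object):
    """Mean over objects of the mean loss of each object's samples."""
    per_object = [sum(ls) / len(ls) for ls in losses_per_object if ls]
    return sum(per_object) / len(per_object)


def instance_level_empirical_loss(losses_per_object):
    """Mean loss over all samples pooled together, ignoring the objects."""
    pooled = [x for ls in losses_per_object for x in ls]
    return sum(pooled) / len(pooled)


if __name__ == "__main__":
    # One query with four badly ranked samples, one with a single good sample.
    losses = [[1.0, 1.0, 1.0, 1.0], [0.0]]
    print(object_level_empirical_loss(losses))    # each query counts equally
    print(instance_level_empirical_loss(losses))  # large queries dominate
```

With object-level averaging each query contributes equally, so a query with many samples cannot dominate the estimate.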

Generalization Analysis Based on Stability Theory

Definition: We say that an algorithm possesses object-level uniform leave-one-out stability (abbreviated as object-level stability) if the function learned from the training data and the function learned from the training data with one object left out incur uniformly close losses.

Generalization Based on Object-level Stability: a bound that holds with high probability, stated in terms of the object-level stability and the number of training objects.

Note: the bound makes sense only under a certain condition, which can be satisfied in many practical cases. As case studies, we investigate Ranking SVM and RankBoost. We show that, after query-level normalization is introduced into its objective function, Ranking SVM has query-level stability. For RankBoost, query-level stability can be achieved if both query-level normalization and regularization are introduced into its objective function. These analyses largely agree with our experiments and with the experiments reported by Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon (2006) [5] and in [11].
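To make query-level normalization concrete, here is a hedged sketch of a Ranking SVM style objective in which the pairwise hinge loss of each query is divided by that query's number of pairs before averaging over queries. The exact objective analyzed by the authors is not shown in this text; the squared-norm regularizer, the parameter `reg`, and all names below are assumptions.

```python
def dot(w, x):
    """Inner product of two equally long sequences of floats."""
    return sum(wi * xi for wi, xi in zip(w, x))


def normalized_ranking_svm_objective(w, queries, reg=0.1):
    """Regularized pairwise hinge loss with query-level normalization.

    w:       weight vector of a linear ranking function.
    queries: list of queries; each query is a list of
             (x_preferred, x_other) feature-vector pairs.
    reg:     weight of the squared-norm regularizer (assumed form).
    """
    total = 0.0
    used = 0
    for pairs in queries:
        if not pairs:
            continue
        hinge = sum(max(0.0, 1.0 - (dot(w, a) - dot(w, b))) for a, b in pairs)
        total += hinge / len(pairs)  # query-level normalization
        used += 1
    risk = total / used if used else 0.0
    return reg * dot(w, w) + risk


if __name__ == "__main__":
    w = [1.0, 0.0]
    queries = [
        [([2.0, 0.0], [0.0, 0.0])],
        [([0.0, 0.0], [0.0, 0.0]), ([1.0, 0.0], [0.0, 0.0])],
    ]
    print(normalized_ranking_svm_objective(w, queries))
```

Without the division by `len(pairs)`, a query with many document pairs would contribute proportionally more to the objective than a query with few.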

Query-level Empirical Risk; Generalization Bound

Generalization Bounds Comparison (including Modified RSVM)

RankBoost with Query-level Normalization and Regularization: query-level normalization alone cannot give RankBoost query-level stability.

Experimental Results (I)

Experimental Results (II)

Future Problems and Challenges
