Exploring the Future Potential of AI-Enabled Smartphone Processors
Measurement and modeling of the web and related data sets
1. IMA Tutorial (part II): Measurement and modeling of the web and related data sets Andrew Tomkins IBM Almaden Research Center May 5, 2003 Title slide
25. Overview WWW Distill KB1 KB2 KB3 Goal: Create higher-level "knowledge bases" of web information for further processing. [Kumar, Raghavan, Rajagopalan, Tomkins 1999]
26.
27.
28. Web Communities Fishing Outdoor Magazine Bill's Fishing Resources Linux Linux Links LDP Different communities appear to have very different structure.
29. Web Communities Fishing Outdoor Magazine Bill's Fishing Resources Linux Linux Links LDP But both contain a common “footprint”: two pages ( ) that both Point to three other pages in common ( )
30. Communities and cores Example K 2,3 Definition: A "core" K ij consists of i left nodes, j right nodes, and all left->right edges. Critical facts: 1. Almost all communities contain a core [expected] 2. Almost all cores betoken a community [unexpected]
33. Enumerating cores a a belongs to a K 2,3 if and only if some node points to b1, b2, b3. b2 b1 b3 Inclusion/Exclusion Pruning Clean data by removing: mirrors (true and approximate) empty pages, too-popular pages, nepotistic pages Preprocessing When no more pruning is possible, finish using database techniques Postprocessing
34. Results for cores 3 5 7 9 0 20 40 60 80 100 Thousands i=3 i=4 i=5 i=6 Number of cores found by Elimination/Generation 3 5 7 9 0 20 40 60 80 Thousands i=3 i=4 Number of cores found during postprocessing
40. Example Time I’ve been thinking about your idea with the asparagus… Uh huh I think I see… Uh huh Yeah, that’s what I’m saying… So then I said “Hey, let’s give it a try” And anyway she said maybe, okay? Most likely “hidden” sequence: 0.005 1 2 0.01 State 1: Output rate: very low State 2: Output rate: very high Pr[2] ~ 10 Pr[2] ~ 10 Pr[2] ~ 7 Pr[2] ~ 2 Pr[2] ~ 5 Pr[2] ~ 2 Pr[2] ~ 5 Pr[1] ~ 2 Pr[1] ~ 1 Pr[1] ~ 2 Pr[1] ~ 10 Pr[1] ~ 5 Pr[1] ~ 10 Pr[1] ~ 1 2 2 2 1 1 1 1
41.
42. Integrating bursts and graph analysis Wired magazine publishes an article on weblogs that impacts the tech community Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption [KNRT03] Number of communities identified automatically as exhibiting “bursty” behavior – measure of cohesiveness of the blogspace Number of blog pages that belong to a community Number of blog communities
43. IMA Tutorial (part III): Generative and probabilistic models of data May 5, 2003 Title slide
44.
45.
46.
47.
48.
49.
50.
51.
52.
53.
54. Constructing a book: snapshot at time t When in the course of human events, it becomes necessary… Current word frequencies: Let f(i,t) be the number of words of count i at time t Count Word Rank 11,325 4,791 … 3 2 1 “ ...” “ ...” 5 “ necessary” 1 “ neccesary” … “ ...” 300 “ from” 600 “ of” 1000 “ the”
55.
56. Constructing a book: snapshot at time t Current word frequencies: Let f(i,t) be the number of words of count i at time t Pr[“the”] = (1- ) 1000 / K Pr[“of”] = (1- ) 600 / K Pr[some count-1 word] = (1- ) 1 * f(1,t) / K K = if(i,t) Count Word Rank 11,325 4,791 … 3 2 1 “ ...” “ ...” 5 “ necessary” 1 “ neccesary” … “ ...” 300 “ from” 600 “ of” 1000 “ the”
57. What’s going on? One unique word (which occurs 1 or more times) 1 2 3 4 5 6 Each word in bucket i occurs i times in the current document … .
58. What’s going on? 1 With probability a new word is introduced into the text 2 3 4 5 6
59. What’s going on? 1 4 How many times do words in this bucket occur? With probability 1- an existing word is reused 2 3 5 6
60. What’s going on? 2 3 4 Size of bucket 3 at time t+1 depends only on sizes of buckets 2 and 3 at time t ? ? Must show: fraction of balls in 3 rd bucket approaches some limiting value
61.
62.
63. Random graph models G(n,p) Web indeg > 1000 k23's 4-cliques 0 0 0 100000 125000 many Traditional random graphs [Bollobas 85] are not like the web! Is there a better model?
64.
65.
66.
67.
68. Example New node arrives With probability , it links to a uniformly-chosen page
69. Example To copy, it first chooses a page uniformly Then chooses a uniform out-edge from that page Then links to the destination of that edge ("copies" the edge) Under copying, your rate of getting new inlinks is proportional to your in-degree. With probability (1- ), it decides to copy a link.
70. Degree sequences in this model Pr[page has k inlinks] =~ k Heavy-tailed inverse polynomial degree sequences. Pages like netscape and yahoo exist. Many cores, cliques, and other dense subgraphs ( = 1/11 matches web) -(2- ) (1- )