SlideShare ist ein Scribd-Unternehmen logo
1 von 78
IMA Tutorial (part II): Measurement and modeling of the web and related data sets Andrew Tomkins IBM Almaden Research Center May 5, 2003 Title slide
Setup ,[object Object],[object Object]
Context ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Focus Areas ,[object Object],[object Object],[object Object],[object Object]
One view of the Internet: Inter-Domain Connectivity ,[object Object],[object Object],[object Object],Core Shells: 1 2 3 [Tauro,   Palmer, Siganos, Faloutsos, 2001 Global Internet]
Another view of the web: the hyperlink graph ,[object Object],[object Object],[object Object]
Getting started – structure at the hyperlink level ,[object Object],[object Object],[object Object],[Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, Tomkins, Wiener, 2001]
Terminology ,[object Object],[object Object]
Data ,[object Object],[object Object],[object Object]
Breadth-first search from random starts ,[object Object]
A Picture of (~200M) pages.
Some distance measurements ,[object Object],[object Object],[object Object],[object Object]
Facts (about the crawl). ,[object Object],The distribution of indegrees on the web is given by a Power Law --- Heavy-tailed distribution, with many high-indegree pages (eg, Yahoo)
Analysis of power law Pr [ page has  k  inlinks ]  =~  k -2.1 Pr [ page has >  k  inlinks ]  =~  1/ k Pr [ page has  k  outlinks ]  =~  k -2.7 Corollary:
Component sizes. ,[object Object]
Other observed power laws in the web ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[Faloutsos, Faloutsos, Faloutsos 99] [Bharat, Chang, Henzinger, Ruhl 02]
More Characterization: Self-Similarity
Ways to Slice the Web ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],We call these slices “Thematically Unified Communities”, or TUCs
Self-Similarity on the Web ,[object Object],[object Object],[object Object],[object Object],[object Object]
In particular… ,[object Object],[object Object],[object Object],[object Object],[object Object]
Is this surprising? ,[object Object],[object Object],[object Object],[object Object]
A structural explanation ,[object Object]
The Navigational Backbone Each TUC contains a large SCC that is well-connected to the SCCs of other TUCs
Information Extraction from Large Graphs
Overview WWW Distill KB1 KB2 KB3 Goal:  Create higher-level "knowledge bases" of web information for further processing. [Kumar, Raghavan, Rajagopalan, Tomkins 1999]
Many approaches to this problem ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
General approach ,[object Object],[object Object],[object Object]
Web Communities Fishing Outdoor Magazine Bill's Fishing Resources Linux Linux Links LDP Different communities appear to have very different structure.
Web Communities Fishing Outdoor Magazine Bill's Fishing Resources Linux Linux Links LDP But both contain a common “footprint”: two pages (  ) that both Point to three other pages in common (  )
Communities and cores Example K 2,3 Definition:  A "core" K ij consists of  i  left nodes, j  right nodes, and all left->right edges. Critical facts: 1. Almost all communities contain a core [expected] 2. Almost all cores betoken a community [unexpected]
Other footprint structures Newsgroup thread Web ring Corporate partnership Intranet fragment
Subgraph enumeration ,[object Object]
Enumerating cores a a belongs to a K 2,3 if and only if some node points to b1, b2, b3. b2 b1 b3 Inclusion/Exclusion Pruning Clean data by removing: mirrors (true and approximate) empty pages, too-popular pages, nepotistic pages Preprocessing When no more pruning is possible, finish using database techniques Postprocessing
Results for cores 3 5 7 9 0 20 40 60 80 100 Thousands i=3 i=4 i=5 i=6 Number of cores found by Elimination/Generation 3 5 7 9 0 20 40 60 80 Thousands i=3 i=4 Number of cores found during postprocessing
The cores are interesting (1) Implicit communities are defined by cores. (2) There are an order of  magnitude more of these.  (10 5+ ) (3) Can grow the core to the community using further processing. Explicit communities. ,[object Object],[object Object],[object Object],[object Object],Implicit communities ,[object Object],[object Object],[object Object],[object Object]
Elementary Schools in Japan ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
So… ,[object Object],[object Object],[object Object],[object Object]
A word on evolution
A word on evolution ,[object Object],[object Object],[object Object],[object Object],[object Object],[Kleinberg02]
Example Time I’ve been thinking about your idea with the asparagus… Uh huh I think I see… Uh huh Yeah, that’s what I’m saying… So then I said “Hey, let’s give it a try” And anyway she said maybe, okay? Most likely “hidden” sequence: 0.005 1 2 0.01 State 1: Output rate: very low State 2: Output rate: very high Pr[2] ~ 10 Pr[2] ~ 10 Pr[2] ~ 7 Pr[2] ~ 2 Pr[2] ~ 5 Pr[2] ~ 2 Pr[2] ~ 5 Pr[1] ~ 2 Pr[1] ~ 1 Pr[1] ~ 2 Pr[1] ~ 10 Pr[1] ~ 5 Pr[1] ~ 10 Pr[1] ~ 1 2 2 2 1 1 1 1
More bursts ,[object Object],[object Object],[object Object],[object Object]
Integrating bursts and graph analysis Wired magazine publishes an article on weblogs that impacts the tech community Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption [KNRT03] Number of communities identified automatically as exhibiting “bursty” behavior – measure of cohesiveness of the blogspace Number of blog pages that belong to a community Number of blog communities
IMA Tutorial (part III): Generative and probabilistic models of data May 5, 2003 Title slide
Probabilistic generative models ,[object Object],[object Object],[object Object],[object Object]
Models for Power Laws ,[object Object],[object Object],[object Object]
An Introduction to the Power Law ,[object Object],[object Object],[object Object],Exponentially-decaying distribution Power law distribution
Early Observations: Pareto on Income ,[object Object],[object Object],[object Object],[object Object]
Early Observations: Yule/Zipf ,[object Object],[object Object],[object Object],[object Object],[object Object]
Early Observations: Lotka on Citations ,[object Object]
Ranks versus Values ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Equivalence of rank versus value formulation ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[Bookstein90, Adamic99]
Early modeling work ,[object Object],[object Object],[object Object]
A model of Simon ,[object Object],[object Object],[object Object]
Constructing a book: snapshot at time  t When in the course of human events, it becomes necessary… Current word frequencies:  Let  f(i,t)  be the number of words of count  i  at time  t Count Word Rank 11,325 4,791 … 3 2 1 “ ...” “ ...” 5 “ necessary” 1 “ neccesary” … “ ...” 300 “ from” 600 “ of” 1000 “ the”
The Generative Model ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Constructing a book: snapshot at time  t Current word frequencies:  Let  f(i,t)  be the number of words of count  i  at time  t Pr[“the”] = (1-   ) 1000 / K Pr[“of”] = (1-   ) 600 / K Pr[some count-1 word] = (1-   ) 1 *  f(1,t)  / K K =   if(i,t) Count Word Rank 11,325 4,791 … 3 2 1 “ ...” “ ...” 5 “ necessary” 1 “ neccesary” … “ ...” 300 “ from” 600 “ of” 1000 “ the”
What’s going on? One unique word (which occurs 1 or more times) 1 2 3 4 5 6 Each word in bucket  i  occurs  i  times in the current document … .
What’s going on? 1 With probability    a new word is introduced into the text 2 3 4 5 6
What’s going on? 1 4 How many times do words in this bucket occur? With probability 1-   an existing word is reused 2 3 5 6
What’s going on? 2 3 4 Size of bucket 3 at time  t+1  depends only on sizes of buckets 2 and 3 at time  t ? ? Must show: fraction of balls in 3 rd  bucket approaches some limiting value
Models for power laws in the web graph ,[object Object],[object Object],[object Object],[object Object],[object Object]
Why create such a model? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Random graph models G(n,p) Web indeg > 1000 k23's 4-cliques 0 0 0 100000 125000 many Traditional random graphs [Bollobas 85] are not like the web! Is there a better model?
Desiderata for a graph model ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Page creation on the web ,[object Object],[object Object],Model idea:  new pages add links by "copying" them from existing pages
Generally, would require… ,[object Object],[object Object],[object Object],[object Object],[object Object]
A specific model ,[object Object],[object Object],[object Object],[object Object],[object Object]
Example New node arrives With probability   , it links to a uniformly-chosen page
Example To copy, it first chooses a page uniformly Then chooses a uniform out-edge from that page Then links to the destination of that edge ("copies" the edge) Under copying, your rate of getting new inlinks is proportional to your in-degree. With probability (1-  ), it decides to copy a link.
Degree sequences in this model Pr[page has  k  inlinks]  =~  k Heavy-tailed inverse polynomial degree sequences. Pages like netscape and yahoo exist. Many cores, cliques, and other dense subgraphs (   = 1/11 matches web) -(2-  ) (1-  )
Model extensions ,[object Object],[object Object],[object Object],[object Object]
A model of Mandelbrot ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Discussion of Mandelbrot’s model ,[object Object],[object Object]
Heuristically Optimized Trade-offs ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[Fabrikant, Koutsoupias, Papadimitriou 2002]
Monkeys on Typewriters ,[object Object],[object Object],[object Object],[object Object],[object Object]
Other Distributions ,[object Object],[object Object],[object Object],[object Object]
Quick characterization of lognormal distributions ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
One final direction… ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Weitere ähnliche Inhalte

Was ist angesagt?

A LINK-BASED APPROACH TO ENTITY RESOLUTION IN SOCIAL NETWORKS
A LINK-BASED APPROACH TO ENTITY RESOLUTION IN SOCIAL NETWORKSA LINK-BASED APPROACH TO ENTITY RESOLUTION IN SOCIAL NETWORKS
A LINK-BASED APPROACH TO ENTITY RESOLUTION IN SOCIAL NETWORKScsandit
 
Geo community-based broadcasting for data dissemination in mobile social netw...
Geo community-based broadcasting for data dissemination in mobile social netw...Geo community-based broadcasting for data dissemination in mobile social netw...
Geo community-based broadcasting for data dissemination in mobile social netw...IEEEFINALYEARPROJECTS
 
Exploring Social Media with NodeXL
Exploring Social Media with NodeXL Exploring Social Media with NodeXL
Exploring Social Media with NodeXL Shalin Hai-Jew
 
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...Xiaohan Zeng
 
IT6701 Information Management - Unit I
IT6701 Information Management - Unit I  IT6701 Information Management - Unit I
IT6701 Information Management - Unit I pkaviya
 
APPLICATION OF CLUSTERING TO ANALYZE ACADEMIC SOCIAL NETWORKS
APPLICATION OF CLUSTERING TO ANALYZE ACADEMIC SOCIAL NETWORKSAPPLICATION OF CLUSTERING TO ANALYZE ACADEMIC SOCIAL NETWORKS
APPLICATION OF CLUSTERING TO ANALYZE ACADEMIC SOCIAL NETWORKSIJwest
 

Was ist angesagt? (8)

A LINK-BASED APPROACH TO ENTITY RESOLUTION IN SOCIAL NETWORKS
A LINK-BASED APPROACH TO ENTITY RESOLUTION IN SOCIAL NETWORKSA LINK-BASED APPROACH TO ENTITY RESOLUTION IN SOCIAL NETWORKS
A LINK-BASED APPROACH TO ENTITY RESOLUTION IN SOCIAL NETWORKS
 
Geo community-based broadcasting for data dissemination in mobile social netw...
Geo community-based broadcasting for data dissemination in mobile social netw...Geo community-based broadcasting for data dissemination in mobile social netw...
Geo community-based broadcasting for data dissemination in mobile social netw...
 
Exploring Social Media with NodeXL
Exploring Social Media with NodeXL Exploring Social Media with NodeXL
Exploring Social Media with NodeXL
 
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...
 
tubes_final
tubes_finaltubes_final
tubes_final
 
IT6701 Information Management - Unit I
IT6701 Information Management - Unit I  IT6701 Information Management - Unit I
IT6701 Information Management - Unit I
 
APPLICATION OF CLUSTERING TO ANALYZE ACADEMIC SOCIAL NETWORKS
APPLICATION OF CLUSTERING TO ANALYZE ACADEMIC SOCIAL NETWORKSAPPLICATION OF CLUSTERING TO ANALYZE ACADEMIC SOCIAL NETWORKS
APPLICATION OF CLUSTERING TO ANALYZE ACADEMIC SOCIAL NETWORKS
 
ECCS 2010
ECCS 2010ECCS 2010
ECCS 2010
 

Andere mochten auch

Andere mochten auch (18)

Venture capital investment
Venture capital investmentVenture capital investment
Venture capital investment
 
Venture capital
Venture capitalVenture capital
Venture capital
 
Reporte del clima estados de méxico
Reporte del clima estados de méxicoReporte del clima estados de méxico
Reporte del clima estados de méxico
 
Selecting financial strategies
Selecting financial strategiesSelecting financial strategies
Selecting financial strategies
 
strategic financial management
strategic financial managementstrategic financial management
strategic financial management
 
Venture capital
Venture capitalVenture capital
Venture capital
 
Venture capital
Venture capitalVenture capital
Venture capital
 
Venture Capital
Venture CapitalVenture Capital
Venture Capital
 
Venture capital
Venture capitalVenture capital
Venture capital
 
Introduction to Venture Capital
Introduction to Venture CapitalIntroduction to Venture Capital
Introduction to Venture Capital
 
Venture capital power point presentation
Venture capital power point presentationVenture capital power point presentation
Venture capital power point presentation
 
Venture capital
Venture capital Venture capital
Venture capital
 
Venture capital presentation
Venture capital presentationVenture capital presentation
Venture capital presentation
 
Venture capital ppt
Venture capital pptVenture capital ppt
Venture capital ppt
 
Financial strategy
Financial strategyFinancial strategy
Financial strategy
 
What is venture capital & venture capital in india
What is venture capital & venture capital in indiaWhat is venture capital & venture capital in india
What is venture capital & venture capital in india
 
Strategic financial management
Strategic financial managementStrategic financial management
Strategic financial management
 
Strategic financial management
Strategic financial managementStrategic financial management
Strategic financial management
 

Ähnlich wie Measurement and modeling of the web and related data sets

2010 06-08 chania stochastic web modelling - copy
2010 06-08 chania stochastic web modelling - copy2010 06-08 chania stochastic web modelling - copy
2010 06-08 chania stochastic web modelling - copyvafopoulos
 
Knowledge graphs on the Web
Knowledge graphs on the WebKnowledge graphs on the Web
Knowledge graphs on the WebArmin Haller
 
Web Mining Presentation Final
Web Mining Presentation FinalWeb Mining Presentation Final
Web Mining Presentation FinalEr. Jagrat Gupta
 
P118 gummadi
P118 gummadiP118 gummadi
P118 gummadifoufa31
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 
The Hidden Web, XML and the Semantic Web: A Scientific Data Management Perspe...
The Hidden Web, XML and the Semantic Web: A Scientific Data Management Perspe...The Hidden Web, XML and the Semantic Web: A Scientific Data Management Perspe...
The Hidden Web, XML and the Semantic Web: A Scientific Data Management Perspe...Dr. Aparna Varde
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1Geoffrey Fox
 
Challenges in end-to-end performance
Challenges in end-to-end performanceChallenges in end-to-end performance
Challenges in end-to-end performanceJisc
 
DITA's New Thang: Going Mapless!
DITA's New Thang: Going Mapless!DITA's New Thang: Going Mapless!
DITA's New Thang: Going Mapless!dclsocialmedia
 
B036407011
B036407011B036407011
B036407011theijes
 
Analyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudAnalyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudRobert Dempsey
 
Search Interfaces on the Web: Querying and Characterizing, PhD dissertation
Search Interfaces on the Web: Querying and Characterizing, PhD dissertationSearch Interfaces on the Web: Querying and Characterizing, PhD dissertation
Search Interfaces on the Web: Querying and Characterizing, PhD dissertationDenis Shestakov
 
Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Geoffrey Fox
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpagescsandit
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpagescsandit
 
Graph Structure In The Web
Graph Structure In The WebGraph Structure In The Web
Graph Structure In The Webdailyye
 

Ähnlich wie Measurement and modeling of the web and related data sets (20)

Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
2010 06-08 chania stochastic web modelling - copy
2010 06-08 chania stochastic web modelling - copy2010 06-08 chania stochastic web modelling - copy
2010 06-08 chania stochastic web modelling - copy
 
F14 lec12graphs
F14 lec12graphsF14 lec12graphs
F14 lec12graphs
 
Knowledge graphs on the Web
Knowledge graphs on the WebKnowledge graphs on the Web
Knowledge graphs on the Web
 
Web Mining Presentation Final
Web Mining Presentation FinalWeb Mining Presentation Final
Web Mining Presentation Final
 
P118 gummadi
P118 gummadiP118 gummadi
P118 gummadi
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
The Hidden Web, XML and the Semantic Web: A Scientific Data Management Perspe...
The Hidden Web, XML and the Semantic Web: A Scientific Data Management Perspe...The Hidden Web, XML and the Semantic Web: A Scientific Data Management Perspe...
The Hidden Web, XML and the Semantic Web: A Scientific Data Management Perspe...
 
Network Science: Theory, Modeling and Applications
Network Science: Theory, Modeling and ApplicationsNetwork Science: Theory, Modeling and Applications
Network Science: Theory, Modeling and Applications
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1
 
Challenges in end-to-end performance
Challenges in end-to-end performanceChallenges in end-to-end performance
Challenges in end-to-end performance
 
DITA's New Thang: Going Mapless!
DITA's New Thang: Going Mapless!DITA's New Thang: Going Mapless!
DITA's New Thang: Going Mapless!
 
B036407011
B036407011B036407011
B036407011
 
Analyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudAnalyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The Cloud
 
Search Interfaces on the Web: Querying and Characterizing, PhD dissertation
Search Interfaces on the Web: Querying and Characterizing, PhD dissertationSearch Interfaces on the Web: Querying and Characterizing, PhD dissertation
Search Interfaces on the Web: Querying and Characterizing, PhD dissertation
 
Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpages
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpages
 
Graph Structure In The Web
Graph Structure In The WebGraph Structure In The Web
Graph Structure In The Web
 

Mehr von Mark J. Feldman

The Role of Venture Capital in the US Economy
The Role of Venture Capital in the US EconomyThe Role of Venture Capital in the US Economy
The Role of Venture Capital in the US EconomyMark J. Feldman
 
Venture Capital Deal Terms
Venture Capital Deal TermsVenture Capital Deal Terms
Venture Capital Deal TermsMark J. Feldman
 
How Venture Capitalist (VC) Firms Screen Deals
How Venture Capitalist (VC) Firms Screen DealsHow Venture Capitalist (VC) Firms Screen Deals
How Venture Capitalist (VC) Firms Screen DealsMark J. Feldman
 
Massachusetts - Israel Cleantech Opportunities
Massachusetts - Israel Cleantech OpportunitiesMassachusetts - Israel Cleantech Opportunities
Massachusetts - Israel Cleantech OpportunitiesMark J. Feldman
 
The CleanTech Market Opportunity
The CleanTech Market OpportunityThe CleanTech Market Opportunity
The CleanTech Market OpportunityMark J. Feldman
 
Inside Google's Search Algorythm! (by Google Researchers)
Inside Google's Search Algorythm! (by Google Researchers)Inside Google's Search Algorythm! (by Google Researchers)
Inside Google's Search Algorythm! (by Google Researchers)Mark J. Feldman
 
Small Cap Value Equity Pitchbook
Small Cap Value Equity PitchbookSmall Cap Value Equity Pitchbook
Small Cap Value Equity PitchbookMark J. Feldman
 
COMMUNITY DEVELOPMENT FINANCIAL INSTITUTIONS: Considerations for Securitizati...
COMMUNITY DEVELOPMENT FINANCIAL INSTITUTIONS: Considerations for Securitizati...COMMUNITY DEVELOPMENT FINANCIAL INSTITUTIONS: Considerations for Securitizati...
COMMUNITY DEVELOPMENT FINANCIAL INSTITUTIONS: Considerations for Securitizati...Mark J. Feldman
 
Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Oper...
Cruz:Application-Transparent Distributed Checkpoint-Restart on Standard Oper...Cruz:Application-Transparent Distributed Checkpoint-Restart on Standard Oper...
Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Oper...Mark J. Feldman
 
Oracle 10g Application Server
Oracle 10g Application ServerOracle 10g Application Server
Oracle 10g Application ServerMark J. Feldman
 
Choosing The Right Enterprise Antispyware Solution
Choosing The Right Enterprise Antispyware SolutionChoosing The Right Enterprise Antispyware Solution
Choosing The Right Enterprise Antispyware SolutionMark J. Feldman
 
Surveillance for the Olympic games in Athens, 2004
Surveillance for the Olympic games in Athens, 2004Surveillance for the Olympic games in Athens, 2004
Surveillance for the Olympic games in Athens, 2004Mark J. Feldman
 
Googlebase Information Pack for MLSs and MLS vendors
Googlebase Information Pack for MLSs and MLS vendorsGooglebase Information Pack for MLSs and MLS vendors
Googlebase Information Pack for MLSs and MLS vendorsMark J. Feldman
 
Beginners Guide To Venture Capital
Beginners Guide To Venture CapitalBeginners Guide To Venture Capital
Beginners Guide To Venture CapitalMark J. Feldman
 
II Security At Microsoft
II Security At MicrosoftII Security At Microsoft
II Security At MicrosoftMark J. Feldman
 
McDonald's Worldwide Corporate Responsibility Report
McDonald's Worldwide Corporate Responsibility ReportMcDonald's Worldwide Corporate Responsibility Report
McDonald's Worldwide Corporate Responsibility ReportMark J. Feldman
 
Email Marketing Tips and Tricks
Email Marketing Tips and TricksEmail Marketing Tips and Tricks
Email Marketing Tips and TricksMark J. Feldman
 
Email Marketing: Expand Your Reach, Grow Your Business
Email Marketing: Expand Your Reach, Grow Your BusinessEmail Marketing: Expand Your Reach, Grow Your Business
Email Marketing: Expand Your Reach, Grow Your BusinessMark J. Feldman
 

Mehr von Mark J. Feldman (20)

The Role of Venture Capital in the US Economy
The Role of Venture Capital in the US EconomyThe Role of Venture Capital in the US Economy
The Role of Venture Capital in the US Economy
 
Venture Capital Deal Terms
Venture Capital Deal TermsVenture Capital Deal Terms
Venture Capital Deal Terms
 
How Venture Capitalist (VC) Firms Screen Deals
How Venture Capitalist (VC) Firms Screen DealsHow Venture Capitalist (VC) Firms Screen Deals
How Venture Capitalist (VC) Firms Screen Deals
 
Massachusetts - Israel Cleantech Opportunities
Massachusetts - Israel Cleantech OpportunitiesMassachusetts - Israel Cleantech Opportunities
Massachusetts - Israel Cleantech Opportunities
 
The CleanTech Market Opportunity
The CleanTech Market OpportunityThe CleanTech Market Opportunity
The CleanTech Market Opportunity
 
Inside Google's Search Algorythm! (by Google Researchers)
Inside Google's Search Algorythm! (by Google Researchers)Inside Google's Search Algorythm! (by Google Researchers)
Inside Google's Search Algorythm! (by Google Researchers)
 
Email Marketing 101
Email Marketing 101Email Marketing 101
Email Marketing 101
 
Small Cap Value Equity Pitchbook
Small Cap Value Equity PitchbookSmall Cap Value Equity Pitchbook
Small Cap Value Equity Pitchbook
 
COMMUNITY DEVELOPMENT FINANCIAL INSTITUTIONS: Considerations for Securitizati...
COMMUNITY DEVELOPMENT FINANCIAL INSTITUTIONS: Considerations for Securitizati...COMMUNITY DEVELOPMENT FINANCIAL INSTITUTIONS: Considerations for Securitizati...
COMMUNITY DEVELOPMENT FINANCIAL INSTITUTIONS: Considerations for Securitizati...
 
Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Oper...
Cruz:Application-Transparent Distributed Checkpoint-Restart on Standard Oper...Cruz:Application-Transparent Distributed Checkpoint-Restart on Standard Oper...
Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Oper...
 
Oracle 10g Application Server
Oracle 10g Application ServerOracle 10g Application Server
Oracle 10g Application Server
 
Choosing The Right Enterprise Antispyware Solution
Choosing The Right Enterprise Antispyware SolutionChoosing The Right Enterprise Antispyware Solution
Choosing The Right Enterprise Antispyware Solution
 
Surveillance for the Olympic games in Athens, 2004
Surveillance for the Olympic games in Athens, 2004Surveillance for the Olympic games in Athens, 2004
Surveillance for the Olympic games in Athens, 2004
 
Googlebase Information Pack for MLSs and MLS vendors
Googlebase Information Pack for MLSs and MLS vendorsGooglebase Information Pack for MLSs and MLS vendors
Googlebase Information Pack for MLSs and MLS vendors
 
Beginners Guide To Venture Capital
Beginners Guide To Venture CapitalBeginners Guide To Venture Capital
Beginners Guide To Venture Capital
 
II Security At Microsoft
II Security At MicrosoftII Security At Microsoft
II Security At Microsoft
 
McDonald's Worldwide Corporate Responsibility Report
McDonald's Worldwide Corporate Responsibility ReportMcDonald's Worldwide Corporate Responsibility Report
McDonald's Worldwide Corporate Responsibility Report
 
Sub Prime Explanation
Sub Prime ExplanationSub Prime Explanation
Sub Prime Explanation
 
Email Marketing Tips and Tricks
Email Marketing Tips and TricksEmail Marketing Tips and Tricks
Email Marketing Tips and Tricks
 
Email Marketing: Expand Your Reach, Grow Your Business
Email Marketing: Expand Your Reach, Grow Your BusinessEmail Marketing: Expand Your Reach, Grow Your Business
Email Marketing: Expand Your Reach, Grow Your Business
 

Kürzlich hochgeladen

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 

Kürzlich hochgeladen (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

Measurement and modeling of the web and related data sets

  • 1. IMA Tutorial (part II): Measurement and modeling of the web and related data sets Andrew Tomkins IBM Almaden Research Center May 5, 2003 Title slide
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11. A Picture of (~200M) pages.
  • 12.
  • 13.
  • 14. Analysis of power law Pr [ page has k inlinks ] =~ k -2.1 Pr [ page has > k inlinks ] =~ 1/ k Pr [ page has k outlinks ] =~ k -2.7 Corollary:
  • 15.
  • 16.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. The Navigational Backbone Each TUC contains a large SCC that is well-connected to the SCCs of other TUCs
  • 25. Overview WWW Distill KB1 KB2 KB3 Goal: Create higher-level "knowledge bases" of web information for further processing. [Kumar, Raghavan, Rajagopalan, Tomkins 1999]
  • 26.
  • 27.
  • 28. Web Communities Fishing Outdoor Magazine Bill's Fishing Resources Linux Linux Links LDP Different communities appear to have very different structure.
  • 29. Web Communities Fishing Outdoor Magazine Bill's Fishing Resources Linux Linux Links LDP But both contain a common “footprint”: two pages ( ) that both Point to three other pages in common ( )
  • 30. Communities and cores Example K 2,3 Definition: A "core" K ij consists of i left nodes, j right nodes, and all left->right edges. Critical facts: 1. Almost all communities contain a core [expected] 2. Almost all cores betoken a community [unexpected]
  • 31. Other footprint structures Newsgroup thread Web ring Corporate partnership Intranet fragment
  • 32.
  • 33. Enumerating cores a a belongs to a K 2,3 if and only if some node points to b1, b2, b3. b2 b1 b3 Inclusion/Exclusion Pruning Clean data by removing: mirrors (true and approximate) empty pages, too-popular pages, nepotistic pages Preprocessing When no more pruning is possible, finish using database techniques Postprocessing
  • 34. Results for cores 3 5 7 9 0 20 40 60 80 100 Thousands i=3 i=4 i=5 i=6 Number of cores found by Elimination/Generation 3 5 7 9 0 20 40 60 80 Thousands i=3 i=4 Number of cores found during postprocessing
  • 35.
  • 36.
  • 37.
  • 38. A word on evolution
  • 39.
  • 40. Example Time I’ve been thinking about your idea with the asparagus… Uh huh I think I see… Uh huh Yeah, that’s what I’m saying… So then I said “Hey, let’s give it a try” And anyway she said maybe, okay? Most likely “hidden” sequence: 0.005 1 2 0.01 State 1: Output rate: very low State 2: Output rate: very high Pr[2] ~ 10 Pr[2] ~ 10 Pr[2] ~ 7 Pr[2] ~ 2 Pr[2] ~ 5 Pr[2] ~ 2 Pr[2] ~ 5 Pr[1] ~ 2 Pr[1] ~ 1 Pr[1] ~ 2 Pr[1] ~ 10 Pr[1] ~ 5 Pr[1] ~ 10 Pr[1] ~ 1 2 2 2 1 1 1 1
  • 41.
  • 42. Integrating bursts and graph analysis Wired magazine publishes an article on weblogs that impacts the tech community Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption [KNRT03] Number of communities identified automatically as exhibiting “bursty” behavior – measure of cohesiveness of the blogspace Number of blog pages that belong to a community Number of blog communities
  • 43. IMA Tutorial (part III): Generative and probabilistic models of data May 5, 2003 Title slide
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50.
  • 51.
  • 52.
  • 53.
  • 54. Constructing a book: snapshot at time t When in the course of human events, it becomes necessary… Current word frequencies: Let f(i,t) be the number of words of count i at time t Count Word Rank 11,325 4,791 … 3 2 1 “ ...” “ ...” 5 “ necessary” 1 “ neccesary” … “ ...” 300 “ from” 600 “ of” 1000 “ the”
  • 55.
  • 56. Constructing a book: snapshot at time t Current word frequencies: Let f(i,t) be the number of words of count i at time t Pr[“the”] = (1-  ) 1000 / K Pr[“of”] = (1-  ) 600 / K Pr[some count-1 word] = (1-  ) 1 * f(1,t) / K K =  if(i,t) Count Word Rank 11,325 4,791 … 3 2 1 “ ...” “ ...” 5 “ necessary” 1 “ neccesary” … “ ...” 300 “ from” 600 “ of” 1000 “ the”
  • 57. What’s going on? One unique word (which occurs 1 or more times) 1 2 3 4 5 6 Each word in bucket i occurs i times in the current document … .
  • 58. What’s going on? 1 With probability  a new word is introduced into the text 2 3 4 5 6
  • 59. What’s going on? 1 4 How many times do words in this bucket occur? With probability 1-  an existing word is reused 2 3 5 6
  • 60. What’s going on? 2 3 4 Size of bucket 3 at time t+1 depends only on sizes of buckets 2 and 3 at time t ? ? Must show: fraction of balls in 3 rd bucket approaches some limiting value
  • 61.
  • 62.
  • 63. Random graph models G(n,p) Web indeg > 1000 k23's 4-cliques 0 0 0 100000 125000 many Traditional random graphs [Bollobas 85] are not like the web! Is there a better model?
  • 64.
  • 65.
  • 66.
  • 67.
  • 68. Example New node arrives With probability  , it links to a uniformly-chosen page
  • 69. Example To copy, it first chooses a page uniformly Then chooses a uniform out-edge from that page Then links to the destination of that edge ("copies" the edge) Under copying, your rate of getting new inlinks is proportional to your in-degree. With probability (1-  ), it decides to copy a link.
  • 70. Degree sequences in this model Pr[page has k inlinks] =~ k Heavy-tailed inverse polynomial degree sequences. Pages like netscape and yahoo exist. Many cores, cliques, and other dense subgraphs (  = 1/11 matches web) -(2-  ) (1-  )
  • 71.
  • 72.
  • 73.
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.