IMA Tutorial (part II): Measurement and modeling of the web and related data sets. Andrew Tomkins, IBM Almaden Research Center, May 5, 2003.
Setup
Context
Focus Areas
One view of the Internet: Inter-Domain Connectivity. Figure labels: Core; Shells 1, 2, 3. [Tauro, Palmer, Siganos, Faloutsos, 2001 Global Internet]
Another view of the web: the hyperlink graph
Getting started – structure at the hyperlink level [Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, Tomkins, Wiener, 2001]
Terminology
Data
Breadth-first search from random starts
A Picture of (~200M) pages.
Some distance measurements
Facts (about the crawl). The distribution of indegrees on the web is given by a power law: a heavy-tailed distribution, with many high-indegree pages (e.g., Yahoo).
Analysis of power law. $\Pr[\text{page has } k \text{ inlinks}] \approx k^{-2.1}$. Corollary: $\Pr[\text{page has more than } k \text{ inlinks}] \approx 1/k$. $\Pr[\text{page has } k \text{ outlinks}] \approx k^{-2.7}$.
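The corollary on this slide can be checked numerically. The sketch below assumes an idealized in-degree distribution with Pr[k] exactly proportional to k^-2.1, truncated at a hypothetical cutoff; the function name and the cutoff are illustrative and not taken from the slides.

```python
import math

def tail_probability(k, exponent=2.1, cutoff=200_000):
    """Pr[X > k] when Pr[X = j] is proportional to j**(-exponent), j = 1..cutoff.

    The cutoff is an assumption made only so the sums are finite.
    """
    norm = sum(j ** -exponent for j in range(1, cutoff + 1))
    tail = sum(j ** -exponent for j in range(k + 1, cutoff + 1))
    return tail / norm

def tail_exponent(k_small=10, k_large=100):
    """Estimate b in Pr[X > k] ~ k**(-b) from two points on a log-log plot."""
    ratio = tail_probability(k_small) / tail_probability(k_large)
    return math.log(ratio) / math.log(k_large / k_small)
```

Under these assumptions the estimated tail exponent comes out close to 1.1, which is the sense in which the tail behaves roughly like 1/k.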
Component sizes.
Other observed power laws in the web [Faloutsos, Faloutsos, Faloutsos 99] [Bharat, Chang, Henzinger, Ruhl 02]
More Characterization: Self-Similarity
Ways to Slice the Web. We call these slices “Thematically Unified Communities”, or TUCs.
Self-Similarity on the Web
In particular…
Is this surprising?
A structural explanation
The Navigational Backbone: each TUC contains a large SCC that is well-connected to the SCCs of other TUCs.
Information Extraction from Large Graphs
Overview. Diagram labels: WWW, Distill, KB1, KB2, KB3. Goal: create higher-level "knowledge bases" of web information for further processing. [Kumar, Raghavan, Rajagopalan, Tomkins 1999]
Many approaches to this problem
General approach
Web Communities. Figure labels: Fishing, Outdoor Magazine, Bill's Fishing Resources; Linux, Linux Links, LDP. Different communities appear to have very different structure.
Web Communities. Figure labels: Fishing, Outdoor Magazine, Bill's Fishing Resources; Linux, Linux Links, LDP. But both contain a common “footprint”: two pages that both point to three other pages in common.
Communities and cores. Example: $K_{2,3}$. Definition: a "core" $K_{i,j}$ consists of $i$ left nodes, $j$ right nodes, and all left->right edges. Critical facts: 1. Almost all communities contain a core [expected]. 2. Almost all cores betoken a community [unexpected].
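The core definition can be made concrete with a small sketch. The brute-force search below only makes sense for tiny graphs and is not the enumeration method of the tutorial; the page names in the usage note are hypothetical.

```python
from itertools import combinations

def is_core(edges, left, right):
    """True if every left node links to every right node."""
    return all((u, v) in edges for u in left for v in right)

def find_cores(edges, i, j):
    """Brute-force list of (left, right) pairs forming a K_{i,j}.

    edges is a set of (source, destination) pairs.
    """
    out = {}
    for u, v in edges:
        out.setdefault(u, set()).add(v)
    cores = []
    for left in combinations(sorted(out), i):
        # Right candidates are the pages that every left node points to.
        common = set.intersection(*(out[u] for u in left))
        for right in combinations(sorted(common), j):
            cores.append((left, right))
    return cores
```

For instance, if two hypothetical pages `"outdoor_magazine"` and `"bills_fishing"` both link to `"p1"`, `"p2"`, and `"p3"`, then `find_cores(edges, 2, 3)` reports that single K 2,3.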
Other footprint structures: newsgroup thread, web ring, corporate partnership, intranet fragment.
Subgraph enumeration
Enumerating cores. Preprocessing: clean data by removing mirrors (true and approximate), empty pages, too-popular pages, and nepotistic pages. Inclusion/exclusion pruning: a belongs to a $K_{2,3}$ if and only if some node points to b1, b2, b3 [figure: nodes a, b1, b2, b3]. Postprocessing: when no more pruning is possible, finish using database techniques.
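One simple form of pruning can be sketched as follows. This is an assumption about what a pruning pass might look like (degree-based elimination), not necessarily the exact inclusion/exclusion step on the slide: an edge can take part in a K i,j only if its source has at least j out-links and its destination has at least i in-links, and removing edges can make further edges ineligible.

```python
def prune_for_cores(edges, i, j):
    """Repeatedly drop edges that cannot take part in any K_{i,j}.

    An edge (u, v) survives a round only if u still has at least j
    out-links and v still has at least i in-links. Stops when a round
    removes nothing.
    """
    edges = set(edges)
    while True:
        outdeg, indeg = {}, {}
        for u, v in edges:
            outdeg[u] = outdeg.get(u, 0) + 1
            indeg[v] = indeg.get(v, 0) + 1
        kept = {(u, v) for u, v in edges
                if outdeg[u] >= j and indeg[v] >= i}
        if kept == edges:
            return kept
        edges = kept
```

Whatever survives is only a candidate set; an exact enumeration step is still needed afterwards.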
Results for cores. [Figure, two panels, vertical axes in thousands: number of cores found by Elimination/Generation, for i = 3, 4, 5, 6; number of cores found during postprocessing, for i = 3, 4.]
The cores are interesting. (1) Implicit communities are defined by cores. (2) There are an order of magnitude more of these ($10^{5+}$). (3) Can grow the core to the community using further processing. Explicit communities. Implicit communities.
Elementary Schools in Japan
So…
A word on evolution
A word on evolution [Kleinberg02]
Example. [Figure: a timeline of utterances in a conversation: “I’ve been thinking about your idea with the asparagus…”, “Uh huh”, “I think I see…”, “Uh huh”, “Yeah, that’s what I’m saying…”, “So then I said ‘Hey, let’s give it a try’”, “And anyway she said maybe, okay?”] Two-state model: State 1 has a very low output rate; State 2 has a very high output rate; the edges between states 1 and 2 are labeled 0.005 and 0.01. Per-utterance values: Pr[2] ~ 10, 10, 7, 2, 5, 2, 5; Pr[1] ~ 2, 1, 2, 10, 5, 10, 1. Most likely “hidden” sequence: 2 2 2 1 1 1 1.
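Recovering a most likely hidden sequence from a two-state model of this kind is typically done with Viterbi decoding. The sketch below is illustrative only: it assumes exponentially distributed gaps between messages, a single symmetric switching probability, and made-up rates, none of which come from the slide.

```python
import math

def most_likely_states(gaps, rates=(0.1, 1.0), switch_prob=0.01):
    """Viterbi decoding for a two-state model of gaps between messages.

    State 1 emits gaps at a low rate, state 2 at a high rate
    (rates and switch_prob are hypothetical values).
    Returns a list of state labels, 1 or 2, one per gap.
    """
    def log_emit(state, gap):
        r = rates[state]
        return math.log(r) - r * gap      # log of exponential density

    log_stay = math.log(1 - switch_prob)
    log_switch = math.log(switch_prob)
    score = [math.log(0.5) + log_emit(s, gaps[0]) for s in (0, 1)]
    back = []
    for gap in gaps[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            stay = score[s] + log_stay
            switch = score[1 - s] + log_switch
            pointers.append(s if stay >= switch else 1 - s)
            new_score.append(max(stay, switch) + log_emit(s, gap))
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return [s + 1 for s in reversed(path)]
```

With five short gaps followed by five long ones, the decoded sequence stays in the high-rate state during the burst and then switches once to the low-rate state.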
More bursts
Integrating bursts and graph analysis. Wired magazine publishes an article on weblogs that impacts the tech community. Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption. [KNRT03] Plot labels: number of communities identified automatically as exhibiting “bursty” behavior (a measure of cohesiveness of the blogspace); number of blog pages that belong to a community; number of blog communities.
IMA Tutorial (part III): Generative and probabilistic models of data. May 5, 2003.
Probabilistic generative models
Models for Power Laws
An Introduction to the Power Law. Figure labels: exponentially-decaying distribution; power law distribution.
Early Observations: Pareto on Income
Early Observations: Yule/Zipf
Early Observations: Lotka on Citations
Ranks versus Values
Equivalence of rank versus value formulation [Bookstein90, Adamic99]
Early modeling work
A model of Simon
Constructing a book: snapshot at time t. Text so far: “When in the course of human events, it becomes necessary…” Current word frequencies: let f(i,t) be the number of words of count i at time t. Word table (rank: word, count): 1: “the”, 1000; 2: “of”, 600; 3: “from”, 300; …; 4,791: “necessary”, 5; …; 11,325: “neccesary”, 1.
The Generative Model
Constructing a book: snapshot at time t. Current word frequencies: let f(i,t) be the number of words of count i at time t. $\Pr[\text{“the”}] = (1-\alpha) \cdot 1000 / K$; $\Pr[\text{“of”}] = (1-\alpha) \cdot 600 / K$; $\Pr[\text{some count-1 word}] = (1-\alpha) \cdot 1 \cdot f(1,t) / K$; where $K = \sum_i i \, f(i,t)$. Word table (rank: word, count): 1: “the”, 1000; 2: “of”, 600; 3: “from”, 300; …; 4,791: “necessary”, 5; …; 11,325: “neccesary”, 1.
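The word-choice rule on this slide can be simulated directly. In the sketch below the new-word probability is called `alpha` (an assumed name and value), words are represented by integer ids, and the text starts from a single word; these choices are illustrative rather than taken from the slides.

```python
import random
from collections import Counter

def simon_text(n_words, alpha=0.1, seed=0):
    """Generate a list of word ids under the create-or-reuse rule."""
    rng = random.Random(seed)
    text = [0]                       # start with a single word (assumption)
    next_id = 1
    for _ in range(n_words - 1):
        if rng.random() < alpha:
            text.append(next_id)     # brand-new word
            next_id += 1
        else:
            # Copying a uniformly chosen earlier position reuses each
            # word with probability proportional to its current count.
            text.append(rng.choice(text))
    return text

def reuse_probability(count, total, alpha):
    """(1 - alpha) * count / K, where K is the number of words so far."""
    return (1 - alpha) * count / total

def bucket_sizes(text):
    """f(i): the number of distinct words that occur exactly i times."""
    return Counter(Counter(text).values())
```

Summing `i * f(i)` over the buckets returned by `bucket_sizes` gives back the length of the text, which is the normalizer K in the probabilities above.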
What’s going on? [Figure: buckets 1, 2, 3, 4, 5, 6, …] Each ball is one unique word (which occurs 1 or more times). Each word in bucket i occurs i times in the current document.
What’s going on? With probability $\alpha$ a new word is introduced into the text. [Figure: buckets 1, 2, 3, 4, 5, 6]
What’s going on? With probability $1-\alpha$ an existing word is reused. How many times do words in this bucket occur? [Figure: buckets 1, 2, 3, 4, 5, 6]
What’s going on? Size of bucket 3 at time t+1 depends only on sizes of buckets 2 and 3 at time t. Must show: fraction of balls in 3rd bucket approaches some limiting value. [Figure: buckets 2, 3, 4]
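The dependence of bucket 3 on buckets 2 and 3 can be written out as an expected one-step update. This is a sketch derived from the reuse rule (an existing word is reused with probability proportional to its count), writing the new-word probability as $\alpha$, an assumed symbol. A reused word from bucket 2 moves into bucket 3, and a reused word from bucket 3 moves out to bucket 4:

```latex
% K = \sum_i i f(i,t) is the number of words written so far.
\begin{align*}
\mathbb{E}\bigl[f(3,t+1) \mid f(\cdot,t)\bigr]
  &= f(3,t) + (1-\alpha)\,\frac{2\,f(2,t) - 3\,f(3,t)}{K},
  \qquad K = \sum_i i\,f(i,t).
\end{align*}
```

The term $2 f(2,t)/K$ is the chance that the reused word currently has count 2, and $3 f(3,t)/K$ is the chance that it currently has count 3.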
Models for power laws in the web graph
Why create such a model?
Random graph models. Counts in G(n,p) versus the web: pages with indeg > 1000: 0 versus 100000; $K_{2,3}$'s: 0 versus 125000; 4-cliques: 0 versus many. Traditional random graphs [Bollobas 85] are not like the web! Is there a better model?
Desiderata for a graph model
Page creation on the web. Model idea: new pages add links by "copying" them from existing pages.
Generally, would require…
A specific model
Example. New node arrives. With probability $\alpha$, it links to a uniformly-chosen page.
Example. With probability $(1-\alpha)$, it decides to copy a link. To copy, it first chooses a page uniformly, then chooses a uniform out-edge from that page, then links to the destination of that edge ("copies" the edge). Under copying, your rate of getting new inlinks is proportional to your in-degree.
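The two steps of this example can be simulated in a few lines. The sketch below makes simplifying assumptions that are not stated on the slides: every page has exactly one out-link (so choosing a uniform out-edge is trivial), the first page links to itself to seed the process, and the uniform-link probability is named `alpha`.

```python
import random

def copying_model(n_nodes, alpha=1 / 11, seed=0):
    """Grow a graph where each new page adds one link, by choice or by copying.

    Returns (out_link, indegree): out_link[v] is the single page v links to.
    """
    rng = random.Random(seed)
    out_link = [0]                        # page 0 links to itself (assumption)
    for new in range(1, n_nodes):
        page = rng.randrange(new)         # uniformly chosen existing page
        if rng.random() < alpha:
            out_link.append(page)         # link directly to that page
        else:
            out_link.append(out_link[page])   # copy that page's out-link
    indegree = [0] * n_nodes
    for dst in out_link:
        indegree[dst] += 1
    return out_link, indegree
```

Because a copied link lands on a page in proportion to how many links already point to it, a few pages in a run of this sketch end up with very large in-degree while most pages have none.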
Degree sequences in this model. $\Pr[\text{page has } k \text{ inlinks}] \approx k^{-(2-\alpha)/(1-\alpha)}$ ($\alpha = 1/11$ matches web). Heavy-tailed inverse polynomial degree sequences. Pages like netscape and yahoo exist. Many cores, cliques, and other dense subgraphs.
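As a quick arithmetic check, reading the exponent on this slide as (2 - alpha)/(1 - alpha), with `alpha` an assumed name for the slide's parameter, the value 1/11 gives exactly the in-link exponent 2.1 quoted in the measurement part of the tutorial:

```python
from fractions import Fraction

def indegree_exponent(alpha):
    """Exponent b in Pr[page has k inlinks] ~ k**(-b) for the copying model."""
    return (2 - alpha) / (1 - alpha)

print(indegree_exponent(Fraction(1, 11)))   # prints 21/10
```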
Model extensions
A model of Mandelbrot
Discussion of Mandelbrot’s model
Heuristically Optimized Trade-offs [Fabrikant, Koutsoupias, Papadimitriou 2002]
Monkeys on Typewriters
Other Distributions
Quick characterization of lognormal distributions
One final direction…
