Relating Web Characteristics
         Ricardo Baeza-Yates
            Carlos Castillo
         Universidad de Chile
Agenda
• Introduction
• Link-based ranking
• Web structure
• Web characteristics
• Web usage
• Web dynamics
• Conclusions

Introduction: Sample
• Web sample: .CL domain in the year 2000
• 670,000 pages in 7,500 domains
• 15 KB average page size
• Collection from the TodoCL Web search engine

Introduction: Emphasis

• Broder et al.: Graph Structure in the Web (2000)
  – Page-based structure based on strongly connected components
  – The Web graph is not a random graph
  – Process: cut & paste model
• Ours is mostly a site-based analysis
  – Trying to make Web structure meaningful

Introduction: The Empire




Introduction: One Map




Link ranking: Pagerank

$$\mathrm{Pagerank}(p) = \frac{q}{N} + (1 - q) \sum_{i=1}^{k} \mathrm{Pagerank}(r_i)$$

• r_1, ..., r_k: pages that point to page p
• q/N: probability of a random jump (q) over the number of pages (N)
• Brin & Page, 1998
• Currently used by Google

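The Pagerank recurrence on this slide can be approximated by simple iteration. The sketch below is only an illustration under stated assumptions: the formula as extracted sums Pagerank(r_i) directly, whereas this code divides each contribution by the out-degree of r_i, as in the commonly used formulation. The value q = 0.15, the even spreading of rank from pages without out-links, the function name `pagerank` and the toy graph are all hypothetical choices, not taken from the slides.

```python
def pagerank(links, q=0.15, iterations=50):
    """Iterate the Pagerank recurrence on a small link graph.

    links maps each page to the list of pages it points to.
    q is the random-jump probability (0.15 is an assumed value).
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: rank held by pages without out-links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: q / n + (1 - q) * dangling / n for p in pages}
        for src in pages:
            targets = links.get(src, [])
            for dst in targets:
                # Assumption: each in-link contributes rank / out-degree.
                new_rank[dst] += (1 - q) * rank[src] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy graph: page "d" has no in-links, page "c" has three.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_links)
```

With this normalization the scores sum to 1, and a page without in-links such as "d" keeps only the random-jump share q/N.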
Link ranking: Hubs & Authorities
• HITS algorithm (Kleinberg, 1998)
• A good authority is a page pointed to by good hubs, so we assume that it has good content
• A good hub is a page that points to good authorities, so we assume it is a good set of links
• Linear system calculated by numerical iteration

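The mutual definition of hubs and authorities above can be sketched as a numerical iteration. This is a minimal illustration, not the authors' implementation: it runs on a whole toy graph rather than on a query-specific subgraph, normalizes with the Euclidean norm after each step, and uses hypothetical names (`hits`, `toy_links`).

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a small link graph.

    links maps each page to the list of pages it points to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages pointing to the page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical toy graph: h1 and h2 are link lists, a1 and a2 are content pages.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(toy_links)
```

In this toy graph "a1" ends up with the highest authority score because both hubs point to it, and "h1" is the better hub because it points to both authorities.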
Link ranking: Distribution
• <2% with relevant Pagerank
• 9% with relevant hub score
• 2-3% with relevant authority score

Link ranking: Correlation
• Hub score, authority score and Pagerank do not seem to be correlated

Link ranking: Sites

• Which measure to use for sites?
• Average score
  – But good sites can have lots of bad pages
• Maximum score
  – But one good page is not enough to make a good site
• Sum of the scores of all pages
  – Natural for Pagerank

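The three candidate site measures listed above can be compared side by side. The sketch below assumes that a "site" is identified by the host name of each URL, which the slides do not specify; the function name `site_scores`, the URLs and the score values are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_scores(page_scores):
    """Aggregate per-page scores into per-site average, maximum and sum."""
    by_site = defaultdict(list)
    for url, score in page_scores.items():
        # Assumption: the host name identifies the site.
        by_site[urlsplit(url).netloc].append(score)
    return {
        site: {
            "average": sum(scores) / len(scores),
            "maximum": max(scores),
            "sum": sum(scores),
        }
        for site, scores in by_site.items()
    }


# Hypothetical page scores for two sites.
toy_scores = {
    "http://a.example.cl/": 0.5,
    "http://a.example.cl/x": 0.25,
    "http://a.example.cl/y": 0.25,
    "http://b.example.cl/": 0.5,
}
result = site_scores(toy_scores)
```

Here both sites share the same maximum, the one-page site has the higher average, and the three-page site has the higher sum, which illustrates why the choice of measure matters.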
Link ranking: Sites Graph
• 90% relevant site-Pagerank
• It’s harder to have a good hub than a good authority (site)

Web Structure: Basis
• The Web graph has structure, with four components: MAIN, IN, OUT and ISLANDS

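One way to compute such a decomposition is with two reachability searches from a node known to lie in MAIN. The slide names the components but does not define them, so this sketch assumes the definitions used in bow-tie analyses such as Broder et al.: MAIN is the strongly connected component of the seed, IN can reach MAIN but is not reachable from it, OUT is reachable from MAIN but cannot reach it, and ISLANDS here collects everything else. The names `reachable`, `bow_tie` and the toy graph are hypothetical.

```python
def reachable(start, edges):
    """Return all nodes reachable from the set `start` by following `edges`."""
    seen = set(start)
    stack = list(start)
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(links, seed):
    """Split a link graph into MAIN, IN, OUT and ISLANDS.

    Assumption: `seed` is a node known to belong to MAIN.
    """
    nodes = set(links) | {t for ts in links.values() for t in ts}
    reverse = {}
    for src, targets in links.items():
        for dst in targets:
            reverse.setdefault(dst, []).append(src)
    forward = reachable({seed}, links)      # nodes the seed can reach
    backward = reachable({seed}, reverse)   # nodes that can reach the seed
    main = forward & backward
    out = forward - main
    in_ = backward - main
    islands = nodes - main - in_ - out      # assumption: all remaining nodes
    return {"MAIN": main, "IN": in_, "OUT": out, "ISLANDS": islands}


# Hypothetical toy graph of sites.
toy_links = {
    "i": ["m1"],
    "m1": ["m2"],
    "m2": ["m1", "o"],
    "o": [],
    "x": ["y"],
    "y": [],
}
parts = bow_tie(toy_links, seed="m1")
```

In the toy graph, "m1" and "m2" link to each other and form MAIN, "i" falls in IN, "o" in OUT, and the disconnected pair "x", "y" in ISLANDS.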
Web Structure: Basis (cont.)
• The MAIN component has structure, with four parts lying between IN and OUT: MAIN IN, MAIN OUT, MAIN MAIN and MAIN NORM

Web Structure: Sketch




Web Structure: Degree




Web Structure: Sizes




Web Structure: Preferences




Web Structure: Preferences (cont.)
[Figure: three panels labeled Real, ODP and TodoCL; the visible segment labels are OUT, MAIN OUT and MAIN MAIN]

Web Structure: Various




Web Structure: Link Scores




Web Dynamics: Ages
• The kernel of the Web comes from the past

Web Dynamics: By Component

Web Dynamics: Pagerank
• Pagerank is biased against newer pages

Web Dynamics: Hubs & Authorities
[Figure: authority score and hub score plotted against age (months)]

Conclusions
• Pagerank/HITS do not seem to be correlated
  – And Pagerank is biased toward older pages
• Site ranking can help to make good human-selected directories
• Finding good pages is not so simple
• Characterizing Web structure gives valuable insight
  – Web Graph Mining is just starting

Relating Key Web Characteristics Such as Structure, Link Ranking and Dynamics
