Sampling National Deep Web
Denis Shestakov, fname.lname at aalto.fi
Department of Media Technology, Aalto University

DEXA'11, Toulouse, France, 31.08.2011
Outline



● Background
● Our approach: Host-IP cluster random
  sampling
● Results
● Conclusions
Background

● Deep Web: web content behind search
  interfaces
● See example of interface (shown on the original slide)
● Main problem: hard to crawl, thus
  content poorly indexed and not
  available for search (hidden)
● Many research problems: roughly 150-200 works
  addressing certain aspects of the challenge
  (e.g., see 'Search interfaces on the Web: querying
  and characterizing', Shestakov, 2008)
● "Clearly, the science and practice of
  deep web crawling is in its
  infancy" (in 'Web crawling', Olston & Najork, 2010)
Background

● What is still unknown (surprisingly):
   ○ How large is the deep Web: how many deep web
     resources are there? How much content do they
     hold? What portion is indexed?
● So far, only a few studies have addressed this:
   ○ Bergman, 2001: number, amount of content
   ○ Chang et al., 2004: number, coverage
   ○ Shestakov et al., 2007: number
   ○ Chinese surveys: number
   ○ ....
Background

● All approaches used so far have serious limitations
● Basically, the idea behind estimating the number of
  deep web sites:
   ○ IP address random sampling method (proposed in
     1997)
   ○ Description: take the pool of all IP addresses (~3 billion
     currently in use), generate a random sample (~one
     million is enough), connect to each sampled address and,
     if it serves HTTP, crawl it and look for search interfaces
   ○ Count the search interfaces found in the sample and
     apply sampling math to get an estimate
   ○ One can restrict the pool to some segment of the Web
     (e.g., national): the pool then consists of national IP
     addresses only
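The IP random sampling idea above can be sketched as follows. This is a minimal illustration, not the procedure used in the cited studies: the CIDR blocks are reserved documentation ranges, the function names are made up for this example, and `count_search_interfaces` is a hypothetical stand-in for the real step of connecting over HTTP, crawling, and detecting search interfaces.

```python
import ipaddress
import random


def sample_addresses(cidr_blocks, n, rng):
    """Draw n distinct addresses uniformly from the union of CIDR blocks.

    Assumes the blocks do not overlap.
    """
    nets = [ipaddress.ip_network(block) for block in cidr_blocks]
    pool_size = sum(net.num_addresses for net in nets)
    sample = []
    for offset in rng.sample(range(pool_size), n):
        # Walk the blocks until the offset falls inside one of them.
        for net in nets:
            if offset < net.num_addresses:
                sample.append(net[offset])
                break
            offset -= net.num_addresses
    return sample, pool_size


def estimate_total(pool_size, sample_size, found_in_sample):
    """Scale the count found in the sample up to the whole pool."""
    return pool_size * found_in_sample / sample_size


def count_search_interfaces(address):
    """Hypothetical stand-in for 'connect, crawl, detect search interfaces'."""
    return 1 if int(address) % 8 == 0 else 0


rng = random.Random(2011)
blocks = ["192.0.2.0/24", "198.51.100.0/24"]  # illustrative pool only
sample, pool_size = sample_addresses(blocks, 64, rng)
found = sum(count_search_interfaces(a) for a in sample)
print(estimate_total(pool_size, len(sample), found))
```

Restricting the estimate to a national segment only changes the list of blocks passed in.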
Virtual Hosting

● Bottleneck: virtual hosting
● When only an IP is available, the URLs to crawl look
  like http://X.Y.Z.W -----> many web sites
  hosted on X.Y.Z.W are missed
● Examples:
    ○ OVH (hosting company): 65,000 servers host
      7,500,000
    ○ This survey: 670,000 hosts on 80,000 IP
      addresses
● You can't ignore it!
Host-IP cluster sampling

● What if a large list of hosts is available?
   ○ In fact, it is not trivial to get one, as such a list
     should cover a certain web segment well
● Host random sampling can be applied (Shestakov
  et al., 2007)
   ○ Works but with limitations
   ○ Bottleneck: host aliasing, i.e., different hostnames
     lead to the same web site
       ■ Hard to solve: need to crawl all hosts in the list
         (their start web pages)
● Idea: resolve all hosts to their IPs
Host-IP cluster sampling

● Resolve all hosts in the list to their IP addresses
   ○ A set of host-IP pairs
● Cluster hosts (pairs) by IP
   ○ IP1: host11,host12, host13, ...
   ○ IP2: host21,host22, host23, ...
   ○ ...
   ○ IPN: hostN1,hostN2, hostN3, ...
● Generate a random sample of IPs
● Analyze sampled IPs
   ○ E.g., if IP2 sampled then crawl host21,host22,
     host23, ...
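The clustering and sampling steps above can be sketched in a few lines. This is an illustration under assumptions: the function names and the example host-IP pairs are invented, and `resolve` simply wraps the standard library's `socket.gethostbyname`, which returns a single IPv4 address per host (a real survey would likely need bulk, fault-tolerant DNS resolution).

```python
import random
import socket
from collections import defaultdict


def resolve(host):
    """Resolve one hostname to an IPv4 address, or None on failure."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None


def cluster_by_ip(pairs):
    """Group (host, ip) pairs into {ip: set of hosts}."""
    clusters = defaultdict(set)
    for host, ip in pairs:
        clusters[ip].add(host)
    return dict(clusters)


def sample_clusters(clusters, n, rng):
    """Pick n IPs at random; every host of a picked IP gets crawled."""
    chosen = rng.sample(sorted(clusters), n)
    return {ip: sorted(clusters[ip]) for ip in chosen}


# Illustrative pairs; in practice they would come from resolve() over the host list.
pairs = [
    ("a.example", "192.0.2.1"),
    ("b.example", "192.0.2.1"),
    ("c.example", "192.0.2.2"),
    ("d.example", "192.0.2.3"),
]
clusters = cluster_by_ip(pairs)
to_crawl = sample_clusters(clusters, 2, random.Random(0))
for ip, hosts in to_crawl.items():
    print(ip, hosts)
```

Because the sampling unit is the IP rather than the host, aliases that resolve to the same address land in the same cluster and are examined together.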
Host-IP cluster sampling

● Analyze sampled IPs
   ○ E.g., if IP2 sampled then crawl host21,host22,
     host23, ...
   ○ While crawling, 'unknown' hosts (not in the list)
     may be found
       ■ Crawl only those that resolve either to IP2
         or to IPs that are not among the list's IPs
         (IP1, IP2, ..., IPN)

● Identify search interfaces
   ○ Filtering, machine learning, manual check
   ○ Out of scope here (see ref [14] in the paper)
● Apply sampling formulas (see Section 4.4
  of the paper)
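The rule for 'unknown' hosts and the final estimation step can be sketched as below. The slides do not reproduce the formulas from Section 4.4, so the estimator here is the standard textbook one for a simple random sample of clusters, offered as an assumption rather than as the paper's exact method. All names and the example counts are hypothetical.

```python
import math


def should_crawl(host_ip, sampled_ip, listed_ips):
    """Crawl a newly found host only if it resolves to the sampled IP
    or to an IP that does not appear in the list at all."""
    return host_ip == sampled_ip or host_ip not in listed_ips


def estimate_cluster_total(counts, num_clusters, z=1.96):
    """Estimate a population total from per-cluster counts.

    counts: deep web sites found on each sampled IP.
    num_clusters: number of IPs in the whole list (N).
    Returns (estimate, margin) for a normal-approximation interval;
    z=1.96 corresponds to roughly 95% confidence.
    """
    n = len(counts)
    mean = sum(counts) / n
    total = num_clusters * mean
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    fpc = 1 - n / num_clusters  # finite population correction
    std_err = num_clusters * math.sqrt(fpc * variance / n)
    return total, z * std_err


listed = {"192.0.2.1", "192.0.2.2", "192.0.2.3"}
print(should_crawl("192.0.2.2", "192.0.2.2", listed))  # same IP as the sampled one
print(should_crawl("192.0.2.3", "192.0.2.2", listed))  # belongs to another cluster
print(should_crawl("198.51.100.7", "192.0.2.2", listed))  # IP not in the list

estimate, margin = estimate_cluster_total([0, 1, 0, 3], num_clusters=40)
print(f"{estimate:.0f} +/- {margin:.0f}")
```

Skipping hosts that resolve to another listed IP keeps each site attributed to exactly one cluster, so it cannot be counted twice.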
Results

● Dataset:
   ○ ~670 thousand hostnames
   ○ Obtained from Yandex: good coverage of Russian
     Web as of 2006
   ○ Resolved to ~80 thousand unique IP addresses
   ○ 77.2% of hosts shared their IPs with at least 20
     other hosts (this shows the scale of virtual hosting)
● 1075 IPs sampled: 6237 hosts in the initial crawl
  seed
   ○ Enough if an estimate within +/-25% at 95%
     confidence is acceptable
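A quick arithmetic check of the dataset figures above, as a rough sketch: the list sizes are the approximate values from the slide (~670 thousand hosts, ~80 thousand IPs), so the ratios are approximate too.

```python
hosts_in_list = 670_000   # approximate, from the slide
ips_in_list = 80_000      # approximate, from the slide
sampled_ips = 1075
seed_hosts = 6237

sampling_fraction = sampled_ips / ips_in_list
hosts_per_ip = hosts_in_list / ips_in_list
hosts_per_sampled_ip = seed_hosts / sampled_ips

print(f"share of IPs sampled:      {sampling_fraction:.2%}")     # about 1.34%
print(f"hosts per IP in the list:  {hosts_per_ip:.1f}")          # about 8.4
print(f"hosts per sampled IP:      {hosts_per_sampled_ip:.1f}")  # about 5.8
```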
Results
Comparison: host-IP vs IP sampling

Conclusion: IP random sampling (used in previous deep
web characterization studies) applied to the same dataset
resulted in estimates that are 3.5 times smaller than
actual numbers (obtained by host-IP)
Conclusion

● Proposed Host-IP clustering technique
   ○ Superior to IP random sampling
● Accurately characterized a national web segment
   ○ As of 09/2006, 14,200+/-3800 deep web sites in
     Russian Web
● Estimates obtained by Chang et al. (ref [9] in the
  paper) are underestimates
● Planning to apply Host-IP to other datasets
   ○ Main challenge is to obtain a large list of hosts that
     reliably covers a certain web segment
● Contact me if interested in Host-IP pairs datasets
Thank you!
Questions?
