1. Sampling National Deep Web
Denis Shestakov, fname.lname at aalto.fi
Department of Media Technology, Aalto University
DEXA'11, Toulouse, France, 31.08.2011
3. Background
● Deep Web: web content behind search interfaces
● See the example of such an interface (shown on the slide)
● Main problem: hard to crawl, so the content is poorly indexed and not available for search (hidden)
● Many research problems: roughly 150-200 works address certain aspects of the challenge (e.g., see 'Search interfaces on the Web: querying and characterizing', Shestakov, 2008)
● "Clearly, the science and practice of deep web crawling is in its infancy" (in 'Web crawling', Olston & Najork, 2010)
4. Background
● What is still unknown (surprisingly):
○ How large is the deep Web: the number of deep web resources? the amount of content in them? what portion of it is indexed?
● So far only a few studies have addressed this:
○ Bergman, 2001: number, amount of content
○ Chang et al., 2004: number, coverage
○ Shestakov et al., 2007: number
○ Chinese surveys: number
○ ....
5. Background
● None of the approaches used so far is adequate
● Basically, the idea behind estimating the number of deep web sites:
○ IP address random sampling method (proposed in 1997)
○ Description: take the pool of all IP addresses (~3 billion currently in use), generate a random sample (~one million is enough), connect to each sampled address; if it serves HTTP, crawl it and look for search interfaces
○ Count the search interfaces found in the sample and apply sampling math to get an estimate
○ One can restrict to a segment of the Web (e.g., national): then the pool consists of national IP addresses only
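The scale-up step of the method can be sketched as follows; the pool and sample sizes below are the approximate figures from the slide, not exact values from the original study:

```python
# Sketch of the IP-address random sampling estimate: count search
# interfaces found in a random sample of IPs, then scale the count
# up to the whole pool. Sizes are the slide's rough figures.
POOL_SIZE = 3_000_000_000   # ~3 billion IP addresses in use
SAMPLE_SIZE = 1_000_000     # ~one million sampled addresses

def estimate_total(interfaces_in_sample: int,
                   sample_size: int = SAMPLE_SIZE,
                   pool_size: int = POOL_SIZE) -> float:
    """Scale the number of interfaces found in the sample to the pool."""
    return interfaces_in_sample * pool_size / sample_size

# E.g., 100 search interfaces found in the sample:
print(estimate_total(100))  # 300000.0
```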
6. Virtual Hosting
● Bottleneck: virtual hosting
● When only an IP is available, the URLs to crawl look like http://X.Y.Z.W, so many web sites hosted on X.Y.Z.W are missed
● Examples:
○ OVH (hosting company): 65,000 servers host 7,500,000 web sites
○ This survey: 670,000 hosts on 80,000 IP addresses
● You can't ignore it!
7. Host-IP cluster sampling
● What if a large list of hosts is available?
○ In fact, it is not trivial to obtain one, as such a list should cover a given web segment well
● Host random sampling can be applied (Shestakov
et al., 2007)
○ Works but with limitations
○ Bottleneck: host aliasing, i.e., different hostnames
lead to the same web site
■ Hard to solve: need to crawl all hosts in the list
(their start web pages)
● Idea: resolve all hosts to their IPs
8. Host-IP cluster sampling
● Resolve all hosts in the list to their IP addresses
○ A set of host-IP pairs
● Cluster hosts (pairs) by IP
○ IP1: host11,host12, host13, ...
○ IP2: host21,host22, host23, ...
○ ...
○ IPN: hostN1,hostN2, hostN3, ...
● Generate a random sample of IPs
● Analyze sampled IPs
○ E.g., if IP2 sampled then crawl host21,host22,
host23, ...
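The clustering and sampling steps above can be sketched as below. The hostnames and IPs are placeholders; a real implementation would resolve each host via DNS (e.g., `socket.gethostbyname`) instead of using a fixed table:

```python
import random
from collections import defaultdict

# Placeholder host-IP pairs standing in for real DNS resolution.
HOST_TO_IP = {
    "host11": "ip1", "host12": "ip1", "host13": "ip1",
    "host21": "ip2", "host22": "ip2",
    "host31": "ip3",
}

def cluster_by_ip(host_to_ip: dict) -> dict:
    """Group hosts by the IP address they resolve to."""
    clusters = defaultdict(list)
    for host, ip in host_to_ip.items():
        clusters[ip].append(host)
    return clusters

def sample_clusters(clusters: dict, k: int, seed: int = 0) -> dict:
    """Draw a random sample of k IPs; each sampled IP contributes
    all of its hosts to the crawl seed."""
    rng = random.Random(seed)
    sampled_ips = rng.sample(sorted(clusters), k)
    return {ip: clusters[ip] for ip in sampled_ips}

clusters = cluster_by_ip(HOST_TO_IP)
crawl_seed = sample_clusters(clusters, k=2)
```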
9. Host-IP cluster sampling
● Analyze sampled IPs
○ E.g., if IP2 is sampled, then crawl host21, host22, host23, ...
○ While crawling, 'unknown' hosts (not in the list) may be found
■ Crawl only those that resolve either to IP2 or to IPs that are not among the list's IPs (IP1, IP2, ..., IPN)
● Identify search interfaces
○ Filtering, machine learning, manual check
○ Out of the scope of this talk (see ref [14] in the paper)
● Apply sampling formulas (see Section 4.4 of the paper)
10. Results
● Dataset:
○ ~670 thousand hostnames
○ Obtained from Yandex: good coverage of the Russian Web as of 2006
○ Resolved to ~80 thousand unique IP addresses
○ 77.2% of hosts shared their IPs with at least 20 other hosts <-- the scale of virtual hosting
● 1075 IPs sampled, giving 6237 hosts in the initial crawl seed
○ Enough if one is satisfied with an estimate within +/-25% at 95% confidence
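The paper's exact formulas are in its Section 4.4; as a hedged sketch, a textbook one-stage cluster-sampling estimator (not necessarily identical to the paper's) would compute the total and a 95% margin from per-IP interface counts like this:

```python
import math
from statistics import mean, variance

def cluster_estimate(counts_per_sampled_ip, total_ips, z=1.96):
    """Textbook one-stage cluster-sampling estimate.

    counts_per_sampled_ip: search interfaces found on each sampled IP
    total_ips: size of the IP pool (N)
    z: 1.96 for a 95% confidence interval
    Returns (estimated total, margin of error).
    """
    n = len(counts_per_sampled_ip)
    total = total_ips * mean(counts_per_sampled_ip)  # N * sample mean
    # Standard error with finite population correction.
    se = total_ips * math.sqrt(
        variance(counts_per_sampled_ip) / n * (1 - n / total_ips))
    return total, z * se

# Toy per-IP counts, not data from the survey:
est, margin = cluster_estimate([0, 1, 0, 2, 0, 0, 1], total_ips=80_000)
```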
12. Comparison:
host-IP vs IP sampling
Conclusion: IP random sampling (used in previous deep web characterization studies), applied to the same dataset, produced estimates 3.5 times smaller than the actual numbers (obtained by host-IP).
13. Conclusion
● Proposed the Host-IP clustering technique
○ Superior to IP random sampling
● Accurately characterized a national web segment
○ As of 09/2006, 14,200 +/- 3,800 deep web sites in the Russian Web
● Estimates obtained by Chang et al. (ref [9] in the paper) are underestimates
● Planning to apply Host-IP to other datasets
○ The main challenge is obtaining a large list of hosts that reliably covers a given web segment
● Contact me if interested in the Host-IP pairs datasets