WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)
1. Outline WIRE Project Web Crawler Conclusions
WIRE: an Open Source Web Information Retrieval Environment
Carlos Castillo and Ricardo Baeza-Yates
Center for Web Research
http://www.cwr.cl/
CS Dept., University of Chile
OSWIR 2005
Compiegne, France
September 19, 2005
Carlos Castillo and Ricardo Baeza-Yates Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
Outline
1 WIRE Project
2 Web Crawler
3 Conclusions
Motivation
Study subsets of the Web (1-50 million pages)
- We want high performance
- We want to keep as much data as possible
- We want to study scheduling algorithms
- wget is not enough
- Large-scale crawlers were not publicly available
General Architecture
[Diagram: WIRE components: Crawling, Focused Crawling, Importing, Collection, Text Index, Text Search, XML Index, XML Search, Statistics, Extracting, Clustering, Classification]
Characteristics
- Roughly 25,000 lines of open-source C/C++ code
- Asynchronous DNS and HTTP requests; small memory and processing requirements (except during the analysis)
- Highly configurable: rate of download, parser parameters, scheduling policy, etc.
Web Crawler
[Diagram: crawler modules arranged around the Collection]
- Manager: page score calculations, long-term scheduling
- Harvester: short-term scheduling, network transfers
- Gatherer: parsing, link extraction
- Seeder: link resolving, robots exclusions
Scheduling
Profit = Future Value - Current Value
Page  quality  freshness  visited?  Future Value  Current Value  Profit
P1    0.4      0.1        1         0.4           0.04           0.36
P2    0.7      0.9        1         0.7           0.63           0.07
P3    0.6      -          0         0.6           0              0.6
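The three example pages can be reproduced with a short Python sketch. It assumes, as the numbers on this slide suggest, that the current value of a visited page is quality times freshness, that its future value after a download is its quality, and that a page never visited has a current value of 0. The `Page` class and the function names are hypothetical and are not taken from the WIRE code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    name: str
    quality: float
    freshness: Optional[float]  # None when the page has never been visited
    visited: bool


def current_value(page: Page) -> float:
    """Value of the copy held now; 0 if nothing is held (assumption)."""
    if not page.visited or page.freshness is None:
        return 0.0
    return page.quality * page.freshness


def future_value(page: Page) -> float:
    """Value right after a download: a fresh copy is worth its quality."""
    return page.quality


def profit(page: Page) -> float:
    """Gain obtained by downloading the page now."""
    return future_value(page) - current_value(page)


pages = [
    Page("P1", quality=0.4, freshness=0.1, visited=True),
    Page("P2", quality=0.7, freshness=0.9, visited=True),
    Page("P3", quality=0.6, freshness=None, visited=False),
]

# Schedule the most profitable downloads first.
ranked = sorted(pages, key=profit, reverse=True)
for p in ranked:
    print(p.name, round(profit(p), 2))
```

Under these assumptions the unvisited page P3 is ranked first, followed by the stale page P1, while the still-fresh P2 comes last.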
Downloading pages
[Diagram: the World Wide Web shown as Web sites S1 to S7, each with its own list of Web pages Pi,j (page j of site i); the sites hold different numbers of pages]
Storing contents
[Diagram, steps 1 to 3: the Document is passed through hash( ) to a "Content seen?" check; Disk Storage is managed with a Free space list]
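A minimal Python sketch of this idea follows: a document is hashed, a "content seen" table rejects duplicates, and space released by deleted documents is recycled through a free list. The class name `ContentStore`, the use of SHA-1, and the in-memory list standing in for disk storage are all assumptions made for illustration; the slide does not say how WIRE implements these steps.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


class ContentStore:
    """Toy document store: duplicates are detected by content hash and
    slots of deleted documents are recycled through a free list."""

    def __init__(self) -> None:
        self.seen: Dict[str, int] = {}          # content digest -> slot
        self.slots: List[Optional[bytes]] = []  # stand-in for disk storage
        self.free: List[int] = []               # slots available for reuse

    def store(self, content: bytes) -> Tuple[int, bool]:
        """Return (slot, stored); stored is False for duplicate content."""
        digest = hashlib.sha1(content).hexdigest()
        if digest in self.seen:
            return self.seen[digest], False
        if self.free:
            slot = self.free.pop()
            self.slots[slot] = content
        else:
            slot = len(self.slots)
            self.slots.append(content)
        self.seen[digest] = slot
        return slot, True

    def remove(self, slot: int) -> None:
        """Delete a document and put its slot on the free list."""
        content = self.slots[slot]
        if content is None:
            return
        del self.seen[hashlib.sha1(content).hexdigest()]
        self.slots[slot] = None
        self.free.append(slot)


store = ContentStore()
first = store.store(b"<html>hello</html>")
duplicate = store.store(b"<html>hello</html>")  # same bytes, not stored again
store.remove(first[0])
reused = store.store(b"<html>other</html>")     # takes the freed slot
```

A real store would track byte ranges on disk instead of list indices, but the control flow (hash, check, then allocate from the free list) is the same.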
URL parsing
http://host.domain.com/dir/file.html
1. h1('host.domain.com')
2. host.domain.com -> 235
3. h2('235 dir/file.html')
4. 235 dir/file.html -> 9421
SITE-ID = 235; DOC-ID = 9421
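The two-level lookup can be sketched in Python as follows. Python dictionaries play the role of the hash tables addressed by h1 and h2, and ids are handed out sequentially; both choices, along with the `UrlIndex` name, are assumptions made for illustration and not the WIRE implementation.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit


class UrlIndex:
    """Assigns small integer ids to hosts and to (site, path) pairs."""

    def __init__(self) -> None:
        self.site_ids: Dict[str, int] = {}             # host -> SITE-ID
        self.doc_ids: Dict[Tuple[int, str], int] = {}  # (SITE-ID, path) -> DOC-ID

    def resolve(self, url: str) -> Tuple[int, int]:
        """Return (SITE-ID, DOC-ID), creating ids for unseen hosts and paths."""
        parts = urlsplit(url)
        host = parts.netloc.lower()      # host names are case-insensitive
        path = parts.path.lstrip("/")    # query strings are ignored here
        # Steps 1-2: look the host up, assigning a new SITE-ID if unseen.
        site_id = self.site_ids.setdefault(host, len(self.site_ids) + 1)
        # Steps 3-4: look the path up under that site, assigning a DOC-ID.
        doc_id = self.doc_ids.setdefault((site_id, path), len(self.doc_ids) + 1)
        return site_id, doc_id


index = UrlIndex()
a = index.resolve("http://host.domain.com/dir/file.html")
b = index.resolve("http://host.domain.com/dir/other.html")
c = index.resolve("http://HOST.domain.com/dir/file.html")  # same page as a
```

Keying the second table on the SITE-ID rather than on the full host name keeps the stored keys short, which is one plausible reason for resolving the site first.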
Practical problems
The devil is in the details:
- Varying quality of service
- Wrong DNS records, temporary DNS failures
- HTTP responses without headers, or with wrong headers or dates
- HTML parsing has to be very tolerant
- Duplicate pages, session-ids, etc.
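As one illustration of the session-id problem, the Python sketch below removes session identifiers from a URL so that two visits to the same page yield the same key. The list of parameter names in `SESSION_PARAMS` and the function name are hypothetical; the slide does not describe how WIRE handles this case.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters that usually carry session ids.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def strip_session_id(url: str) -> str:
    """Return url without session-id parameters and without its fragment."""
    parts = urlsplit(url)
    # Java servers often append ";jsessionid=..." to the path itself.
    path = parts.path.split(";jsessionid=")[0]
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), ""))


first_visit = strip_session_id("http://example.com/a.php?PHPSESSID=abc123&page=2")
second_visit = strip_session_id("http://example.com/a.php?PHPSESSID=zzz999&page=2")
java_style = strip_session_id("http://example.com/shop;jsessionid=XYZ?item=7")
```

Normalizing URLs this way only catches duplicates whose session id is visible in the URL; pages that differ only in embedded session data still need a content-level check.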
Data analysis
- Includes link analysis and extraction of statistics (data is exported as .csv files)
- Reports are generated using LaTeX and gnuplot
- Report about documents: histograms of size, in- and out-degree, link scores, page depth, HTTP responses, age, media types, etc.
- Report about sites: degree distribution in the hostgraph, maximum depth, pages per site, link structure, etc.
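To make the kind of statistic listed here concrete, the Python sketch below computes in- and out-degree histograms from a list of links and writes them as CSV text that a plotting tool could read. The link graph is made up, and the function names and column names are hypothetical; WIRE's own export format is not shown on the slide.

```python
import csv
import io
from collections import Counter
from typing import Iterable, List, Tuple


def degree_histogram(links: Iterable[Tuple[int, int]],
                     pages: Iterable[int]) -> List[Tuple[int, int, int]]:
    """Return rows (degree, pages with that in-degree, pages with that out-degree)."""
    links = list(links)
    pages = list(pages)
    in_deg = Counter(dst for _, dst in links)
    out_deg = Counter(src for src, _ in links)
    # A Counter returns 0 for pages that have no incoming or outgoing links.
    in_hist = Counter(in_deg[p] for p in pages)
    out_hist = Counter(out_deg[p] for p in pages)
    top = max(list(in_hist) + list(out_hist), default=0)
    return [(d, in_hist[d], out_hist[d]) for d in range(top + 1)]


def to_csv(rows: List[Tuple[int, int, int]]) -> str:
    """Serialize the histogram with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["degree", "pages_by_in_degree", "pages_by_out_degree"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up link graph: (source DOC-ID, destination DOC-ID).
links = [(1, 2), (1, 3), (2, 3), (4, 3)]
rows = degree_histogram(links, pages=[1, 2, 3, 4])
print(to_csv(rows))
```

The same approach applies to the hostgraph by replacing DOC-IDs with SITE-IDs in the link list.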
Conclusions
- A tool for Web characterization studies
- Can be extended for other purposes
- Code and documentation available at http://www.cwr.cl/projects/
Thank you.