Structured Information Retrieval has gained considerable interest in recent years, as structured information is becoming an invaluable asset for professional communities such as Software Engineering. Most research has focused on XML documents, with initiatives like INEX bringing together and evaluating new techniques for structured information. Although XML is the immediate choice, the Web is filled with many other types of structured information, accounting for millions of additional documents. These documents may be collected directly using standard Web search engines like Google and Yahoo, or by following specific link patterns in online repositories like SourceForge. This demo describes a distributed, focused web crawler for any kind of structured document, and shows how to exploit general-purpose resources to gather large amounts of real-world structured documents off the Web. Such a tool could help build large test collections of other document types, such as Java source code for software-oriented search engines or RDF for semantic search.
Crawling the Web for Structured Documents
Julián Urbano, Juan Lloréns, Yorgos Andreadakis and Mónica Marrero
University Carlos III of Madrid · Department of Computer Science
Motivation
Structured Information Retrieval is gaining a lot of interest recently
Almost all research is focused just on XML documents, with initiatives like INEX
But what about other types of documents, like SQL, DTD, Java source code, RDF or UML?
How can we easily gather real-world structured documents off the Web?
And can we use them to develop collections and search engines for specific structured information?
Ask General-Purpose Web Search Engines
Follow Link Patterns in Web Repositories
Type   Google (P@20)   Yahoo (P@20)
XML    25M  (0.85)     238K (0.8)
DTD    48K  (0.95)     48K  (1)
XSD    134K (1)        181K (1)
SQL    104K (1)        152K (0.95)
JAVA   3M   (1)        1.6M (1)
Are there really that many documents?
Not everything is relevant (e.g. a hit for "SQL" that is not actually SQL)
So we have to develop filters, because:
• Query terms may appear only in non-relevant parts (e.g. comments)
• Many problems with MIME types
• Hierarchical file types (an XSD is also XML)
Also, search engines return only about the first 1000 results…
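The filters above can be sketched with a couple of content checks. This is a minimal illustration with made-up heuristics, not the system's actual rules:

```python
import re

# Hypothetical content filters for candidate files returned by a search
# engine. The heuristics are illustrative assumptions only.

def looks_like_sql(text: str) -> bool:
    """Keep only scripts that actually define tables, rejecting files
    that merely mention SQL in comments or prose."""
    return re.search(r"\bCREATE\s+TABLE\b", text, re.IGNORECASE) is not None

def looks_like_xsd(text: str) -> bool:
    """An XSD is also XML, so check for the XML Schema namespace
    before classifying a document as plain XML."""
    return "http://www.w3.org/2001/XMLSchema" in text

print(looks_like_sql("-- blog post about SQL tuning"))   # → False
print(looks_like_sql("CREATE TABLE account (id INT);"))  # → True
print(looks_like_xsd(
    "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'/>"))  # → True
```

A MIME-type check alone is not enough, since servers often mislabel these files; content-based checks like these catch such cases.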
[Architecture diagram: seed query terms (e.g. account, bank, deposit) are submitted; URLs plus additional info are stored and the files downloaded. Multiple Crawler nodes, each with its own configuration, run a Scheduler and per-type Processors (HTML, SQL, Java, XML, …).]
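The multi-threaded design can be sketched roughly as follows. All names here are illustrative assumptions, not the actual .NET implementation: a Scheduler hands out URLs, and crawler threads pick a Processor by file type:

```python
import queue
import threading

# Illustrative sketch: a Scheduler holds pending URLs; crawler threads
# drain it and dispatch each URL to a processor chosen by extension.

class Scheduler:
    def __init__(self):
        self.pending = queue.Queue()

    def add(self, url):
        self.pending.put(url)

    def next(self):
        try:
            return self.pending.get_nowait()
        except queue.Empty:
            return None  # nothing left for this thread

def crawl(scheduler, processors, results, lock):
    while (url := scheduler.next()) is not None:
        ext = url.rsplit(".", 1)[-1].lower()
        handler = processors.get(ext)
        if handler:
            with lock:
                results.append(handler(url))

# Stand-in processors; the real ones parse and index file contents.
processors = {"sql": lambda u: ("sql", u), "java": lambda u: ("java", u)}
sched = Scheduler()
for u in ["http://example.org/a.sql", "http://example.org/B.java"]:
    sched.add(u)

results, lock = [], threading.Lock()
threads = [threading.Thread(target=crawl, args=(sched, processors, results, lock))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

In the real system the scheduler also coordinates multiple machines, not just threads, and state is persisted in SQL Server Express rather than an in-memory queue.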
How It Works
Built on the Microsoft .NET Framework and the free SQL Server Express
Collaborative, multi-computer, multi-threaded with hot plug-in
Core detached from the GUI, can be used programmatically
New file types and meta-data can be added on-the-fly with no effort
What do Processors do?
One processor per file type:
1. Define the additional info we want for these files (e.g. number of FK definitions, target DBMS)
2. Filter files (e.g. discard an SQL script without table definitions)
3. Process files (e.g. parse the SQL script and index the table names, fields and relationships)
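The three responsibilities above suggest a simple per-type contract. This is a hypothetical sketch of such an interface, with a toy SQL processor; class and method names are assumptions:

```python
import re
from abc import ABC, abstractmethod

class Processor(ABC):
    """Hypothetical per-file-type contract: metadata, filter, process."""

    @abstractmethod
    def metadata(self, text: str) -> dict: ...

    @abstractmethod
    def accepts(self, text: str) -> bool: ...

    @abstractmethod
    def process(self, text: str) -> dict: ...

class SqlProcessor(Processor):
    def metadata(self, text):
        # e.g. number of foreign-key definitions
        return {"fk_count": text.upper().count("FOREIGN KEY")}

    def accepts(self, text):
        # discard SQL scripts with no table definitions
        return "CREATE TABLE" in text.upper()

    def process(self, text):
        # index table names (a real parser would extract fields and
        # relationships as well)
        tables = re.findall(r"CREATE\s+TABLE\s+(\w+)", text, re.IGNORECASE)
        return {"tables": tables, **self.metadata(text)}

script = ("CREATE TABLE account (id INT); "
          "CREATE TABLE deposit (id INT, acc INT, "
          "FOREIGN KEY (acc) REFERENCES account(id));")
p = SqlProcessor()
if p.accepts(script):
    print(p.process(script))  # tables: ['account', 'deposit'], fk_count: 1
```

New file types would then plug in by registering another `Processor` subclass, matching the "added on-the-fly" design described above.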
Intelligent HTML Processor
Configured per domain:
• Discover URLs: collected in the DB and downloaded
• Follow URLs: just navigated through (no need to download everything)
URL patterns defined in terms of:
• The actual links in webpages
• The HTML structure of webpages
Highly customizable (see back page)
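A per-domain discover/follow configuration can be sketched with link patterns. The patterns below mimic a generic repository layout and are purely illustrative, not an actual SourceForge configuration:

```python
import re

# Hypothetical per-domain configuration for the HTML processor:
# "follow" patterns are only navigated through, while "discover"
# patterns are collected in the DB and downloaded.

FOLLOW = [re.compile(r"/projects/[^/]+/?$"),
          re.compile(r"/projects/[^/]+/files/?$")]
DISCOVER = [re.compile(r"/files/.+\.(sql|java|xsd)$")]

def classify(url: str) -> str:
    if any(p.search(url) for p in DISCOVER):
        return "discover"
    if any(p.search(url) for p in FOLLOW):
        return "follow"
    return "ignore"

print(classify("http://repo.example.org/projects/bank"))               # → follow
print(classify("http://repo.example.org/projects/bank/files/db.sql"))  # → discover
print(classify("http://repo.example.org/about"))                       # → ignore
```

Patterns over the HTML structure of a page (e.g. only links inside a file-listing table) could complement these URL-level rules, as the text above suggests.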