Structured Information Retrieval has gained considerable interest in recent years, as structured information is becoming an invaluable asset for professional communities such as Software Engineering. Most research has focused on XML documents, with initiatives like INEX bringing together and evaluating new techniques for structured information. Although XML is the immediate choice, the Web is filled with many other types of structured information, accounting for millions of additional documents. These documents may be collected directly using standard Web search engines like Google and Yahoo, or by following specific link patterns in online repositories like SourceForge. This demo describes a distributed, focused web crawler for any kind of structured document, and shows how to exploit general-purpose resources to gather large amounts of real-world structured documents off the Web. Such a tool could help build large test collections of other document types, such as Java source code for software-oriented search engines or RDF for semantic search.
Crawling the Web for Structured Documents
Julián Urbano, Juan Lloréns, Yorgos Andreadakis and Mónica Marrero
University Carlos III of Madrid · Department of Computer Science
Motivation
Structured Information Retrieval is gaining a lot of interest recently
Almost all research is focused just on XML documents, with initiatives like INEX
But what about other types of documents, like SQL, DTD, Java source code, RDF or UML?
How can we easily gather real-world structured documents off the Web?
And can we use them to develop collections and search engines for specific structured information?
Ask General-Purpose Web Search Engines
Follow Link Patterns in Web Repositories
Type   Google (P@20)   Yahoo (P@20)
XML    25M  (0.85)     238K (0.8)
DTD    48K  (0.95)     48K  (1)
XSD    134K (1)        181K (1)
SQL    104K (1)        152K (0.95)
JAVA   3M   (1)        1.6M (1)
Are there really that many documents?
Not everything is relevant (e.g. a hit for "SQL" that is not actually SQL)
So we have to develop filters, because:
• Query terms may appear only in non-relevant parts (e.g. comments)
• Many problems with MIME types
• Hierarchical file types (an XSD is also XML)
Also, search engines return only about the first 1000 results…
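The filters above can be sketched with a couple of content checks. This is a minimal illustration with made-up heuristics, not the system's actual rules:

```python
import re

# Hypothetical content filters for candidate files returned by a search
# engine. The heuristics are illustrative assumptions only.

def looks_like_sql(text: str) -> bool:
    """Keep only scripts that actually define tables, rejecting files
    that merely mention SQL in comments or prose."""
    return re.search(r"\bCREATE\s+TABLE\b", text, re.IGNORECASE) is not None

def looks_like_xsd(text: str) -> bool:
    """An XSD is also XML, so check for the XML Schema namespace
    before classifying a document as plain XML."""
    return "http://www.w3.org/2001/XMLSchema" in text

print(looks_like_sql("-- blog post about SQL tuning"))   # → False
print(looks_like_sql("CREATE TABLE account (id INT);"))  # → True
print(looks_like_xsd(
    "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'/>"))  # → True
```

A MIME-type check alone is not enough, since servers often mislabel these files; content-based checks like these catch such cases.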
[Architecture diagram: seed query terms (e.g. account, bank, deposit) are submitted; URLs plus additional info are stored and the files downloaded. Multiple Crawler nodes, each with its own configuration, run a Scheduler and per-type Processors (HTML, SQL, Java, XML, …).]
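The multi-threaded design can be sketched roughly as follows. All names here are illustrative assumptions, not the actual .NET implementation: a Scheduler hands out URLs, and crawler threads pick a Processor by file type:

```python
import queue
import threading

# Illustrative sketch: a Scheduler holds pending URLs; crawler threads
# drain it and dispatch each URL to a processor chosen by extension.

class Scheduler:
    def __init__(self):
        self.pending = queue.Queue()

    def add(self, url):
        self.pending.put(url)

    def next(self):
        try:
            return self.pending.get_nowait()
        except queue.Empty:
            return None  # nothing left for this thread

def crawl(scheduler, processors, results, lock):
    while (url := scheduler.next()) is not None:
        ext = url.rsplit(".", 1)[-1].lower()
        handler = processors.get(ext)
        if handler:
            with lock:
                results.append(handler(url))

# Stand-in processors; the real ones parse and index file contents.
processors = {"sql": lambda u: ("sql", u), "java": lambda u: ("java", u)}
sched = Scheduler()
for u in ["http://example.org/a.sql", "http://example.org/B.java"]:
    sched.add(u)

results, lock = [], threading.Lock()
threads = [threading.Thread(target=crawl, args=(sched, processors, results, lock))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

In the real system the scheduler also coordinates multiple machines, not just threads, and state is persisted in SQL Server Express rather than an in-memory queue.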
How It Works
Built on the Microsoft .NET Framework and the free SQL Server Express
Collaborative, multi-computer, multi-threaded with hot plug-in
Core detached from the GUI, can be used programmatically
New file types and meta-data can be added on-the-fly with no effort
What do Processors do?
One processor per file type:
1. Define the additional info we want for these files (e.g. number of FK definitions, target DBMS)
2. Filter files (e.g. discard an SQL script without table definitions)
3. Process files (e.g. parse the SQL script and index the table names, fields and relationships)
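The three responsibilities above suggest a simple per-type contract. This is a hypothetical sketch of such an interface, with a toy SQL processor; class and method names are assumptions:

```python
import re
from abc import ABC, abstractmethod

class Processor(ABC):
    """Hypothetical per-file-type contract: metadata, filter, process."""

    @abstractmethod
    def metadata(self, text: str) -> dict: ...

    @abstractmethod
    def accepts(self, text: str) -> bool: ...

    @abstractmethod
    def process(self, text: str) -> dict: ...

class SqlProcessor(Processor):
    def metadata(self, text):
        # e.g. number of foreign-key definitions
        return {"fk_count": text.upper().count("FOREIGN KEY")}

    def accepts(self, text):
        # discard SQL scripts with no table definitions
        return "CREATE TABLE" in text.upper()

    def process(self, text):
        # index table names (a real parser would extract fields and
        # relationships as well)
        tables = re.findall(r"CREATE\s+TABLE\s+(\w+)", text, re.IGNORECASE)
        return {"tables": tables, **self.metadata(text)}

script = ("CREATE TABLE account (id INT); "
          "CREATE TABLE deposit (id INT, acc INT, "
          "FOREIGN KEY (acc) REFERENCES account(id));")
p = SqlProcessor()
if p.accepts(script):
    print(p.process(script))  # tables: ['account', 'deposit'], fk_count: 1
```

New file types would then plug in by registering another `Processor` subclass, matching the "added on-the-fly" design described above.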
Intelligent HTML Processor
Configured per domain:
• Discover URLs: collected in the DB and downloaded
• Follow URLs: just navigated through (no need to download everything)
URL patterns defined in terms of:
• The actual links in webpages
• The HTML structure of webpages
Highly customizable (see back page)
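A per-domain discover/follow configuration can be sketched with link patterns. The patterns below mimic a generic repository layout and are purely illustrative, not an actual SourceForge configuration:

```python
import re

# Hypothetical per-domain configuration for the HTML processor:
# "follow" patterns are only navigated through, while "discover"
# patterns are collected in the DB and downloaded.

FOLLOW = [re.compile(r"/projects/[^/]+/?$"),
          re.compile(r"/projects/[^/]+/files/?$")]
DISCOVER = [re.compile(r"/files/.+\.(sql|java|xsd)$")]

def classify(url: str) -> str:
    if any(p.search(url) for p in DISCOVER):
        return "discover"
    if any(p.search(url) for p in FOLLOW):
        return "follow"
    return "ignore"

print(classify("http://repo.example.org/projects/bank"))               # → follow
print(classify("http://repo.example.org/projects/bank/files/db.sql"))  # → discover
print(classify("http://repo.example.org/about"))                       # → ignore
```

Patterns over the HTML structure of a page (e.g. only links inside a file-listing table) could complement these URL-level rules, as the text above suggests.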