Proactive Approach to Estimate the Re-crawl Period for Resource Minimization in Information Retrieval
Emmanuel Rajarathnam¹ and K Arthi²
¹ Research Scholar, Veltech University, Avadi, Chennai, Tamil Nadu, India.
² Professor, Veltech University, Avadi, Chennai, Tamil Nadu, India.
e_rajarathnam@rediffmail.com, arthi@veltechuniv.edu.in
ABSTRACT
A news search engine is a software system designed to search news articles published on various news sites. We develop a near real-time news search engine for the Marathi and Hindi languages using the open-source crawler Nutch and the indexer Solr. In this work, we specifically address the question of what architectural changes must be made in Nutch and Solr to build a news search system that demands continuous updating of its crawl and index. We list the challenges faced in maintaining a near real-time index for news search and provide solutions for them. Our customized open-source framework can be adapted to a news search solution for any language.
We describe a methodology for maintaining freshness of the crawl by creating an incremental crawling architecture, and show how this architecture maintains crawl freshness through a case study of Marathi and Hindi news search. We propose a method of prioritizing URLs for fetching in a resource-constrained scenario, and describe a technique to score a URL before it is fetched, based on its own features and on features of the parent URLs that were crawled previously.
KEYWORDS
Re-crawl, fetch-time estimation, proactive approach
1. INTRODUCTION
Crawling is the process of collecting documents from the web and indexing them for retrieval. The World Wide Web holds a huge collection of documents that must be processed and stored for a generic web search. Other search systems exist, viz. enterprise search, focused search, domain search, etc., that concentrate on a particular section of the web. The crawling infrastructure required to process the entire WWW is so large that small-scale and academic organizations cannot afford it. In most academic and small-scale organizations, the crawling infrastructure is constrained by the availability of resources such as processing power, bandwidth, and memory.
Even though open-source crawling frameworks help in crawling the entire web, resource constraints limit the number of documents that can be crawled. Resource constraints demand three major tasks for efficient retrieval:
1. Limiting the maximum number of URLs crawled at each depth
2. URL prioritization
3. URL scheduling
In the absence of resource constraints, none of the above is required, as there are enough resources to crawl any number of URLs at a time. Resource constraints force the crawler to fetch a limited number of URLs at each depth. Hence, the URLs to crawl at each depth must be chosen carefully. It is also highly recommended to decide the re-crawl period carefully so as not to overload the crawler. This situation is prevalent in a news search engine. A news search engine is a software system designed to search news articles published on various sites. News gets added almost every minute (perhaps more often) on different sites, and it is cumbersome to manually go through each newspaper site and browse for relevant news. To obtain news relevant to the user's information need from different sources, a news search engine is required. Any news search engine is designed to fulfil three core objectives:
1. Satisfying the information need of the user: This forms the primary objective of any search engine. The query can concern news in a particular location or a recent event. A news search engine should be able to retrieve news articles relevant to the user's information need and rank them based on relevance and published date and time.
2. One-stop solution for all news articles: News articles are available in different categories such as sports, business, general, etc. A news search engine acts as a single repository of news from various sources and categories, enabling the user to look for relevant news in a single location without manually browsing each site.
3. Multiple perspectives on the same news: News can be looked at from different perspectives, and presenting all these views to the user is important. For example, one newspaper's author may be biased towards a particular political party and praise a move that other authors criticize.
In this work, we provide a solution for resource minimization in crawling by proactively choosing the re-crawl period for every document in the search index. We also evaluate the proposed method by building a news search engine for Marathi and Hindi.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 8, August 2017
362 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
2. RELATED WORK
While incremental crawling and scheduling help in keeping the crawl up to date, they require a lot of resources, which may not be available in a small-scale organization. In small organizations and academic institutions, the available resources are limited, which limits the throughput of the crawler. In such a scenario, important URLs must be crawled first. Scheduling policies help in optimizing the bandwidth required for crawling by increasing the crawl period of pages that are not updated frequently. However, even among URLs with high update frequencies, we need a priority on which URLs to fetch first.
A lot of work exists on classifying URLs based on page content. Page-content-based classification can help in deciding the next fetch interval of a URL. However, classifying URLs before fetching is required for prioritizing URLs at the current depth.
There are many research contributions on prioritizing URLs during fetching. Some use purely URL-based features, while others use parent information, contextual information, etc. We review a few of them.
Min-Yen Kan (2004) quantified the performance of web page classification using only URL features (URL text) against anchor text, title text, and page text, and showed that URL features, when treated correctly, exceed the performance of some source-document-based features.
Min-Yen Kan et al. (2005) added the URL features of component length, content, orthography, token sequence, and precedence to model URLs. The resulting features, used in supervised maximum-entropy modelling, significantly improve over existing URL features.
Fish search (Bra et al., 1994), one of the first dynamic web search algorithms, takes as input a seed URL and a search query, and dynamically builds a priority list (initialized to the seed URL) of the next URLs (hereafter called nodes) to be explored. As each document's text becomes available, it is analysed by a scoring component that evaluates whether it is relevant or irrelevant to the search query (a 1/0 value); based on that score, a heuristic decides whether or not to pursue exploration in that direction.
The shark-search algorithm (Hersovic et al., 1998) is a more aggressive variant: instead of a binary evaluation of document relevance, it returns a score between 0 and 1 to evaluate the relevance of documents to a given query, which has a direct impact on the priority list. Shark search calculates the potential score of child nodes not only by propagating ancestral relevance scores down the hierarchy, but also by making use of the meta-information contained in the links to documents.
Jamali et al. (2006) used link-structure analysis together with the similarity of the page context to determine the download priority of pages, while Xu & Zuo (2007) used hyperlinks to discover the relationships between web pages.
3. PROPOSED METHOD
News sites have different update frequencies. There are multiple RSS links within a site for different categories of news articles, and each of these links is updated at different time intervals. Thus, when crawling RSS links continuously, it is important to prioritize URLs based on update frequency. Continuous crawling of all URLs consumes more resources, as it fetches and parses pages that do not change frequently, causing a significant delay in the re-fetch of the remaining pages.
To prioritize URLs, we fetch RSS links using an adaptive scheduler. The scheduler assigns shorter fetch intervals to URLs that are updated more frequently than others.
The adaptive scheduler uses the following parameters:
Default Fetch Interval: The default fetch interval for re-crawling a page, set to 3500 seconds. The crawler expects the page to have changed after this period.
Maximum Fetch Interval: The maximum amount of time after which the page becomes available for re-crawling, set to 43200 seconds (12 hours).
Incremental Rate: If a re-crawled page is found to be unmodified, the next fetch time for the page must be increased. This value (set to 0.4) is the rate at which the next fetch time is incremented.
Decremental Rate: The rate at which the next fetch time of a page is decreased if the page has been modified since the last crawl. This value is set to 0.2.
Min-Max Interval: The next fetch time of a page is bounded by this interval.
To configure appropriate values for the minimum and maximum fetch intervals, we carried out a few experiments. Initially we set the minimum interval to 20 seconds and the maximum interval to 24 hours to study the rate of change of news articles. The value of 20 seconds is much less than the time taken to complete one crawl cycle. Since we target daily newspaper sites, we kept the maximum interval at 24 hours.
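The parameter set above corresponds closely to the knobs exposed by Nutch's stock AdaptiveFetchSchedule. As a sketch, the matching nutch-site.xml entries might look as follows (property names follow the Nutch 1.x distribution, and the bounds mirror the experimental values above; both should be verified against the deployed Nutch version):

```xml
<!-- nutch-site.xml (sketch): adaptive re-crawl scheduling -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>3500</value><!-- default fetch interval, in seconds -->
</property>
<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>20</value><!-- lower bound used in the experiments -->
</property>
<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>86400</value><!-- 24 hours, the experimental upper bound -->
</property>
<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value><!-- back-off rate for unmodified pages -->
</property>
<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value><!-- speed-up rate for modified pages -->
</property>
```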
3.1. Proactive Re-crawl Estimator
While crawling RSS links, we need to ensure that only the initial set of seed URLs gets crawled every time; any outlinks from the RSS page should be discarded. To do so, we set the property update-additions to false. However, by doing so, the sub-documents generated after parsing the RSS page also get discarded. The possible solutions to this problem are:
3.1.1. Revert the Update-Additions Property
In this case, the documents generated after parsing are not discarded. However, many junk URLs unrelated to news may be added to the fetch list, increasing bandwidth usage.
3.1.2. Different Crawl Container in Every Iteration
Another way to approach the problem is to use a different crawl database during each crawl while keeping update-additions set to true. The advantages of this method are:
Only the seed URLs get crawled every time.
No junk pages appear in the crawl.
The disadvantage of this method is that proactive estimation requires information from the crawl database for setting the next fetch interval. This information is not available when doing a fresh crawl, so in this case proactive estimation will not work.
3.1.3. New Filter
The third approach is to create a new filter that removes non-RSS URLs from the fetch list. The main advantages of this method are:
No sub-document is discarded.
Only RSS links get crawled, so the crawl contains no junk or ad pages. Proactive estimation works properly, as we use the same crawl folder for each crawl.
Clearly, the third solution is the best. However, just creating a new URL filter is not enough to achieve the advantages mentioned above: the new URL filter would also be called while optimizing the crawl database module, resulting in the removal of the documents (i.e., news articles) generated after parsing. To avoid this, the architecture must be changed so that this filter is called only in the generator phase, while the remaining filters are applied in the other phases.
By default, the generator reads all the filters and runs them in order at every phase of the pipeline; if the corresponding property is empty, all URL filters available to the crawler are called. To force the generator to call only a specific set of URL filters, we modified the module so that the URL filters to be applied in the generator can be specified explicitly. This ensures that only our filter, which removes non-RSS links from the fetch list, is applied in the generator. Along with this parameter, we specify the names of the URL filters to be called by the other modules of the crawler. The system configured with the changes above is used directly for continuous crawling of news articles.
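The matching logic of such a generator-phase filter is simple. A minimal sketch in Python (illustrative only: a real Nutch URL filter is a Java plugin, and the patterns and URLs below are assumed examples of RSS-style links, not the exact ones used in our system):

```python
import re

# Patterns that identify RSS/feed URLs (assumed examples;
# a real deployment would tune these to its seed list).
RSS_PATTERNS = re.compile(r'(\.rss$|\.xml$|/rss/|/feed/?$)', re.IGNORECASE)

def generator_filter(url):
    """Return the URL if it looks like an RSS feed, else None.
    Returning None tells the generator to drop the URL from the
    fetch list, so only RSS links are ever scheduled for fetching."""
    return url if RSS_PATTERNS.search(url) else None

# Hypothetical fetch list: one RSS feed, one article page, one feed path.
fetchlist = [
    "http://example-news.com/marathi/rss/top.xml",
    "http://example-news.com/article/12345.html",
    "http://example-news.com/sports/feed",
]
rss_only = [u for u in fetchlist if generator_filter(u)]
```

Applying the filter in the generator keeps article sub-documents (produced at parse time) in the index while preventing them from ever entering the fetch list.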
3.2. Algorithm
For RSS feeds that have changed since the last fetch, decrease the time to the next fetch by a constant time DEC_TIME.
For RSS feeds that have not changed since the last fetch, increase the time to the next fetch by a constant time INC_TIME.
Calculate delta = fetchTime - modifiedTime.
Synchronize with the time of change by shifting the next fetchTime by a fraction of the difference between the last modification time and the last fetch time.
If the adjusted fetch interval is bigger than delta, then set fetchInterval = delta.
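The update rule above can be sketched as follows. This is a minimal illustration, not the exact implementation: the multiplicative use of the incremental/decremental rates mirrors the parameter description in Section 3, and sync_rate (the fraction used in the synchronization step) is an assumed constant.

```python
def next_fetch(interval, fetch_time, modified_time, changed,
               inc_rate=0.4, dec_rate=0.2,
               min_interval=3500, max_interval=43200,
               sync_rate=0.3):
    """Return (next_fetch_time, next_interval) for one page.

    interval       -- current fetch interval, seconds
    fetch_time     -- time of the fetch just performed
    modified_time  -- last observed modification time of the page
    changed        -- whether the page changed since the last fetch
    sync_rate      -- assumed fraction for the synchronization shift
    """
    if changed:
        # Page changed since the last fetch: poll it sooner.
        interval *= (1.0 - dec_rate)
    else:
        # Page unchanged: back off and save bandwidth.
        interval *= (1.0 + inc_rate)
    # Keep the interval inside the configured min-max bounds.
    interval = max(min_interval, min(interval, max_interval))
    delta = fetch_time - modified_time
    if changed and delta > 0:
        # Shift the reference time toward the observed change.
        fetch_time -= sync_rate * delta
        # Never wait longer than the observed change gap.
        if interval > delta:
            interval = delta
    return fetch_time + interval, interval
```

For example, an unmodified page with a one-hour interval is next scheduled 1.4 hours out, while a modified page has its interval shortened and its reference time pulled toward the modification time.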
4. EXPERIMENTS AND RESULTS
4.1. Experimental Setup
To perform the experiments on continuous crawling of news pages, we collected a large set of news sites for Hindi and Marathi: almost 40 Hindi and 20 Marathi news sites. For every URL in the list, we proactively estimate the next fetch time based on the popularity and modification history of the page and, in turn, minimize the resources consumed for continuous crawling of news data by a search engine.
4.2. Existing Systems
There are two existing systems that can be compared with the proposed algorithm:
1. Periodic crawler: A fetching mechanism where the crawler fetches pages periodically at a fixed time interval. The drawback of this approach is that the latest documents uploaded to the web are not crawled by the search engine until the interval elapses. Also, setting this interval is a tedious heuristic process.
2. Continuous crawler: A special case of the periodic crawler in which the interval is set to zero. The advantage of this system is that all documents are crawled by the search engine without delay, but it requires a huge amount of resources, since crawling is performed repeatedly and without any pause.
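The resource trade-off between a periodic crawler and the adaptive scheduler can be illustrated with a toy simulation. All numbers here are hypothetical (a page assumed to change every 60 minutes, observed for 24 hours), and the adaptive rule is a simplified form of the schedule in Section 3.2:

```python
def count_fetches(schedule_interval, horizon=24 * 60):
    """Fetches performed by a fixed-interval (periodic) crawler
    over `horizon` minutes."""
    return horizon // schedule_interval

def adaptive_fetches(change_every=60, horizon=24 * 60,
                     start=30, inc=0.4, dec=0.2,
                     lo=1, hi=12 * 60):
    """Fetches performed by an adaptive crawler that multiplies its
    interval by (1+inc) when the page is unchanged and by (1-dec)
    when it has changed (simplified; times in minutes)."""
    t, interval, fetches, last_change_seen = 0.0, float(start), 0, 0
    while t < horizon:
        t += interval
        fetches += 1
        # Most recent modification time of the simulated page.
        latest_change = (t // change_every) * change_every
        changed = latest_change > last_change_seen
        last_change_seen = latest_change
        interval *= (1 - dec) if changed else (1 + inc)
        interval = max(lo, min(interval, hi))  # clamp to bounds
    return fetches

periodic = count_fetches(10)   # aggressive 10-minute periodic crawl
adaptive = adaptive_fetches()  # adaptive schedule on the same page
```

Under these assumed numbers, the adaptive crawler settles into an interval near the page's true change period and performs far fewer fetches than the aggressive periodic crawler, while still observing most updates.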
4.3. Results
The X-axis denotes consecutive crawls of a URL, and the Y-axis denotes the interval in minutes between the x-th (current) and (x-1)-th crawls. The actual modification time of a URL is not available to the scheduler; the scheduler tries to synchronize with the update period iteratively by changing the fetch interval. The fetch interval is decreased if the page has been modified and increased otherwise. We observe from the graph that, after a few crawls of a URL, the fetch interval oscillates between 50 and 90 minutes. This indicates that the modification frequency of the page is consistent and lies within this range.
To study the behaviour across all URLs, we carried out experiments to find the average update frequency of each URL for the Hindi and Marathi news search systems. For these experiments, we set the lower bound of the interval to 20 seconds and the upper bound to 24 hours, as in the previous case.
Figure 1 shows the average fetch interval for each RSS link of Marathi news sites. We observed that most of the links had an average interval of 12 hours, which indicates a modification frequency close to 12 hours.
Figure 1: Average fetch interval for Marathi
Figure 2 shows the average fetch interval for each RSS link of Hindi news sites. We observed that Hindi news sites update more frequently than Marathi sites: Hindi news links have an average update period of around 1.2 hours. This study helps us configure the re-crawl period to a value that makes the best use of the available resources.
Figure 2: Average fetch interval for Hindi
5. CONCLUSIONS
We have attempted to solve the problem of incremental crawling. We created a framework for continuous crawling of news articles to maintain freshness of the crawl. In particular, we addressed the question of what architectural changes are needed to build a news search system that demands continuous updating of its crawl and index. We listed the challenges faced in maintaining a near real-time index for news search and provided solutions for them.
REFERENCES
Bra, P. D., Houben, G.-J., Kornatzky, Y., & Post, R. (1994). Information Retrieval in Distributed Hypertexts. Proceedings of RIAO, Intelligent Multimedia, Information Retrieval Systems and Management. New York.
Cho, J., & Garcia-Molina, H. (1999). The Evolution of the Web and Implications for an Incremental Crawler. Stanford.
Cho, J., & Garcia-Molina, H. (2000). Synchronizing a database to Improve Freshness. MOD. Dallas, TX, USA: ACM.
Hersovic, M., Jacovi, M., Maarek, Y. S., Pelleg, D., Shtalhaim, M., & Ur, S. (1998). The shark-search algorithm. An application: tailored Web site mapping. Proceedings of the seventh international conference on World Wide Web (pp. 317-326). Amsterdam: Elsevier Science Publishers.
Jamali, M., Sayyadi, H., Hariri, B. B., & Abolhassani, H. (2006). A Method for Focused Crawling Using Combination of
Link Structure and Content Similarity. Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence
(pp. 753-756). Washington, DC, USA: IEEE Computer Society.
Kan, M.-Y. (2004). Web page classification without the web page. Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters. New York: ACM.
Kan, M.-Y., & Thi, H. O. (2005). Fast webpage classification using URL features. Proceedings of the 14th ACM
international conference on Information and knowledge management. ACM.
Sigurosson, K. (2005). Incremental crawling with Heritrix. In Proceedings of the 5th International Web Archiving
Workshop (IWAW 05).
Xu, Q., & Zuo, W. (2007). First-order focused crawling. Proceedings of the 16th international conference on World Wide
Web (pp. 1159-1160). New York, NY, USA: ACM.