Proactive Approach to Estimate the Re-crawl Period for Resource
Minimization in Information Retrieval
Emmanuel Rajarathnam¹ and K Arthi²
¹ Research Scholar, Veltech University, Avadi, Chennai, Tamil Nadu, India.
² Professor, Veltech University, Avadi, Chennai, Tamil Nadu, India.
e_rajarathnam@rediffmail.com, arthi@veltechuniv.edu.in
ABSTRACT
A news search engine is a software system designed to search news articles published on various news sites. We
develop a near real-time news search engine for the Marathi and Hindi languages using the open-source crawler Nutch and the
indexer Solr. In this work, we address in particular the question of what architectural changes need to be made in Nutch
and Solr to build a news search system that demands continuous updating of its crawl and index. We list the
challenges faced in maintaining a near real-time index for news search and provide solutions for each of them. Our customized
open-source framework can be adapted to build a news search solution for any language.
We describe a methodology for maintaining the freshness of the crawl by creating an incremental crawling architecture, and
we show how this architecture maintains crawl freshness through a case study of Marathi and Hindi news search. We propose
a method for prioritizing URLs for fetching in a resource-constrained scenario, and describe a technique to score a URL
before it is fetched, based on its own features and on features of the parent URLs that were crawled previously.
KEYWORDS
Re-Crawl, Fetch time estimation, proactive approach
1. INTRODUCTION
Crawling is the process of collecting documents from the web and indexing them for retrieval. The World Wide Web has a huge
collection of documents that must be processed and stored for generic web search. There also exist other search systems,
viz. enterprise search, focused search, domain search, etc., that concentrate on a particular section of the web. The
crawling infrastructure required to process the entire WWW is so large that small-scale and academic organizations cannot
afford it. In most academic and small-scale organizations, the crawling infrastructure has constraints on the availability
of resources such as processing power, bandwidth and memory.
Even though open-source crawling frameworks help in crawling the entire web, resource constraints limit
the number of documents that can be crawled. Resource constraints demand three major tasks for efficient retrieval:
1. Limiting the maximum number of URLs crawled at each depth
2. URL prioritization
3. URL scheduling
In the absence of resource constraints, none of the above is required, as there are enough resources to crawl
any number of URLs at a time. Resource constraints force the crawler to fetch a limited number of URLs at each depth.
Hence, the URLs for crawling at each depth must be chosen carefully. It is also highly recommended to decide the
re-crawl period carefully so as not to overload the crawler. This case is prevalent in a news search engine. A news search
engine is a software system designed to search news articles published on various sites. News gets added almost every
minute (often more frequently) on different sites, and it is cumbersome to manually go through each newspaper site and
browse for relevant news. To obtain news relevant to the user's information need from different sources, a news search
engine is required. Any news search engine is designed to fulfil 3 core objectives:
1. Satisfying the information need of the user: This forms the primary objective of any search engine. The query can
seek news about a particular location or about a recent event. A news search engine should be able to retrieve news
articles relevant to the user's information need and rank them based on relevance and on published date and time.
2. One-stop solution for all news articles: News articles appear in different categories such as sports, business,
general, etc. A news search engine acts as a single repository of news from various sources and categories. This
enables the user to look for relevant news in a single location without manually browsing each site.
3. Multiple perspectives on the same news: News can be looked at from different perspectives, and presenting all these
views to the user is important. For example, one newspaper's author may be biased towards a particular political party
and praise a move, while others criticize the same move.
In this work, we provide a solution for resource minimization in crawling by proactively choosing the re-crawl period
for every document in the search index. We also evaluate the proposed method by building a news search engine for Marathi
and Hindi.
2. RELATED WORK
While incremental crawling and scheduling help keep the crawl up to date, they require a lot of resources, which may
not be available in a small-scale organization. In small organizations or academic institutions, the available resources
are limited, which limits the throughput of the crawler. In such a scenario, important URLs must be crawled first.
Scheduling policies help optimize the bandwidth required for crawling by increasing the crawl period of pages that
are not updated frequently. However, even among URLs with a high frequency of updates, we need a priority order on the
URLs to be fetched first.
A lot of work exists on classifying URLs based on page content. Page-content-based classification can be helpful
in deciding the next fetch interval of a URL. However, classifying URLs before fetching is required for prioritizing
URLs at the current depth.
There are many research contributions on prioritizing URLs during fetching. Some of them use purely URL-based
features, while others use parent information, contextual information, etc. Let us look at a few of them.
Min-Yen Kan (2004) quantified the performance of web page classification using only URL features (URL
text) against anchor text, title text and page text, and showed that URL features, when treated correctly, can exceed
the performance of some source-document-based features.
Min-Yen Kan et al. (2005) modelled URLs with additional features: component length, content, orthography, token
sequence and precedence. The resulting features, used in supervised maximum entropy modelling, significantly improve
over existing URL features.
Fish search (Bra et al., 1994), one of the first dynamic Web search algorithms, takes as input a seed URL and a
search query, and dynamically builds a priority list (initialized to the seed URL) of the next URLs (hereafter called nodes)
to be explored. As each document's text becomes available, it is analysed by a scoring component that evaluates whether it
is relevant or irrelevant to the search query (a 1-0 value) and, based on that score, a heuristic decides whether or not to
pursue the exploration in that direction.
The shark search algorithm (Hersovic, 1998), a more aggressive variant, instead of a binary evaluation of document
relevance returns a score between 0 and 1 to evaluate the relevance of documents to a given query, which has a direct
impact on the priority list. Shark search calculates the potential score of child nodes not only by propagating ancestral
relevance scores deeper down the hierarchy, but also by making use of the meta-information contained in the links to
documents.
Jamali et al. (2006) used link structure analysis together with page-context similarity to determine the download
priority of pages, while Xu & Zuo (2007) use hyperlinks to discover the relationships between web pages.
3. PROPOSED METHOD
News sites update at different frequencies. Within a site there are multiple RSS links for different categories of
news articles, and each of these links gets updated at different time intervals. Thus, when crawling RSS links
continuously, it is important to prioritize URLs based on update frequency. Continuous crawling of all URLs consumes
more resources, as it fetches and parses pages that do not change frequently, causing a significant delay in re-fetching
the remaining pages.
To prioritize URLs, we fetch RSS links using an adaptive scheduler. The scheduler assigns shorter fetch
intervals to URLs that are updated more frequently than others.
The adaptive scheduler uses the following parameters:
• Default Fetch Interval: We set the default fetch interval for re-crawling a page to 3500 seconds. After this period,
the crawler expects the page to have changed.
• Maximum Fetch Interval: The maximum amount of time after which the page becomes available for re-crawling.
This value is set to 43200 seconds, which amounts to 12 hours.
• Incremental Rate: If a re-crawled page is found to be unmodified, we need to increase its next
fetch interval. This value (set to 0.4) indicates the rate at which the increase takes place.
• Decremental Rate: The rate at which the next fetch interval of a page is decreased if the page has been modified
since the last crawl. This value is set to 0.2.
• Min-Max Interval: The next fetch time of a page is bounded by the values of this interval.
To configure appropriate values for the minimum and maximum fetch intervals, we carried out a few
experiments. Initially, we set the minimum interval to 20 seconds and the maximum interval to 24 hours to study the rate of
change of news articles. The value of 20 seconds is much less than the time taken to complete one crawl cycle. Since we
target daily newspaper sites, we kept the maximum interval at 24 hours. A configuration sketch with these parameters is
given below.
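The parameter values above correspond closely to the defaults of Apache Nutch's adaptive fetch schedule, so one way to wire them in is through the crawler configuration. The sketch below illustrates this in Java via a Hadoop Configuration object; the property keys follow Nutch's AdaptiveFetchSchedule, but exact key names can vary between Nutch versions, so this should be read as an illustrative configuration rather than a verbatim one.

    import org.apache.hadoop.conf.Configuration;

    // Illustrative wiring of the Section 3 scheduler parameters.
    // Property keys follow Nutch's AdaptiveFetchSchedule; verify them
    // against the nutch-default.xml of the Nutch version in use.
    public class AdaptiveSchedulerConfig {
        public static Configuration build() {
            Configuration conf = new Configuration();
            // Select the adaptive schedule instead of the default fixed-interval one.
            conf.set("db.fetch.schedule.class",
                     "org.apache.nutch.crawl.AdaptiveFetchSchedule");
            conf.setInt("db.fetch.interval.default", 3500);    // default fetch interval (s)
            conf.setInt("db.fetch.interval.max", 43200);       // maximum fetch interval: 12 h
            conf.setFloat("db.fetch.schedule.adaptive.inc_rate", 0.4f);  // incremental rate
            conf.setFloat("db.fetch.schedule.adaptive.dec_rate", 0.2f);  // decremental rate
            conf.setFloat("db.fetch.schedule.adaptive.min_interval", 20.0f);     // lower bound (s)
            conf.setFloat("db.fetch.schedule.adaptive.max_interval", 86400.0f);  // upper bound: 24 h
            return conf;
        }
    }

The same values can equally be set in nutch-site.xml; the point is only that the Default/Maximum Fetch Interval and the Incremental/Decremental Rates of the adaptive scheduler are ordinary crawler properties rather than code changes.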
3.1. Proactive Re-crawl Estimator
While crawling RSS links, we need to ensure that only the initial set of seed URLs is crawled every time; any outlinks from
the RSS page should be discarded. To do so, we set the update-additions property to false. But by doing so, the
sub-documents generated after parsing the RSS page also get discarded. The possible solutions to this problem are:
3.1.1. Revert Update-Additions property
In this case, the documents generated after parsing won't be discarded. However, a lot of junk URLs unrelated to news
may be added to the fetch list, increasing bandwidth usage.
3.1.2. Different Crawl Container in Every iteration
Another way to approach the problem is to use a different crawl database for each crawl while keeping
update-additions set to true. The advantages of this method are:
• Only the seed URLs get crawled every time.
• No junk pages in the crawl.
The disadvantage of this method is that proactive estimation requires information from the crawl database for setting
the next fetch interval. This information is not available when doing a fresh crawl, so in this case proactive estimation
won't work.
3.1.3. New Filter
The third approach is to create a new filter that filters out non-RSS URLs from the fetch list. The main advantages of this
method are:
• No sub-document is discarded.
• Only RSS links get crawled, so the crawl contains no junk or ad pages, and proactive estimation works properly
because we use the same crawl folder for each crawl.
Clearly, the third solution is the best of the three. However, just creating a new URL filter is not enough
to achieve the advantages mentioned above: the new URL filter would also be called while updating the crawl database,
resulting in the removal of the documents (i.e., news articles) generated after parsing. To avoid this, the architecture
must be changed to call this filter only in the generator phase, while the rest of the filters are applied in the other
phases.
By default, the generator reads all the filters and runs this set of filters in order at every phase of the pipeline; if
the corresponding property is empty, then all URL filters available to the crawler are called. To force the generator to
call only a specific set of URL filters, we modified the module so that the URL filters to be applied in the generator can
be specified. This ensures that only our filter, which removes non-RSS links from the fetch list, is applied in the
generator. Along with this parameter, we specify the names of the URL filters that need to be called by the other modules
of the crawler. The system configured with the changes mentioned above is used directly for continuous crawling of news
articles; a sketch of such a filter follows.
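A minimal sketch of such a filter is given below, assuming Nutch's URLFilter plugin interface, in which returning the URL keeps it in the fetch list and returning null drops it. The RSS-detection rule used here, a simple pattern over the URL string, is our own placeholder; a real deployment would encode the feed URL conventions of the target news sites.

    import java.util.regex.Pattern;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    // Sketch of a URL filter that keeps only RSS feed links in the fetch list.
    public class RSSOnlyURLFilter implements URLFilter {

        // Placeholder heuristic: treat URLs mentioning rss/feed/.xml as feeds.
        private static final Pattern RSS_PATTERN =
                Pattern.compile(".*(rss|feed|\\.xml).*", Pattern.CASE_INSENSITIVE);

        private Configuration conf;

        @Override
        public String filter(String urlString) {
            // Keep the URL if it looks like a feed; filter it out otherwise.
            return RSS_PATTERN.matcher(urlString).matches() ? urlString : null;
        }

        @Override
        public void setConf(Configuration conf) { this.conf = conf; }

        @Override
        public Configuration getConf() { return conf; }
    }

Registered as a plugin but invoked only from the generator, as described above, the filter restricts the fetch list to RSS links while leaving the parsed news documents untouched in the crawl database.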
3.2. Algorithm
• For RSS feeds that have changed since the last fetch, decrease the time to the next fetch by a constant DEC_TIME.
• For RSS feeds that have not changed since the last fetch, increase the time to the next fetch by a constant INC_TIME.
• Calculate delta = fetchTime - modifiedTime.
• Synchronize with the time of change by shifting the next fetchTime by a fraction of the difference between the last
modification time and the last fetch time.
• If the adjusted fetch interval is bigger than delta, then set fetchInterval = delta.
A sketch of this rule is given below.
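The following minimal, self-contained sketch implements the update rule above (all times in seconds). We read the increase and decrease steps as the multiplicative Incremental/Decremental Rates of Section 3 rather than additive constants; SYNC_RATE, the fraction used to shift the next fetch time toward the observed change time, is an assumed value not specified in the text.

    // Sketch of the Section 3.2 re-crawl update rule (times in seconds).
    public final class RecrawlEstimator {

        static final float INC_RATE = 0.4f;          // back off when the feed is unchanged
        static final float DEC_RATE = 0.2f;          // speed up when the feed has changed
        static final float SYNC_RATE = 0.3f;         // re-sync fraction (assumed value)
        static final long MIN_INTERVAL = 20;         // lower bound (s)
        static final long MAX_INTERVAL = 24 * 3600;  // upper bound (s)

        /** Returns the next fetch interval for one RSS feed. */
        static long nextInterval(long interval, long fetchTime, long modifiedTime,
                                 boolean modified) {
            if (modified) {
                interval = (long) (interval * (1.0f - DEC_RATE)); // changed: fetch sooner
            } else {
                interval = (long) (interval * (1.0f + INC_RATE)); // unchanged: fetch later
            }
            long delta = fetchTime - modifiedTime;  // staleness observed at fetch time
            if (delta > 0 && interval > delta) {
                interval = delta;  // never wait longer than the observed staleness
            }
            return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
        }

        /** Returns the next fetch time, shifted toward the observed change time. */
        static long nextFetchTime(long fetchTime, long modifiedTime, long interval) {
            long delta = fetchTime - modifiedTime;
            return fetchTime - (long) (delta * SYNC_RATE) + interval;
        }
    }

On a feed that changes roughly hourly, successive calls to nextInterval shrink the interval while changes keep being observed and grow it again once fetches start arriving before any change, which is exactly the oscillation reported in Section 4.3.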
4. EXPERIMENTS AND RESULTS
4.1. Experimental Setup
To perform the experiments on continuous crawling of news pages, we collected a large set of news sites for Hindi and
Marathi; the numbers of news sites covered are approximately 40 for Hindi and 20 for Marathi, respectively. For every URL
in the list, we proactively estimate the next fetch time based on the popularity and modification history of the page, and
in turn minimize the resources consumed by a search engine for continuous crawling of news data.
4.2. Existing Systems
There are two existing systems that can be compared with the current algorithm:
1. Periodic crawler: A fetching mechanism where the crawler fetches the pages periodically at a fixed time
interval. The drawback of this approach is that the latest documents uploaded to the web will not be crawled by
the search engine until the interval elapses. Also, setting this interval is a tedious heuristic process.
2. Continuous crawler: A special case of the periodic crawler where the interval is set to zero. The advantage of this
system is that all documents are crawled by the search engine without delay, but it requires a huge amount of
resources, since crawling is repeated without any delay.
4.3. Results
For each URL, the X-axis denotes consecutive crawls and the Y-axis denotes the interval in minutes between the xth
(current) and (x-1)th crawls. The actual modification time of a URL is not available to the scheduler, which therefore
tries to synchronize with the update period iteratively by changing the fetch interval: the fetch interval is decreased
when the page has been modified and increased when it has not. We observe from the graph that after a few crawls of a URL,
the fetch interval oscillates between 50 and 90 minutes. This indicates that the modification frequency of the page is
consistent and lies within this range.
To study the behaviour across all URLs, we carried out experiments to find the average update frequency of each
URL for the Hindi and Marathi news search systems. For these experiments, we set 20 seconds as the lower bound on the
interval, while the upper bound was 24 hours, as in the previous case.
Figure 1 shows the average fetch interval for each RSS link of the Marathi news sites. We observed that most of the links
had an average interval of 12 hours, indicating that these links have a modification frequency close to 12 hours.
Figure 1: Average fetch interval for Marathi
Figure 2 shows the average fetch interval for each RSS link of the Hindi news sites. We observed that Hindi news sites
update more frequently than Marathi sites: Hindi news links have an average update period of around 1.2 hours. This study
helps us configure the re-crawl period to a value at which resources are utilized most efficiently.
Figure 2: Average fetch interval for Hindi
5. CONCLUSIONS
We have attempted to solve the problem of incremental crawling. We have created a framework for continuous crawling of
news articles that maintains the freshness of the crawl. In particular, we addressed the question of what architectural
changes are needed to build a news search system that demands continuous updating of its crawl and index. We listed the
challenges faced in maintaining a near real-time index for news search and provided solutions for them.
REFERENCES
Bra, P. D., Houben, G.-J., Kornatzky, Y., & Post, R. (1994). Information Retrieval in Distributed Hypertexts. Proceedings
of RIAO, Intelligent Multimedia, Information Retrieval Systems and Management. New York.
Cho, J., & Garcia-Molina, H. (1999). The Evolution of the Web and Implications for an Incremental Crawler. Stanford.
Cho, J., & Garcia-Molina, H. (2000). Synchronizing a database to Improve Freshness. MOD. Dallas, TX USA: ACM.
Hersovic, M., Jacovi, M., Maarek, Y. S., Pelleg, D., Shtalhaim, M., & Ur, S. (1998). The shark search algorithm. An
application: tailored Web site mapping. Proceedings of the seventh international conference on World Wide Web (pp. 317-
326). Amsterdam: Elsevier Science Publishers.
Jamali, M., Sayyadi, H., Hariri, B. B., & Abolhassani, H. (2006). A Method for Focused Crawling Using Combination of
Link Structure and Content Similarity. Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence
(pp. 753-756). Washington, DC, USA: IEEE Computer Society.
Kan, M.-Y. (2004). Web page classification without the web page. Proceedings of the 13th international World Wide Web
conference on Alternate track papers & posters. New York: ACM.
Kan, M.-Y., & Thi, H. O. (2005). Fast webpage classification using URL features. Proceedings of the 14th ACM
international conference on Information and knowledge management. ACM.
Sigurosson, K. (2005). Incremental crawling with Heritrix. In Proceedings of the 5th International Web Archiving
Workshop (IWAW 05).
Xu, Q., & Zuo, W. (2007). First-order focused crawling. Proceedings of the 16th international conference on World Wide
Web (pp. 1159-1160). New York, NY, USA: ACM.