Weitere ähnliche Inhalte
Ähnlich wie 50120140503012
Ähnlich wie 50120140503012 (20)
Mehr von IAEME Publication
Mehr von IAEME Publication (20)
Kürzlich hochgeladen (20)
50120140503012
- 1. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 5, Issue 3, March (2014), pp. 107-114 © IAEME
107
WEB INFORMATION RETRIEVAL USING AUTOMATIC
MULTI-DOCUMENT SUMMARIZATION
Rahul Shankarrao Khokale
Department of Computer Science & Engineering,
Priyadarshini Indira Gandhi College of Engineering,
NAGPUR (India)
Mohammad Atique
Post Graduate Department of Computer Science,
Sant Gadge Baba Amravati University,
AMRAVATI (India)
ABSTRACT
Today, internet has become the most important source of information. People are highly
accustomed to the use of internet for acquiring information which they need. Many times, it is
revealed that, the information seeker does not get relevant information very easily due to the
presence of non-relevant web pages. This paper addresses the problem of effective information
retrieval from the web. In this paper, the notion of Web Information Retrieval using Automatic
Multi-document Summarization is presented. The proposed work is blend of Web technology and
Natural Language Processing. When user will fire the query, the system tries to fetch web pages
from different web servers, and they are indexed as per the order of relevance. The degree of
relevance is not determined by the how many times the keywords of query is present in the document
but it is determined on the basis of semantic content of the document and the user query
Keywords: Web Information Retrieval, Automatic Multi-Document Summarizations,
Web Technology, Information Retrieval and Natural Language Processing
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING &
TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 5, Issue 3, March (2014), pp. 107-114
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2014): 8.5328 (Calculated by GISI)
www.jifactor.com
IJCET
© I A E M E
- 2. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 5, Issue 3, March (2014), pp. 107-114 © IAEME
108
I. INTRODUCTION
1.1 Information Retrieval
The Internet and the Web offer new opportunities and challenges to information retrieval
researchers. With the information explosion and never ending increase of web pages as well as
digital data, it is very hard to retrieve useful and reliable information from the Web. Materials from
millions of web pages from organizations, institutions and personnel have been made public
electronically accessible to millions of interested users. The Web uses an addressing system called
Uniform Resource Locators (URLs) to represent links to documents on web servers. These URLs
provide location information. Like titles of books in traditional libraries, no one can remember all
URLs on the Web. Web search engines allow us to locate the internet resources through thousands of
Web pages. It is almost impossible to get the right information as there is too much irrelevant and out
dated information. Information retrieval systems provide useful information in libraries to
researchers.
The Web can be viewed as a virtual library. Information retrieval is an important and major
component of the Internet and the Web in the information age and should play an important role in
knowledge discovery. General search engines such as, Google, AltaVista, Excite are considered as
the powerful search engines so far. Most of the current search engines are based on words, not the
concepts. When searching for certain information or knowledge with a search engine, one can only
use a few key words to narrow down the search. The result of the search is tens or maybe hundreds
of relevant and irrelevant links to various Web pages.
In spite of the voluminous studies in the field of intelligent retrieval systems, effective
retrieving of information has been remained an important unsolved problem. Implementations of
different conceptual knowledge in the information retrieval process such as ontology have been
considered as a solution to enhance the quality of results. Furthermore, the conceptual formalism
supported by typical ontology may not be sufficient to represent uncertainty information due to the
lack of clear-cut boundaries between concepts of the domains [1] “Information retrieval is a field
concerned with the structure, analysis, organization, storage, searching, and retrieval of information.”
(Salton, 1968).
1.2 Automatic Text Summarization
Automatic Text Summarization means the process of extraction and representation of most
important content from the source document in the condensed form. This process involves Document
Preprocessing, Feature Extraction, Sentence Ranking and Summary Generation [3]. Preprocessing is
accomplished by Tokenization, Sentence Splitting, POS Tagging etc. Feature Extraction includes
Word Frequency Extraction and Sentence Ordering. Weighing Sentences helps to score sentences
required for Sentence Ranking. Summary Generation is the resulting phase of automatic text
summarization.
1.3 Multi-Document Summarization
Multi-document text summarization deals with retrieval of salient information about a topic
from various sources. The task of multi-document summarization is to identify a set of sentences,
phrases or some generated semantically correct language units carrying some useful information.
Then significant sentences are extracted from this set and re-organized them to get multi-documents’
summary [1]. Let D = {D1, D2,.......,Dn} be a set of documents. Let S = {S1, S2, ...... ,Sn} be a set of
summary of each document respectively.
- 3. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 5, Issue 3, March (2014), pp. 107-114 © IAEME
109
Figure 1: Multi –Document Summarization
II. RELATED WORK
In recent years, the research focus in the domain of natural language processing and
information retrieval has been shifted to the area of automatic document summarization. Automatic
document summarization is of two types : abstractive and extractive.
The research in this field began with Term Frequency based summarization. Following
researchers used term frequency based approach for document summarization. G. Salton, 1989,
Jun'ichi Fukumoto, 2004, You Ouyang, 2009 and Mr.Vikrant Gupta, 2012 [1]. Inderjeet Mani, 1997,
Rada Mihalcea, 2004, Junlin Zhang, 2005, Xiaojun Wan, 2008, Kokil Jaidka, 2010 [1] carried out
research for document summarization using Graph-based approach. Kathleen McKeown, 1995,
Xiaojun Wan, 2007 used Time-Based method for document summarization. Sentence Correlation
method was implemented for document summarization by Shanmugasundaram Hariharan, 2012,
Tiedan Zhu, 2012. Clustering-Based method for document summarization was proposed by Jade
Goldstein, 2000. Vikrant Gupta el at [2], developed an auto-summarization tool using statistical
techniques. The techniques involve finding the frequency of words, scoring the sentences, ranking
the sentences etc. Yogan Jaya Kumar et al. [3] discussed Automatic Multi Document Summarization
Approaches. Y. Surendranadha Reddy el at [4] presented a summarization system that produces a
summary for a given web document based on sentence importance measures such as sentence
ranking. Tiedan Zhu et al [5] proposed an improved approach to sentence ordering for Multi-
document Summarization.
III. PROPOSED WORK
This paper deals with the framework for “Web Information Retrieval based on Multi-Document
Summarization”. The proposed framework is shown in Figure 2. However, in this paper, the
emphasis is given on the multi-level document summarization. The basic purpose of this framework
is to enhance the effectiveness of web information retrieval. As the indexing and ranking of the
retrieved documents is supported by intelligent decision making system which is based on fuzzy
inference rules, the degree of relevance can be increased.
Document
(D1)
Document
(D2)
Document
(Dn)
Summary
(S1)
Summary
(S2)
Summary
(Sn)
Significant and
common
sentences
extraction
Sentences Re-
ordering based
on semantic
contents
Multi-
Document
Summary
- 4. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 5, Issue 3, March (2014), pp. 107-114 © IAEME
110
User
Query
Figure 2: Framework for web information retrieval
3.1 Intelligent Query Processing
When a query is written by the user and submitted to the system, it is required to manipulate
it and represent it in proper form. Intelligent Query Processing (IQP) module helps user to formulate
his query. WordNet is used for identifying synonyms and thesaurus. This intelligent unit tries to
understand the user need and accordingly it classifies the query into any of three types :
Informational, Navigational and Transactional.
3.2 Search Engine
Search Engine uses web crawler to traverse the World Wide Web to fetch matching URLs.
3.3 Multi-Document Summarization
The n number of links/URLs is the input to the Multi-Document Summarization (MDS) unit.
Each of these inputs can be HTML web page or a text/PDF file. MDS finds summary of each
document separately and finally combines all of them to form single summary. It involves significant
sentences identification, sentences reordering etc. The detail algorithm for this is discussed in section
IV.
3.4 Indexing
The process of generating inverted indices of the retrieved documents is Indexing. It is also
an important step in information retrieval.
Intelligent
Query
Processing
Search Engine
URL1
URL2
URL3
URLn
Multi-
document
Summarization
Indexing
Ranking
Documents
- 5. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 5, Issue 3, March (2014), pp. 107-114 © IAEME
111
3.5 Ranking
Ranking means determining the weight or rank of each retrieved web page or web document.
Our page rank strategy is based on the summary generated by MDS unit. The Page Rank Algorithm
will use fuzzy inference system to judge the relevance of the web page according to the user query.
IV. AUTOMATIC DOCUMENT SUMMARIZATION MODEL
To summarize a document or documents, a reader has to understand the document(s) and
integrate information and make connections across sentences to form a coherent discourse
representation. We designed and developed a new generic algorithm for automatic document
summarization based on the analysis of human cognition and intelligence. In order make this model
applicable for summarization we define the concept of ‘event’. ‘Event’ is a cognitive psychological
concept, and can be either a story or a sentence in microstructure. We have treated each sentence as
an event in this paper. The cognitive model is shown below
document
Figure 3: Cognitive Model
As each event is representing sentence in the document, it is a combination of two parts :
Subject and Predicate. e.g. Consider the example of an event :
Event : Tiger is a wild animal
Subject Predicate
This event can be represented in predicate form by using FOPL syntax
wild_animal(tiger)
The representation of events in predicate form is required to create the knowledge about the
document. In addition to that, we have defined inference rules to understand the connection of one
sentence to others. This helps us to decide the significance of sentence within the document and used
for sentence re-ordering.
Event 1 Event 1 Event N
- 6. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 5, Issue 3, March (2014), pp. 107-114 © IAEME
112
The Algorithm for Automatic text/document summarization is discussed below:
Figure 4: Algorithm for Text summarization
V. EXPERIMENTAL RESULTS
We have implemented the multi-document summarization system in JAVA.
5.1 Test Data/Corpus : We have used the standard CACM test data with 3204 test samples.
Following are the documents chosen from CACM test data.
Document : CACM—0276.html
START
Read the
document
Tokenization
Stemmer
POS Tagger
Summary
Cognitive Model
Program Organization and Record Keeping for Dynamic Storage
Allocation
The material presented in this paper is part
of the design plan of the core allocation portion
of the ASCII-MATIC Programming System. Project ASCII-MATIC
is concerned with the application of computer
techniques to the activities of certain headquarters
military intelligence operations of the U. Army.
CACM October, 1961
Holt, A. W.
- 7. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 5, Issue 3, March (2014), pp. 107-114 © IAEME
113
Summary :
5.2 Term Frequency Computation
The term frequency of all the terms in the sentence is calculated individually. The sum of all
these frequencies will give the T(si,),term frequency of sentence si. This can be calculated by:
ܶሺܵ݅ሻ ൌ ݐ כ ݂ሺ݅ݐሻ
௧ୀଵ
Where, ti is the ith
term in the sentence. And t*f(ti) is the term frequency of term ti.
Therefore, T(Si) for document 1
T(Si)=(1+1+1+1+1+1+1+1+2+5+1+1+1+1+1+1+3+1+1+1+1+2+2+1+1+1+1+1+1+1+1+1
+1+1+1+1+1+1+1+1+1)
= 50
5.3 TF-IDF or term-frequency inverse-document-frequency
TF-IDF or term-frequency inverse-document-frequency is computed as the ratio of the
quantity of terms in that document to the frequency of the quantity of documents containing that
terms. For above document the TF_IDF is given below:
ܶܨܦܫ_ܨ ൌ
41
50
=0.82
V. CONCLUSION
In this paper, we have proposed a framework for web information retrieval using multi-
document summarization. The emphasis was given on the multi-level document summarization. The
basic purpose of this framework is to enhance the effectiveness of web information retrieval. As the
indexing and ranking of the retrieved documents is supported by intelligent decision making system
which is based on fuzzy inference rules, the degree of relevance can be increased. The experiment
are carried out on the CACM test data, and it is found that information retrieval results can be
improved after multi-document summarization process is performed.
Title : Program Organization and Record Keeping for Dynamic
Storage Allocation
Author : Holt, A. W.
Publication & Year : CACM October, 1961
Part of the design plan of the core allocation portion of
the ASCII-MATIC Programming System. Project ASCII-MATIC deals
with the application of computer techniques to the activities
of certain headquarters military intelligence operations of the
U. Army.
- 8. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 5, Issue 3, March (2014), pp. 107-114 © IAEME
114
REFERENCES
[1] Md. Majharul Haque, Suraiya Pervin, and Zerina Begum, Literature Review of Automatic
Multiple Documents Text Summarization, International Journal of Innovation and Applied
Studies, Vol. 3 No. 1 May 2013, pp. 121-129
[2] Vikrant Gupta, Priya Chauhan, Sohan Garg, Anita Borude, Shobha Krishnan, An Statistical
Tool for Multi-Document summarization, International Journal of Scientific and Research
Publications, Volume 2, Issue 5, May 2012
[3] Yogan Jaya Kumar and Naomie Salim, Automatic Multi Document Summarization
Approaches, Journal of Computer Science 8 (1): 133-140,
[4] Y. Surendranadha Reddy and A.P. Siva Kumar, An Efficient Approach for Web document
summarization by Sentence Ranking, International Journal of Advanced Research in
Computer Science and Software Engineering, Volume 2, Issue 7, July 2012
[5] Tiedan Zhu, Xinxin Zhao, An Improved Approach to Sentence Ordering For Multi-document
Summarization, 2012 IACSIT Hong Kong Conferences, IPCSIT vol. 25 (2012) © (2012)
IACSIT Press, Singapore
[6] Nikola Vlahovic, Information Retrieval and Information Extraction in Web 2.0 environment,
International Journal of Computers, Issue 1, Volume 5, 2011
[7] Yi Guo and George Stylios, An Intelligent Algorithm For Automatic Document
Summarization.
[8] Prakasha S, Shashidhar HR and Dr. G T Raju, “A Survey on Various Architectures, Models
and Methodologies for Information Retrieval”, International Journal of Computer
Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 182 - 194, ISSN Print:
0976 – 6367, ISSN Online: 0976 – 6375.
[9] Mousmi Chaurasia and Dr. Sushil Kumar, “Natural Language Processing Based Information
Retrieval for the Purpose of Author Identification”, International Journal of Information
Technology and Management Information Systems (IJITMIS), Volume 1, Issue 1, 2010,
pp. 45 - 54, ISSN Print: 0976 – 6405, ISSN Online: 0976 – 6413.