Information on the web is tremendously increasing in
recent years with the faster rate. This massive or voluminous data
has driven intricate problems for information retrieval and
knowledge management. As the data resides in a web with several
forms, the Knowledge management in the web is a challenging
task. Here the novel 'Semantic Web' concept may be used for
understanding the web contents by the machine to offer
intelligent services in an efficient way with a meaningful
knowledge representation. The data retrieval in the traditional
web source is focused on 'page ranking' techniques, whereas in
the semantic web the data retrieval processes are based on the
‘concept based learning'. The proposed work is aimed at the
development of a new framework for automatic generation of
ontology and RDF to some real time Web data, extracted from
multiple repositories by tracing their URI’s and Text Documents.
Improved inverted indexing technique is applied for ontology
generation and turtle notation is used for RDF notation. A
program is written for validating the extracted data from
multiple repositories by removing unwanted data and considering
only the document section of the web page.
2024: Domino Containers - The Next Step. News from the Domino Container commu...
An Implementation of a New Framework for Automatic Generation of Ontology and RDF to Real-Time Web and Journal Data
1.
Abstract— Information on the web is tremendously increasing in
recent years with the faster rate. This massive or voluminous data
has driven intricate problems for information retrieval and
knowledge management. As the data resides in a web with several
forms, the Knowledge management in the web is a challenging
task. Here the novel 'Semantic Web' concept may be used for
understanding the web contents by the machine to offer
intelligent services in an efficient way with a meaningful
knowledge representation. The data retrieval in the traditional
web source is focused on 'page ranking' techniques, whereas in
the semantic web the data retrieval processes are based on the
‘concept based learning'. The proposed work is aimed at the
development of a new framework for automatic generation of
ontology and RDF to some real time Web data, extracted from
multiple repositories by tracing their URI’s and Text Documents.
Improved inverted indexing technique is applied for ontology
generation and turtle notation is used for RDF notation. A
program is written for validating the extracted data from
multiple repositories by removing unwanted data and considering
only the document section of the web page.
Index Terms— Semantic Web, Resource description
framework, Ontology, Improved inverted indexing technique,
Knowledge management.
I. INTRODUCTION
World Wide Web (WWW) is considered as a global
information repository that identifies documents and other web
resources by Uniform Resource Locators, interlinked by
hypertext links. Search engines are used to retrieve the
information from the web. Data overburden is the most
concerning issue in these days for the existing system.
Evolution of web includes the web versions of web 1.0, 2.0
etc. In this series, the web version 3.0 is referred to as semantic
web [1] is evolved as a knowledge management support across
the globe. Search engines should be enriched with semantic
web capabilities that analyze webpage content and provide
more relevant results corresponding to the user query.
Semantic web standards include resource description
framework (RDF), web ontology, RDF Schema and rule
interchange format (RIF) for handling data. Resource
description framework (RDF) provides a conceptual
description of information for representing the web resources
like Turtle syntax, N-Triples etc. Resource Description
Framework (RDF) describes data on the Web in graph form
[2]. Ontologies consist of the finite set of terms, relationships,
constraints and axioms [3]. Ontologies have proven to be
useful for effective knowledge modeling and information
retrieval. The remaining paper is arranged as follows: In
Section 2 the related work is presented. The proposed work
and its methodology are discussed in Section 3 & 4. The
results are presented in Section 5. Conclusions are given in
Section 6.
II. RELATED WORK
M.S.P.Babu et.al [4] provided the overview of some of the
semantic search engines that yield unique search experience
for users. Wilkinson et.al [5] proposed an information retrieval
system using document structure. Amel Grissa Touzi et.al [6]
suggested the Fuzzy Ontology of Data mining (FODM) for
processing automated generation of ontologies in the domain
of data mining. Amira Aloui et.al [7] implemented a plugin
named “FO-FQ Tab plug-in”, which can be integrated with
protégé editor for building the fuzzy ontologies from large
databases. To overcome the drawbacks of the existing system
for accessing the related science information, M.S.P.Babu et.al
[8] proposed a new framework for automatic generation of
ontology and RDF for real-time web data. Tahani Alsubait
et.al [9] developed the e-learning suite, with the set of
questions designed using ontological representation.
A.H.M.Rupasingha et.al [10] suggested that the performance
of the ontology generation is always dependent on the
specificity of the terms. Seongwook Youn et.al [11] discussed
pros and cons of tools like protégé 2000, OilEd, Apollo,
OntoLingua, Onto Edit, webODE, KAON, ICOM, DEO,
webOnto that is used for ontology creation.Kgotatso Desmond
Mogotlane et.al[12] presented a comparative study of plugins
of protege tool like DB2OWL and Data Master.
Sudeepthi Govathoti1, M.S. Prasasd Babu2
Research Scholar, Andhra University, Visakhapatnam &Assistant Professor, Anurag Group of Institutions, Hyderabad,
Professor, Department of CS&SE, Andhra University, Visakhapatnam,
sudeepthicse@cvsr.ac.in1
, profmspbabu@gmail.com2
An Implementation of a New Framework for
Automatic Generation of Ontology and RDF to
Real Time Web and Journal Data
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 16, No. 1, January 2018
89 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
2. III. PROPOSED WORK
Semantic web capabilities like RDF & ontology are applied
to enrich the knowledge. The proposed work is an
implementation of the framework proposed by the authors [8].
The framework is designed with reference to the semantic web
Stack. It is carried out in two phases, namely Data extraction
phase and Data representation phase. Web scraping is
performed using HTML parsing technique in data extraction
phase by giving sample search query as an input to multiple
repositories. DOM parsing and HTML parsing techniques are
applied to validate the data retrieved from multiple repositories
by considering only the document section of the webpage.
Extensible markup language (XML) is the base for the
semantic web representations; the validated information is
converted into semi-structured notations by using XSD
declaration from DOM tree and passed as an input for the next
layers of the proposed framework. XML notation is given as
an input to data representation phase. RDF notation is
generated and represented in graphical form using Graphviz
tool. A textual representation of RDF graph is provided using
Turtle, the Terse RDF Triple Language. Improved Inverted
Indexing technique is applied for ontological representation of
words by excluding the stop words.
Figure 1: Proposed framework
IV METHODOLOGY
Implementation of the framework proposed in Section III will
be carried out in two phases namely data extraction and data
representation phases. The details are given below.
Phase 1: Data extraction
Data extraction phase performs web scraping from multiple
repositories and stores the scraped data into the database. The
Data extraction phase is sub- divided into three steps namely
web scraping, data validation, XML Conversion. The scraped
data is further validated by removing the unwanted data in the
considering document section of web page. The data stored in
table format in the database, after the data validation process,
is converted into the Semi-Structured Notations (i.e. XML
Notations) and passed as an input to the data representation
phase.
Step 1: Web Scraping
Web scraping, also be referred as screen scraping or Web
harvesting, is used to fetch and extract the data from a web
document using HTML parsing techniques. Here Web pages
are crawled and the content of the Web page is extracted. The
data in the Web page includes three sections namely: Web
page statistics bar, document section and descriptive section.
The three items are stored in a database as three different
attributes in a database table. HTML parsing technique is used
for scraping data from the web documents is shown in Fig 2:
Figure 2: Content of the Web page
Step 2: Data Validation
In the Data Validation Step, the data collected from step 1 is
validated using HTML and DOM parsing techniques. Here
unwanted data is removed and the necessary portion of URLs
is retained. In this step web status bar and descriptive sections
are removed in the database table. The validated data is stored
in a database. Document section displays the results in the
form of page title, URL, Snippet (description) for the given
search query.
Figure 3: Content of the Web page after Data validation
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 16, No. 1, January 2018
90 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
3. Descriptive section provides the Wikipedia information
about the input query. The Data validation process is carried
out by considering only the document section of the web page
as shown in Fig: 3.
Step.3: XML Conversion
In XML Conversion Step the data, validated in step 2, is converted
into a DOM tree using XML Schema Definition (XSD). The
conversion is performed on data that is validated and stored in
database by considering each individual field/ attribute into
namespace convention. The XSD declaration of DOM tree
has hierarchical structures which have root node, representing
the search key word and three child nodes, representing Title,
URL and Description respectively. The XSD declaration of
the DOM tree with an example is shown in Fig 4:
Figure 4: XSD Declaration of DOM tree
Phase2: Data representation
Data extracted from steps 1,2 and 3 is maintained in an XML
format and is given as an input to data representation phase. In
Semantic Web architecture, the major source of data
representation imposes RDF-ization and Ontology generation.
Hence the data representation phase is sub- divided into two
steps namely RDF-ization and ontology generation, which are
explained in detail in step 4 & 5 respectively.
Step 4: RDF-ization
The Resource Description Framework (RDF) is the basic
building block in semantic web, promoting conceptual
modeling of web data [13]. The RDF-ization process is carried
out using Turtle notation and Graphviz tool. In this step the
XML notation data stored in extraction phase is given as an
input to RDF-ization. The RDF notation is visualized in the
form of RDF graph using Graphviz tool. Decomposition of
tuple creates a new blank node corresponding to the row and a
new triple set is obtained. Each tuple in a relational database is
decomposed as RDF triples, namely: the title is taken as
subject, URL is considered as predicate and description is
taken as object. A node can be a URI reference, literal or the
blank node. The graph in Fig: 5 is an example of RDF-ization
process of a semantic net.
Figure 5: RDF Triple
The triple is represented as a <subject, predicate, object>
format by exploring the relationship among the nodes [14].
The XML conversion carried out in step 3 is represented in
RDF syntax using Turtle notation from the convention
specified in “http://www.w3.org/1999/02/22-rdf-syntax-ns#“
as shown in Fig 6:
Fig-6: RDF-ization
The <rdf: Description> element provides the description of
resource identified by <rdf: about> attribute. The tags <rbss:
title>, <rbss: keyword>, <rbss: URL> are the properties of the
resource identified. The RDF represented in turtle notation is
visualized in a graphical format using Graphviz tool .It is open
source software that is used for generating graphs.
Step 5: Ontology Generation
Ontology is defined as a formal specification of
conceptualization of the domain of Interest. In ontology
generation step, the RDF notation obtained from step 4 is used
to create a vocabulary of words using improved inverted
indexing algorithm. Improved Inverted Indexing algorithm is
employed on real time web data collected from multiple
repositories and text documents. The words from the
description tags are extracted by excluding the stop words and
frequency count/Term frequency (TF) of each word is
maintained. The illustration of improved inverted indexing
algorithm is presented as follows:
Algorithm: Improved inverted indexing
Input: Database D= {T1, T2…Tn}, Storage Database
Output: Attributes {A1, A2…An}, where Ai, for i=1,2…n
are representing ontology vocabulary.
Parameters: Swrdk= Array of Stop Words
attsLq= Snippet attribute
Wordsk= Words stemmed from snippet attribute
attsLf= Word frequencies after stemming
attsL= ontology along with the frequencies count.
1. Swrdk={};
2. for i=0;i<=i+,i≤ D do
3. attsLq=Query Coverage(D,i);
4. Wordsk=Words Separate(attsLq,Swrdk);
5. attsLf= Words Usage frequency(D,attsLq,WordsK);
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 16, No. 1, January 2018
91 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
4. 6. attsL=attsLqU attsLf
7. f=highest_freq(attsL)
8. if (f<freq(attsL)) then
9. sort(Wordsk,freq(attsL))
10. end if
11.end for
12.return (Wordsk,freq(attsL)
V. RESULTS AND DISCUSSION
The Semantic web stack proposed by Tim Berners Lee [15] is
implemented by using the frame work proposed by the authors
in section III. It is implemented in PHP version: 5 (Open
Source scripting language) and MySQL version: 5 (open-
source relational database management system) environments.
It is tested on an input with test dataset comprising of sample
search keywords. Response time is the amount of time that
elapses from the receipt of the query until the results are
displayed to user. Response time can be measured on server
side or client side as shown in Fig 7.
Figure 7: Response Time
Throughput is defined as number of queries executed per
second (qps). Throughput and response time are observed for
the set of retrieval operations with respect to the page load
times. The performance of framework implemented with
respect to throughput is shown in Fig 8.
Figure 8: Throughput
.
A sample search query is given as an input and web scraping
results are shown in Fig 9:
Figure 9: Web scraping results
Web scraping performance is evaluated by considering the
following parameters like database size and count of URL’s
extracted which is shown in Fig 10:
Figure 10: Web scraping analysis
Scraped data from multiple repositories is given as an input to
data validation step. The validated data is obtained as an
output to data validation process by applying HTML and
DOM parsing technique. Data validation considers only the
document section of a web page. Data validation results are
shown in Fig 11:
Figure 11: Data validation results
The performance of data validation processing with wanted
and unwanted data from the scraped data by considering the
following parameters like database size and count of URL’s is
shown in Fig 12:
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 16, No. 1, January 2018
92 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
5. Figure12: Data Validation
The validated data stored in database is converted into the
XML notations by applying XSD declaration as shown in
Fig 13:
Figure13: XML Conversion
Resource Description Framework (RDF) is a recommended
standard of World Wide Web Consortium (W3C) [16]. RDF
representation of data in turtle form is shown in Fig 14:
Figure14: RDF-ization results Turtle form
RDF generation for sample relation named “testrdf” which has
an attributes as <name, description, freq> is considered. The
“testrdf” represents the relation name is considered as a class
in an RDF graph and has set of three nodes that are connecting
the testrdf in depth wise manner represents the tuple of a
relation as shown in Fig 15:
Figure15: RDF Graph generation
Ontology generation for the data obtained from multiple
repositories as well as the text file. Improved inverted
indexing technique is applied for extracting the words with
their frequencies discarding the stop words, in the order of
highest precedence. The result of ontology generation for real
time web data with frequency is shown in Fig 16:
Figure16: Ontology Generation for Real time web data
The highest frequency word is considered as a frequent search
term for the purpose of rule framing using description logic.
The rule mapping is done for the efficient retrieval operation
which will be future work. The result of ontology generation
for text document is shown in Fig 17.
Figure17: Ontology Generation for text document
VI. CONCLUSION
The evolution of web has taken many forms namely web 1.0,
web 2.0, web 3.0 , web 4.0 which lead to high-end
information retrieval systems using semantic web. The existing
traditional system collects the data from search engines is
exhibiting average performance in retrieval. Implementation of
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 16, No. 1, January 2018
93 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
6. proposed framework for automatic generation of ontology’s
and RDF improves the performance of traditional search
engines by incorporating semantic capabilities. It includes the
application of HTML parsing technique, DOM parsing
techniques and Turtle notation of graphviz tool. The algorithm
improves information retrieval in Semantic Web and Expert
Systems. The future work includes applying efficient
cryptography for securing database and rule framing for the
design of an expert system.
REFERENCES
[1] Sareh Aghaei, Mohammad Ali Nematbakhsh and Hadi Khosravi
Farsani, “Evolution Of The World Wide Web: From Web 1.0 To
Web 4.0”, International Journal of Web & Semantic Technology
(IJWesT) Vol.3, No.1, January 2012.
[2] Abdeslem DENNAI, Sidi Mohammed BENSLIMANE,"Semantic
Indexing of Web Documents Based on Domain Ontology", I.J.
Information Technology and Computer Science, 2015, 02, 1-11.
[3] Seema Redekar, Vishal Chekkala, Siddhapa Gouda, Swapnil
Yalgude,"Web Search Engine Using Ontology Learning",International
Journal of Innovative Research in Computer and Communication
Engineering,Vol. 5, Issue 3, March 2017.
[4] G Sudeepthi, G Anuradha, M Surendra Prasad Babu,” A survey on
semantic web search engine” International Journal of Computer
Science Issues, 2012/3 IJCSI, Volume 9 Issue 2 Pages 241-245.
[5] R. Wilkinson, ‟Effective retrieval of structured documents‟. (S.-V. New
York, Ed.) Pages 311 – 317, 1994.
[6] Amel Grissa Touzi, Hela Ben Massoud and Alaya Ayadi,” Automatic
Ontology Generation for Data Mining Using FCA and Clustering”,
arxiv.org, no. 1311.1764.
[7] Amira Aloui, ENIT, Tunis, Tunisia,”A Fuzzy Ontology-Based Platform
for Flexible Querying”, International Journal of Service Science,
Management, Engineering, and Technology, Vol.6, Issue 3, July-
September 2015, pp 12-26.
[8] Prof. M Surendra Prasad Babu, Sudeepthi Govathoti,”A Semantic Model
for Building Integrated Ontology Databases”, 7th
IEEE International
conference on software engineering and service science.
[9] Tahani Alsubait, Bijan Parsia, Ulrike Sattler,”Ontology-Based Multiple
Choice Question Generation”, Kunstl Intell, Vol. 30, 2016, pp 183-188.
[10] Rupasingha A. H. M. Rupasingha, “Improving Web Service Clustering
through a Novel Ontology Generation Method by Domain Specificity “,
IEEE 24th International Conference on Web Services, 2017.
[11] Seongwook Youn, Anchit Arora, Preetham Chandrasekhar, Paavany
Jayanty, Ashish Mestry and Shikha Sethi,”Survey about Ontology
Development Tools for Ontology-based Knowledge Management”.
[12] Kgotatso Desmond Mogotlane, Jean Vincent Fonou-
Dombeu,”Automatic Conversion of Relational Databases into
Ontologies”.
[13] FatemeAbiri, Mohsen Kahani, FataneZarinkalam"An Entity Based RDF
Indexing Schema Using Hadoop and HBase", 2011.
[14] Faizan Shaikh, Usman A. Siddiqui, IramShahzadi, SyedUami, Zubair A.
Shaik "SWISE: Semantic Web based Intelligent Search Engine", 2010.
[15] Gopal Pandey,” The Semantic Web: An Introduction and Issues”,
International Journal of Engineering Research and Applications, Vol. 2,
Issue 1,Jan-Feb 2012, pp.780-786.
[16] Resource Description Framework (RDF). Model and Syntax
Specification. Technical
report, W3C.
Prof. M.S.Prasad Babu was born on 12 -
08-1956 in Andhra Pradesh, India. He
obtained his bachelors to doctoral degrees
from Andhra University. He has 39 years of
teaching and research experience. He guided
12 PhD's and 210 PG students for their
thesis. He was the president of ICT section
of ISCA in 2006-07. He attended about 50
National and International Conferences in
India and abroad and presented keynotes. He contributed about
250 papers National and International journals and
conferences. He developed 10 international reputed Web
portals. He won ISCA Young Scientist Award in 1986, State
Best teacher award for engineering in 2015 and Dr. Sarvepalli
Radhakrishnan Best Academician Award of Andhra
University for 2014. He was the Conference Steering Chair
for eight IEEE ICSESS of Beijing, China from 2010 to 2017.
Prof Babu is presently working as Vice Principal of AU Engg
College, Chairman, Faculty of Engineering of Andhra
University and senior Professor in Computer Science &
Systems Engineering discipline.
G. SUDEEPTHI was born in,
Rajahmundry, Andhra Pradesh, India in
1983. She received B. TECH from National
Institute of Technology Warangal, India in
2005 and M. Tech in Computer Science &
Engineering from KLC Vijayawada, India
in 2007. She is working as part time
research scholar in the Dept. of CS & SE, Andhra University
and as a Assistant Professor at Anurag Group of Institutions
Hyderabad, India. She has got more than ten years of teaching
experience. She Qualified FET-2011 Conducted by JNTUH.
She contributed seven research papers in International journals
and Presented one research paper at International conference in
Beijing, China and another in Warangal, India.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 16, No. 1, January 2018
94 https://sites.google.com/site/ijcsis/
ISSN 1947-5500