Anzeige
An Implementation of a New Framework for Automatic Generation of Ontology and RDF to Real-Time Web and Journal Data
An Implementation of a New Framework for Automatic Generation of Ontology and RDF to Real-Time Web and Journal Data
An Implementation of a New Framework for Automatic Generation of Ontology and RDF to Real-Time Web and Journal Data
An Implementation of a New Framework for Automatic Generation of Ontology and RDF to Real-Time Web and Journal Data
Anzeige
An Implementation of a New Framework for Automatic Generation of Ontology and RDF to Real-Time Web and Journal Data
An Implementation of a New Framework for Automatic Generation of Ontology and RDF to Real-Time Web and Journal Data
Nächste SlideShare
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...
Wird geladen in ... 3
1 von 6
Anzeige

Más contenido relacionado

Similar a An Implementation of a New Framework for Automatic Generation of Ontology and RDF to Real-Time Web and Journal Data(20)

Anzeige

An Implementation of a New Framework for Automatic Generation of Ontology and RDF to Real-Time Web and Journal Data

  1.  Abstract— Information on the web is tremendously increasing in recent years with the faster rate. This massive or voluminous data has driven intricate problems for information retrieval and knowledge management. As the data resides in a web with several forms, the Knowledge management in the web is a challenging task. Here the novel 'Semantic Web' concept may be used for understanding the web contents by the machine to offer intelligent services in an efficient way with a meaningful knowledge representation. The data retrieval in the traditional web source is focused on 'page ranking' techniques, whereas in the semantic web the data retrieval processes are based on the ‘concept based learning'. The proposed work is aimed at the development of a new framework for automatic generation of ontology and RDF to some real time Web data, extracted from multiple repositories by tracing their URI’s and Text Documents. Improved inverted indexing technique is applied for ontology generation and turtle notation is used for RDF notation. A program is written for validating the extracted data from multiple repositories by removing unwanted data and considering only the document section of the web page. Index Terms— Semantic Web, Resource description framework, Ontology, Improved inverted indexing technique, Knowledge management. I. INTRODUCTION World Wide Web (WWW) is considered as a global information repository that identifies documents and other web resources by Uniform Resource Locators, interlinked by hypertext links. Search engines are used to retrieve the information from the web. Data overburden is the most concerning issue in these days for the existing system. Evolution of web includes the web versions of web 1.0, 2.0 etc. In this series, the web version 3.0 is referred to as semantic web [1] is evolved as a knowledge management support across the globe. Search engines should be enriched with semantic web capabilities that analyze webpage content and provide more relevant results corresponding to the user query. Semantic web standards include resource description framework (RDF), web ontology, RDF Schema and rule interchange format (RIF) for handling data. Resource description framework (RDF) provides a conceptual description of information for representing the web resources like Turtle syntax, N-Triples etc. Resource Description Framework (RDF) describes data on the Web in graph form [2]. Ontologies consist of the finite set of terms, relationships, constraints and axioms [3]. Ontologies have proven to be useful for effective knowledge modeling and information retrieval. The remaining paper is arranged as follows: In Section 2 the related work is presented. The proposed work and its methodology are discussed in Section 3 & 4. The results are presented in Section 5. Conclusions are given in Section 6. II. RELATED WORK M.S.P.Babu et.al [4] provided the overview of some of the semantic search engines that yield unique search experience for users. Wilkinson et.al [5] proposed an information retrieval system using document structure. Amel Grissa Touzi et.al [6] suggested the Fuzzy Ontology of Data mining (FODM) for processing automated generation of ontologies in the domain of data mining. Amira Aloui et.al [7] implemented a plugin named “FO-FQ Tab plug-in”, which can be integrated with protégé editor for building the fuzzy ontologies from large databases. To overcome the drawbacks of the existing system for accessing the related science information, M.S.P.Babu et.al [8] proposed a new framework for automatic generation of ontology and RDF for real-time web data. Tahani Alsubait et.al [9] developed the e-learning suite, with the set of questions designed using ontological representation. A.H.M.Rupasingha et.al [10] suggested that the performance of the ontology generation is always dependent on the specificity of the terms. Seongwook Youn et.al [11] discussed pros and cons of tools like protégé 2000, OilEd, Apollo, OntoLingua, Onto Edit, webODE, KAON, ICOM, DEO, webOnto that is used for ontology creation.Kgotatso Desmond Mogotlane et.al[12] presented a comparative study of plugins of protege tool like DB2OWL and Data Master. Sudeepthi Govathoti1, M.S. Prasasd Babu2 Research Scholar, Andhra University, Visakhapatnam &Assistant Professor, Anurag Group of Institutions, Hyderabad, Professor, Department of CS&SE, Andhra University, Visakhapatnam, sudeepthicse@cvsr.ac.in1 , profmspbabu@gmail.com2 An Implementation of a New Framework for Automatic Generation of Ontology and RDF to Real Time Web and Journal Data International Journal of Computer Science and Information Security (IJCSIS), Vol. 16, No. 1, January 2018 89 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  2. III. PROPOSED WORK Semantic web capabilities like RDF & ontology are applied to enrich the knowledge. The proposed work is an implementation of the framework proposed by the authors [8]. The framework is designed with reference to the semantic web Stack. It is carried out in two phases, namely Data extraction phase and Data representation phase. Web scraping is performed using HTML parsing technique in data extraction phase by giving sample search query as an input to multiple repositories. DOM parsing and HTML parsing techniques are applied to validate the data retrieved from multiple repositories by considering only the document section of the webpage. Extensible markup language (XML) is the base for the semantic web representations; the validated information is converted into semi-structured notations by using XSD declaration from DOM tree and passed as an input for the next layers of the proposed framework. XML notation is given as an input to data representation phase. RDF notation is generated and represented in graphical form using Graphviz tool. A textual representation of RDF graph is provided using Turtle, the Terse RDF Triple Language. Improved Inverted Indexing technique is applied for ontological representation of words by excluding the stop words. Figure 1: Proposed framework IV METHODOLOGY Implementation of the framework proposed in Section III will be carried out in two phases namely data extraction and data representation phases. The details are given below. Phase 1: Data extraction Data extraction phase performs web scraping from multiple repositories and stores the scraped data into the database. The Data extraction phase is sub- divided into three steps namely web scraping, data validation, XML Conversion. The scraped data is further validated by removing the unwanted data in the considering document section of web page. The data stored in table format in the database, after the data validation process, is converted into the Semi-Structured Notations (i.e. XML Notations) and passed as an input to the data representation phase. Step 1: Web Scraping Web scraping, also be referred as screen scraping or Web harvesting, is used to fetch and extract the data from a web document using HTML parsing techniques. Here Web pages are crawled and the content of the Web page is extracted. The data in the Web page includes three sections namely: Web page statistics bar, document section and descriptive section. The three items are stored in a database as three different attributes in a database table. HTML parsing technique is used for scraping data from the web documents is shown in Fig 2: Figure 2: Content of the Web page Step 2: Data Validation In the Data Validation Step, the data collected from step 1 is validated using HTML and DOM parsing techniques. Here unwanted data is removed and the necessary portion of URLs is retained. In this step web status bar and descriptive sections are removed in the database table. The validated data is stored in a database. Document section displays the results in the form of page title, URL, Snippet (description) for the given search query. Figure 3: Content of the Web page after Data validation International Journal of Computer Science and Information Security (IJCSIS), Vol. 16, No. 1, January 2018 90 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  3. Descriptive section provides the Wikipedia information about the input query. The Data validation process is carried out by considering only the document section of the web page as shown in Fig: 3. Step.3: XML Conversion In XML Conversion Step the data, validated in step 2, is converted into a DOM tree using XML Schema Definition (XSD). The conversion is performed on data that is validated and stored in database by considering each individual field/ attribute into namespace convention. The XSD declaration of DOM tree has hierarchical structures which have root node, representing the search key word and three child nodes, representing Title, URL and Description respectively. The XSD declaration of the DOM tree with an example is shown in Fig 4: Figure 4: XSD Declaration of DOM tree Phase2: Data representation Data extracted from steps 1,2 and 3 is maintained in an XML format and is given as an input to data representation phase. In Semantic Web architecture, the major source of data representation imposes RDF-ization and Ontology generation. Hence the data representation phase is sub- divided into two steps namely RDF-ization and ontology generation, which are explained in detail in step 4 & 5 respectively. Step 4: RDF-ization The Resource Description Framework (RDF) is the basic building block in semantic web, promoting conceptual modeling of web data [13]. The RDF-ization process is carried out using Turtle notation and Graphviz tool. In this step the XML notation data stored in extraction phase is given as an input to RDF-ization. The RDF notation is visualized in the form of RDF graph using Graphviz tool. Decomposition of tuple creates a new blank node corresponding to the row and a new triple set is obtained. Each tuple in a relational database is decomposed as RDF triples, namely: the title is taken as subject, URL is considered as predicate and description is taken as object. A node can be a URI reference, literal or the blank node. The graph in Fig: 5 is an example of RDF-ization process of a semantic net. Figure 5: RDF Triple The triple is represented as a <subject, predicate, object> format by exploring the relationship among the nodes [14]. The XML conversion carried out in step 3 is represented in RDF syntax using Turtle notation from the convention specified in “http://www.w3.org/1999/02/22-rdf-syntax-ns#“ as shown in Fig 6: Fig-6: RDF-ization The <rdf: Description> element provides the description of resource identified by <rdf: about> attribute. The tags <rbss: title>, <rbss: keyword>, <rbss: URL> are the properties of the resource identified. The RDF represented in turtle notation is visualized in a graphical format using Graphviz tool .It is open source software that is used for generating graphs. Step 5: Ontology Generation Ontology is defined as a formal specification of conceptualization of the domain of Interest. In ontology generation step, the RDF notation obtained from step 4 is used to create a vocabulary of words using improved inverted indexing algorithm. Improved Inverted Indexing algorithm is employed on real time web data collected from multiple repositories and text documents. The words from the description tags are extracted by excluding the stop words and frequency count/Term frequency (TF) of each word is maintained. The illustration of improved inverted indexing algorithm is presented as follows: Algorithm: Improved inverted indexing Input: Database D= {T1, T2…Tn}, Storage Database Output: Attributes {A1, A2…An}, where Ai, for i=1,2…n are representing ontology vocabulary. Parameters: Swrdk= Array of Stop Words attsLq= Snippet attribute Wordsk= Words stemmed from snippet attribute attsLf= Word frequencies after stemming attsL= ontology along with the frequencies count. 1. Swrdk={}; 2. for i=0;i<=i+,i≤ D do 3. attsLq=Query Coverage(D,i); 4. Wordsk=Words Separate(attsLq,Swrdk); 5. attsLf= Words Usage frequency(D,attsLq,WordsK); International Journal of Computer Science and Information Security (IJCSIS), Vol. 16, No. 1, January 2018 91 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  4. 6. attsL=attsLqU attsLf 7. f=highest_freq(attsL) 8. if (f<freq(attsL)) then 9. sort(Wordsk,freq(attsL)) 10. end if 11.end for 12.return (Wordsk,freq(attsL) V. RESULTS AND DISCUSSION The Semantic web stack proposed by Tim Berners Lee [15] is implemented by using the frame work proposed by the authors in section III. It is implemented in PHP version: 5 (Open Source scripting language) and MySQL version: 5 (open- source relational database management system) environments. It is tested on an input with test dataset comprising of sample search keywords. Response time is the amount of time that elapses from the receipt of the query until the results are displayed to user. Response time can be measured on server side or client side as shown in Fig 7. Figure 7: Response Time Throughput is defined as number of queries executed per second (qps). Throughput and response time are observed for the set of retrieval operations with respect to the page load times. The performance of framework implemented with respect to throughput is shown in Fig 8. Figure 8: Throughput . A sample search query is given as an input and web scraping results are shown in Fig 9: Figure 9: Web scraping results Web scraping performance is evaluated by considering the following parameters like database size and count of URL’s extracted which is shown in Fig 10: Figure 10: Web scraping analysis Scraped data from multiple repositories is given as an input to data validation step. The validated data is obtained as an output to data validation process by applying HTML and DOM parsing technique. Data validation considers only the document section of a web page. Data validation results are shown in Fig 11: Figure 11: Data validation results The performance of data validation processing with wanted and unwanted data from the scraped data by considering the following parameters like database size and count of URL’s is shown in Fig 12: International Journal of Computer Science and Information Security (IJCSIS), Vol. 16, No. 1, January 2018 92 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  5. Figure12: Data Validation The validated data stored in database is converted into the XML notations by applying XSD declaration as shown in Fig 13: Figure13: XML Conversion Resource Description Framework (RDF) is a recommended standard of World Wide Web Consortium (W3C) [16]. RDF representation of data in turtle form is shown in Fig 14: Figure14: RDF-ization results Turtle form RDF generation for sample relation named “testrdf” which has an attributes as <name, description, freq> is considered. The “testrdf” represents the relation name is considered as a class in an RDF graph and has set of three nodes that are connecting the testrdf in depth wise manner represents the tuple of a relation as shown in Fig 15: Figure15: RDF Graph generation Ontology generation for the data obtained from multiple repositories as well as the text file. Improved inverted indexing technique is applied for extracting the words with their frequencies discarding the stop words, in the order of highest precedence. The result of ontology generation for real time web data with frequency is shown in Fig 16: Figure16: Ontology Generation for Real time web data The highest frequency word is considered as a frequent search term for the purpose of rule framing using description logic. The rule mapping is done for the efficient retrieval operation which will be future work. The result of ontology generation for text document is shown in Fig 17. Figure17: Ontology Generation for text document VI. CONCLUSION The evolution of web has taken many forms namely web 1.0, web 2.0, web 3.0 , web 4.0 which lead to high-end information retrieval systems using semantic web. The existing traditional system collects the data from search engines is exhibiting average performance in retrieval. Implementation of International Journal of Computer Science and Information Security (IJCSIS), Vol. 16, No. 1, January 2018 93 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  6. proposed framework for automatic generation of ontology’s and RDF improves the performance of traditional search engines by incorporating semantic capabilities. It includes the application of HTML parsing technique, DOM parsing techniques and Turtle notation of graphviz tool. The algorithm improves information retrieval in Semantic Web and Expert Systems. The future work includes applying efficient cryptography for securing database and rule framing for the design of an expert system. REFERENCES [1] Sareh Aghaei, Mohammad Ali Nematbakhsh and Hadi Khosravi Farsani, “Evolution Of The World Wide Web: From Web 1.0 To Web 4.0”, International Journal of Web & Semantic Technology (IJWesT) Vol.3, No.1, January 2012. [2] Abdeslem DENNAI, Sidi Mohammed BENSLIMANE,"Semantic Indexing of Web Documents Based on Domain Ontology", I.J. Information Technology and Computer Science, 2015, 02, 1-11. [3] Seema Redekar, Vishal Chekkala, Siddhapa Gouda, Swapnil Yalgude,"Web Search Engine Using Ontology Learning",International Journal of Innovative Research in Computer and Communication Engineering,Vol. 5, Issue 3, March 2017. [4] G Sudeepthi, G Anuradha, M Surendra Prasad Babu,” A survey on semantic web search engine” International Journal of Computer Science Issues, 2012/3 IJCSI, Volume 9 Issue 2 Pages 241-245. [5] R. Wilkinson, ‟Effective retrieval of structured documents‟. (S.-V. New York, Ed.) Pages 311 – 317, 1994. [6] Amel Grissa Touzi, Hela Ben Massoud and Alaya Ayadi,” Automatic Ontology Generation for Data Mining Using FCA and Clustering”, arxiv.org, no. 1311.1764. [7] Amira Aloui, ENIT, Tunis, Tunisia,”A Fuzzy Ontology-Based Platform for Flexible Querying”, International Journal of Service Science, Management, Engineering, and Technology, Vol.6, Issue 3, July- September 2015, pp 12-26. [8] Prof. M Surendra Prasad Babu, Sudeepthi Govathoti,”A Semantic Model for Building Integrated Ontology Databases”, 7th IEEE International conference on software engineering and service science. [9] Tahani Alsubait, Bijan Parsia, Ulrike Sattler,”Ontology-Based Multiple Choice Question Generation”, Kunstl Intell, Vol. 30, 2016, pp 183-188. [10] Rupasingha A. H. M. Rupasingha, “Improving Web Service Clustering through a Novel Ontology Generation Method by Domain Specificity “, IEEE 24th International Conference on Web Services, 2017. [11] Seongwook Youn, Anchit Arora, Preetham Chandrasekhar, Paavany Jayanty, Ashish Mestry and Shikha Sethi,”Survey about Ontology Development Tools for Ontology-based Knowledge Management”. [12] Kgotatso Desmond Mogotlane, Jean Vincent Fonou- Dombeu,”Automatic Conversion of Relational Databases into Ontologies”. [13] FatemeAbiri, Mohsen Kahani, FataneZarinkalam"An Entity Based RDF Indexing Schema Using Hadoop and HBase", 2011. [14] Faizan Shaikh, Usman A. Siddiqui, IramShahzadi, SyedUami, Zubair A. Shaik "SWISE: Semantic Web based Intelligent Search Engine", 2010. [15] Gopal Pandey,” The Semantic Web: An Introduction and Issues”, International Journal of Engineering Research and Applications, Vol. 2, Issue 1,Jan-Feb 2012, pp.780-786. [16] Resource Description Framework (RDF). Model and Syntax Specification. Technical report, W3C. Prof. M.S.Prasad Babu was born on 12 - 08-1956 in Andhra Pradesh, India. He obtained his bachelors to doctoral degrees from Andhra University. He has 39 years of teaching and research experience. He guided 12 PhD's and 210 PG students for their thesis. He was the president of ICT section of ISCA in 2006-07. He attended about 50 National and International Conferences in India and abroad and presented keynotes. He contributed about 250 papers National and International journals and conferences. He developed 10 international reputed Web portals. He won ISCA Young Scientist Award in 1986, State Best teacher award for engineering in 2015 and Dr. Sarvepalli Radhakrishnan Best Academician Award of Andhra University for 2014. He was the Conference Steering Chair for eight IEEE ICSESS of Beijing, China from 2010 to 2017. Prof Babu is presently working as Vice Principal of AU Engg College, Chairman, Faculty of Engineering of Andhra University and senior Professor in Computer Science & Systems Engineering discipline. G. SUDEEPTHI was born in, Rajahmundry, Andhra Pradesh, India in 1983. She received B. TECH from National Institute of Technology Warangal, India in 2005 and M. Tech in Computer Science & Engineering from KLC Vijayawada, India in 2007. She is working as part time research scholar in the Dept. of CS & SE, Andhra University and as a Assistant Professor at Anurag Group of Institutions Hyderabad, India. She has got more than ten years of teaching experience. She Qualified FET-2011 Conducted by JNTUH. She contributed seven research papers in International journals and Presented one research paper at International conference in Beijing, China and another in Warangal, India. International Journal of Computer Science and Information Security (IJCSIS), Vol. 16, No. 1, January 2018 94 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
Anzeige