This document provides an overview of nature-inspired methods that have been used in the Semantic Web for tasks like information retrieval, extraction, clustering, and personalization. It discusses how genetic algorithms, neural networks, fuzzy logic, and rough sets have helped with problems in these areas by modeling complex relationships and uncertainty. The document also describes approaches for representing uncertainty in ontologies, including using Bayesian networks to quantify overlap between concepts.
Nature-Inspired Methods for the Semantic Web
Claudiu Mihăilă and Magdalena Jitcă
Faculty of Computer Science,
"Al. I. Cuza" University of Iași,
16, G-ral Berthelot Street,
700483 Iași, Romania
{claudiu.mihaila, magdalena.jitca}@info.uaic.ro
Abstract. Recently, significant research efforts have been made towards
uncertainty representation and reasoning in ontologies for the Semantic
Web. This work reports on contributions that use methods inspired by
nature in multiple Semantic Web domains, such as information retrieval
and extraction, clustering, and personalisation. Furthermore, it briefly
describes attempts at modelling uncertainty.
Key words: semantic Web, methods inspired from nature, soft computing,
Web mining, uncertainty modelling
1 Introduction
In the context of an ever-expanding World Wide Web (WWW), more than 100
million registered domains [1], over 25 billion indexed pages [2], and more than
one trillion unique URLs [3] have been reported. The variety of information
available on the web has led researchers in multiple directions, one of the
most important being related to the difference between human- and
machine-understandable information and another related to information
uncertainty. Until recently, Semantic Web models have included little explicit
information about uncertainty representation and processing because of
concerns about the scalability and computational complexity of such an
approach. Much research interest focusses on techniques for extracting
incomplete, partial or uncertain knowledge, as well as on handling uncertainty
when representing extracted information using ontologies.
This report provides an overview of the contributions to this research area
regarding the development or improvement of the currently available Semantic
Web tools and models by means of soft computing. It also presents the work
dealing with the representation of uncertain knowledge and reasoning in the
presence of uncertainty.
In the near future, semantic web systems are expected to integrate a consis-
tent set of the available soft computing techniques, including uncertainty repre-
sentations, statistical measures, fuzzy rules or belief networks for transmission
across the Web.
In the first part of the report, we describe the uses of nature-inspired methods
in the Web and then in the Semantic Web. In the second part, we describe
attempts at modelling uncertainty.
2 Current use of nature-inspired methods in the Web
Due to the vastness and diversity of the Web, it has become impossible to
create software which covers it completely and which correctly understands
the information it contains. The lack of structure and patterns and
the large amount of data have led researchers to develop nature-inspired
methodologies, which can often find good, near-optimal solutions to
NP-complete problems.
Methods inspired by nature are used in various Web domains. For example,
SnapAd.com¹ uses genetic algorithms to produce advertisements. This service
begins with a base population of ad variations and, by applying the genetic
algorithm, selects their best-performing characteristics in order to
create a better-performing advertisement.
Other works, such as [4, 5], use genetic algorithms to determine clusters of
similar users in social networks. The algorithms use fitness functions which
measure the number of intra- and inter-connections for groups, and variation
operators which considerably reduce the space of possible solutions.
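As a rough illustration of a fitness function of this kind, the sketch below scores a candidate grouping by counting the edges that fall inside groups and subtracting those that cross group boundaries. The toy graph, the name `community_fitness`, and the simple intra-minus-inter score are assumptions made for illustration; they are not the measures defined in [4, 5].

```python
def community_fitness(edges, assignment):
    """Score a grouping as (intra-group edges) - (inter-group edges).

    edges: iterable of (u, v) pairs.
    assignment: dict mapping each node to a group label.
    Hypothetical stand-in for the fitness functions of the cited works.
    """
    intra = sum(1 for u, v in edges if assignment[u] == assignment[v])
    inter = sum(1 for u, v in edges if assignment[u] != assignment[v])
    return intra - inter


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# A grouping that follows the triangles, and one that ignores them.
good = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
bad = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a", 5: "b"}

print(community_fitness(edges, good), community_fitness(edges, bad))
```

A genetic algorithm would evolve the `assignment` dictionaries, keeping those with a higher score.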
In addition, nature-inspired methods have been successfully used in search
engines [6], information retrieval [7], and question answering [8] systems.
3 Nature-inspired methods in the Semantic Web
Web mining is the area of data mining which deals with the analysis and
extraction of interesting knowledge from the World Wide Web. However, when
working with large amounts of mixed and poorly tagged information, which is
constantly changing, problems are very likely to arise. According to [9], the main
problems concern handling context-sensitive queries, summarisation, deduction,
personalisation and learning. Fig. 1 depicts the subtasks of web mining, which
are discussed below along with the problems they might raise.
Fig. 1. Web mining subtasks
¹ http://www.snapads.com/
Information retrieval. The issues which may occur during the task of
information retrieval (IR) are related to the uncertainty and accuracy of user
queries, as well as to the deduction and decision capabilities of the system.
Several fuzzy logic approaches that address the formulation of queries and the
relevance of the resulting documents with respect to the input query are
included in [9]. The results show that systems based on fuzzy Boolean IR models
would be most suitable for representing both the document contents and the
information needs.
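A minimal sketch of a fuzzy Boolean query follows. It assumes that each document stores a membership degree in [0, 1] for each index term and that the connectives are the standard min/max/complement operators; the document names, terms, and degrees are invented for illustration, and the models surveyed in [9] may use other operators.

```python
def fuzzy_and(*degrees):
    """Fuzzy conjunction: the weakest membership degree wins."""
    return min(degrees)


def fuzzy_or(*degrees):
    """Fuzzy disjunction: the strongest membership degree wins."""
    return max(degrees)


def fuzzy_not(degree):
    """Fuzzy negation: complement of the membership degree."""
    return 1.0 - degree


# Hypothetical index: document -> {term: membership degree}.
docs = {
    "d1": {"ontology": 0.9, "fuzzy": 0.2},
    "d2": {"ontology": 0.4, "fuzzy": 0.8},
    "d3": {"fuzzy": 0.6},
}


def score(doc):
    """Degree to which a document satisfies the query: ontology AND fuzzy."""
    return fuzzy_and(doc.get("ontology", 0.0), doc.get("fuzzy", 0.0))


# Rank documents by how well they satisfy the query.
ranking = sorted(docs, key=lambda name: score(docs[name]), reverse=True)
print(ranking)
```

Unlike a crisp Boolean model, every document receives a graded score, so partially matching documents can still be ranked.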
Artificial neural networks (ANN) also provide a convenient method of
knowledge representation for IR applications, as their learning ability eases
the task of implementing adaptive systems. The system [10] first encodes the
initial knowledge base, and then constantly refines it by means of the neural
networks. The advantage of this approach is that the correctness of the initial
information does not directly influence the output, as this information is
improved at each step by extracting rules from the knowledge-based NNs.
The genetic algorithms (GA) that have been used for this purpose assign
so-called relevance coefficients to the HTML tags, which are deduced from the
training text set. As regards the sub-task of query optimisation, GAs have been
used for reweighting the document indexing without having to expand the queries
[11].
A novel approach using evolutionary algorithms in a distributed environment
is reported in [12]. Their intention is to determine to which information sources
the queries should be sequentially sent. By combining a query sampling method
and an evolutionary method, the resource descriptions are retrieved and inte-
grated optimally. The process of ontological mediation with query-based sam-
pling is depicted in Fig. 2 [13]. While the crawlers sample the resource descrip-
tions of the information sources, the mediator conducts the process of ontological
mediation for the integration of the obtained ontologies into a single large one
[14].
Fig. 2. A whole process of ontological mediation with query-based sampling. [13]
Moreover, due to the fact that crawlers continue obtaining semantic informa-
tion from the sources, the ontologies evolve over time. This process is achieved
by employing a genetic algorithm within the mediator, which determines the
best mapping between the obtained semantic substructures and the estimated
local ontology. The results of the conducted experiments prove the scalability of
the entire contextual mediation.
Another technique that can be used to solve the task of approximate
information retrieval is rough sets (RS) theory [9], which assumes that the set
of relevant documents cannot be determined exactly and represents it by its
"upper" and "lower" approximations. The lower one corresponds to the most
specific set, which is definitely relevant to the searched item, and the upper
one refers to the most general set, which may possibly be relevant. This concept
can further be used to improve the efficiency of IR systems by implementing a
dynamic and focused search, based on the technique described above.
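The two approximations can be sketched as follows, assuming that documents have already been grouped into classes whose members are indistinguishable by their index terms. The function name `approximations` and the toy data are illustrative assumptions.

```python
def approximations(classes, target):
    """Return the (lower, upper) rough approximations of `target`.

    classes: iterable of sets partitioning the documents into groups
             that cannot be told apart by the available attributes.
    target:  the set of documents regarded as relevant.
    """
    target = set(target)
    lower, upper = set(), set()
    for cls in classes:
        cls = set(cls)
        if cls <= target:   # whole class is relevant: definitely relevant
            lower |= cls
        if cls & target:    # class touches the target: possibly relevant
            upper |= cls
    return lower, upper


# Hypothetical indiscernibility classes and relevant set.
classes = [{"d1", "d2"}, {"d3"}, {"d4", "d5"}]
relevant = {"d1", "d2", "d4"}

lower, upper = approximations(classes, relevant)
print(sorted(lower), sorted(upper))
```

Here d5 is not known to be relevant, but it cannot be distinguished from d4, so it appears only in the upper approximation.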
Information extraction. Information extraction (IE) is the task of identifying
specific fragments of a single document representing its core semantic content.
The most effective methods of IE to date involve working with
wrappers, procedures for extracting information from web resources. However,
they have the drawback of being particular to a certain resource, hence they
cannot be applied to every available web resource.
This performance can be improved by using NNs with a boosted wrapper
induction (BWI) technique [15]. By using the AdaBoost algorithm, BWI
repeatedly reweights the training examples so that subsequent patterns handle
training examples missed by previous rules. The results of the learning process
are comparable to the ones obtained with the HMM technique for learning and
then extracting the information [16].
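The reweighting step can be illustrated with the textbook AdaBoost update, in which the examples missed by the current rule gain weight and the rest lose weight. This is a generic sketch under that assumption; the function name `reweight` is hypothetical and the exact variant used by BWI in [15] may differ.

```python
import math


def reweight(weights, correct):
    """One textbook AdaBoost round.

    weights: current example weights, summing to 1.
    correct: booleans, True where the current rule handled the example.
    Returns the normalised new weights and the rule's vote alpha.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Missed examples are scaled up, handled examples are scaled down.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha


weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]   # the rule misses the last example

new_weights, alpha = reweight(weights, correct)
print(new_weights)
```

After one round the missed example carries half of the total weight, so the next pattern is pushed towards covering it.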
Another approach is that of Inductive Logic Programming [17], in which
logical rules are learned in order to identify phrases to be extracted from a
document [18].
Clustering. Clustering is important when dealing with web documents, as it
supports tasks such as measuring relevance or speed, obtaining
browsable summaries or working with overlapping data. However, there are still
some unresolved problems regarding efficient clustering arising from the nature
of web data itself. A fuzzy clustering technique for web log data mining, based
on an algorithm for clustering user sessions, is presented in [9]. It analyses the
structure of a certain website and the URLs in order to compute the
degree of similarity between two user sessions.
The ability of NNs to model complex nonlinear functions can also be used
for this task [9], for example in classifying web pages, as well as user patterns,
in both supervised and unsupervised manners.
Another soft computing method used for document clustering is RS theory,
within which variable precision and tolerance relations are significant for this
task. In particular, rough mereology has been used for mining multimedia
objects, as well as web graphs or semantic structures [19].
An evolutionary approach for the conceptual clustering of semantic
knowledge bases is presented in [20]. The method can be applied to
multi-relational knowledge bases to exploit effectively and, most importantly,
language-independently a semi-distance dissimilarity measure defined for the
space of individual resources. Such clusterings of semantically annotated
resources present a high degree of interest due to their ability to define new
emerging concepts (concept formation), which can induce new concept
definitions or a refinement of existing ones (ontology evolution). The
evolutionary algorithm developed in [20], which extends distance-based
clustering procedures employing medoids as cluster prototypes, remains stable
across multiple repetitions, converging towards clusterings of comparable
quality with generally the same number of clusters, and avoiding being caught
in local minima. Furthermore, the work could be extended in order to create
hierarchies of clusters of specific granularity.
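The role of medoids as cluster prototypes can be sketched as follows. The code only shows the two basic steps of a distance-based procedure (choosing a medoid and assigning items to the nearest medoid); it is not the evolutionary algorithm of [20]. The integer items and the absolute-difference dissimilarity are placeholders for individual resources and the semi-distance measure.

```python
def medoid(cluster, dissimilarity):
    """Return the member with the smallest total dissimilarity to the rest."""
    return min(cluster,
               key=lambda c: sum(dissimilarity(c, other) for other in cluster))


def assign(items, medoids, dissimilarity):
    """Group each item with its nearest medoid."""
    clusters = {m: [] for m in medoids}
    for item in items:
        nearest = min(medoids, key=lambda m: dissimilarity(item, m))
        clusters[nearest].append(item)
    return clusters


def distance(a, b):
    """Placeholder dissimilarity between two toy items."""
    return abs(a - b)


items = [1, 2, 3, 10, 11, 12]
clusters = assign(items, [2, 11], distance)
print(clusters)
print(medoid([1, 2, 3], distance))
```

Because a medoid is always an actual member of the cluster, the procedure only needs pairwise dissimilarities, which is why it suits resources for which no meaningful average can be computed.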
Personalisation. Personalisation involves using technology to accommodate
the differences between individuals, but in this context it refers to the fact that
the retrieved content and the search results should match users' preferences
and interests. The most effective way of learning user profiles is by using
training data collected from several users or systems. "Syskill and Webert", an
agent which learns user profiles using the Bayesian classifier, is introduced in
[9]. As an extension, it can be used to determine whether the users would be
interested in a similar page. This decision is made by analysing the HTML source
of a page, which requires the page to have been retrieved beforehand.
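A user profile of this kind can be sketched with a small naive Bayes classifier over the words that occur in rated pages. The "hot"/"cold" labels, the toy pages, and the function names are illustrative assumptions, and the sketch does not reproduce the feature selection of the agent described in [9].

```python
import math
from collections import Counter


def train(pages):
    """pages: list of (set_of_words, label). Returns the model parameters."""
    labels = Counter(label for _, label in pages)
    vocab = set().union(*(words for words, _ in pages))
    counts = {label: Counter() for label in labels}
    for words, label in pages:
        counts[label].update(words)   # number of pages containing each word
    return labels, vocab, counts


def predict(model, words):
    """Return the most probable label for a page given its set of words."""
    labels, vocab, counts = model
    total = sum(labels.values())
    best, best_score = None, -math.inf
    for label, n in labels.items():
        score = math.log(n / total)                    # class prior
        for w in vocab:
            p = (counts[label][w] + 1) / (n + 2)       # Laplace smoothing
            score += math.log(p if w in words else 1.0 - p)
        if score > best_score:
            best, best_score = label, score
    return best


# Hypothetical pages rated by one user.
pages = [
    ({"ontology", "owl"}, "hot"),
    ({"ontology", "reasoning"}, "hot"),
    ({"football", "scores"}, "cold"),
    ({"football", "transfer"}, "cold"),
]
model = train(pages)
print(predict(model, {"owl", "reasoning"}), predict(model, {"football"}))
```

The words of a new page must be available before it can be scored, which mirrors the retrieval prerequisite mentioned above.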
An improved way of obtaining quality and useful ”aggregate user profiles”
from patterns is given in [21]. This approach relies on two techniques involving
clustering of both user transactions and page views with the purpose of obtaining
the overlapping aggregate profiles, which can later be used by recommender
systems for real-time personalisation.
3.1 Uncertainty modelling
The issue of uncertainty on the Semantic Web is still a challenging research field,
as this domain deals with imprecise information from different applications, each
with its special knowledge representation needs (e.g., multimedia processing,
face recognition, GPS systems). To deal with uncertainty in the Semantic Web
and its applications, many researchers have proposed extending OWL and the
Description Logic (DL) formalisms with special mathematical frameworks.
A probabilistic method based on Bayesian networks (BN) is proposed in [22]
to represent and compute the overlap in concept hierarchies. The overlap between
a pair of concepts (selected vs. referred) is a numeric value in the [0, 1] range
and indicates how well a data item matches the query concept. It approaches
0 in the case of disjoint concepts and 1 when the referred concept is subsumed by
the selected one. Based upon the possible relations between concepts, a graph
notation has been used for representing the degree of overlap in the concept
hierarchy. The goal of this approach is to represent the overlap between concepts
from a taxonomic structure without requiring from the user any prior knowledge of
probability theory or BNs.
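One set-based reading that is consistent with the two boundary cases above takes the overlap to be the share of the referred concept that is covered by the selected one. The sketch below assumes that each concept is given as an explicit set of instances; this is an assumption made for illustration, since the method in [22] computes the value with a Bayesian network rather than by enumerating instances.

```python
def overlap(selected, referred):
    """Share of the referred concept that lies inside the selected concept.

    Both concepts are given as sets of (hypothetical) instances.
    Returns a value in [0, 1].
    """
    selected, referred = set(selected), set(referred)
    if not referred:
        raise ValueError("the referred concept must be non-empty")
    return len(selected & referred) / len(referred)


# Toy concepts represented by instance identifiers.
europe = {"fi", "se", "ro", "ru_west"}
asia = {"jp", "cn"}
nordic = {"fi", "se"}
russia = {"ru_west", "ru_east"}

print(overlap(europe, asia), overlap(europe, nordic), overlap(europe, russia))
```

The measure is not symmetric: `overlap(nordic, europe)` differs from `overlap(europe, nordic)`, which reflects the distinction between the selected and the referred concept.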
A probabilistic framework for modelling uncertainty in semantic web
ontologies based on Bayesian networks has been developed in [23]. Its goal is to
convert any OWL ontology into a BN by using probabilistic extensions to
description logics. The translated BN is semantically consistent with the original
ontology and satisfies all the given probabilistic constraints. The drawback of
this approach is that the probabilistic information must be added to the
ontology by the human modeller and this task requires knowledge of probability
theory. This framework, called BayesOWL, is currently at version 1.0, and it is
available for download² as a Java extension.
More recently, a World Wide Web Consortium (W3C) Incubator Group on
Uncertainty Reasoning for the World Wide Web was created in order to describe
situations where uncertainty reasoning would significantly improve information
extraction, to identify methodologies which can be applied to these cases, and to
develop a standardised representation of uncertainty [24]. The most commonly
used approaches to uncertainty for the WWW that the group identified are
probabilistic theories (e.g., BN), fuzzy logic, and belief functions. After
analysing 16 use cases, the group developed an uncertainty ontology and
concluded that the uncertainty came either from data or from reasoning.
4 Conclusions
In this report, we have summarised the achievements obtained using soft computing
methodologies in the context of the Semantic Web and briefly described their
principles. We have then briefly introduced uncertainty modelling and given
an overview of some approaches.
Many important aspects still remain open for future research. Specifically,
there is a need for scalable formalisms to support uncertainty and vagueness in
ontology languages, and implementations of these formalisms.
References
1. DomainTools, LLC: Domain Counts & Internet Statistics.
http://www.domaintools.com/internet-statistics/ Accessed 10 January 2010.
2. de Kunder, M.: The size of the World Wide Web.
http://www.worldwidewebsize.com/ Accessed 10 January 2010.
3. Alpert, J., Hajaj, N.: We knew the web was big...
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html (25 July
2008) Accessed 10 January 2010.
4. Pizzuti, C.: Community detection in social networks with genetic algorithms. In:
GECCO ’08: Proceedings of the 10th annual conference on Genetic and evolution-
ary computation, New York, NY, USA, ACM (2008) pp. 1137–1138
5. Lipczak, M., Milios, E.: Agglomerative genetic algorithm for clustering in social
networks. In: GECCO ’09: Proceedings of the 11th Annual conference on Genetic
and evolutionary computation, New York, NY, USA, ACM (2009) pp. 1243–1250
² http://www.csee.umbc.edu/~ypeng/BayesOWL/
6. Picarougne, F., Monmarché, N., Oliver, A., Venturini, G.: Geniminer: Web mining
with a genetic-based (2002)
7. Xu, Y., Deli, Y., Yu, L.: Efficient annealing-inspired genetic algorithm for
information retrieval from web-document. In: GEC ’09: Proceedings of the first
ACM/SIGEVO Summit on Genetic and Evolutionary Computation, New York,
NY, USA, ACM (2009) pp. 1017–1020
8. Figueroa, A.G., Neumann, G.: Genetic algorithms for data-driven web question
answering. Evolutionary Computation 16(1) (2008) pp. 89–125
9. Pal, S.K., Talwar, V., Mitra, P.: Web mining in soft
computing framework: Relevance, state of the art and future directions. IEEE
Transactions on Neural Networks 13 (2002) pp. 1163–1177
10. Shavlik, J., Towell, G.G.: Knowledge-based artificial neural networks. Artificial
Intelligence 70(1/2) (1994) pp. 119–165
11. Yang, J.J., Korfhage, R.R.: Query modification using genetic algorithms in vector
space models. International Journal of Expert Systems 7(2) (1994) pp. 165–191
12. Jung, J.J.: An evolutionary approach to query-sampling for heterogeneous systems.
Expert Systems with Applications 37(1) (2010) pp. 226–232
13. Jung, J.J.: Ontological framework based on contextual mediation for collaborative
information retrieval. Information Retrieval 10(1) (2007) pp. 85–109
14. Noy, N.F., Musen, M.A.: Prompt: Algorithm and tool for automated ontology
merging and alignment. In: Proceedings of the Seventeenth National Conference
on Artificial Intelligence and Twelfth Conference on Innovative Applications of
Artificial Intelligence, AAAI Press / The MIT Press (2000) pp. 450–455
15. Freitag, D., Kushmerick, N.: Boosted wrapper induction. In: Proceedings of the
Seventeenth National Conference on Artificial Intelligence and Twelfth Conference
on Innovative Applications of Artificial Intelligence, AAAI Press / The MIT Press
(2000) pp. 577–583
16. Bikel, D.M., Schwartz, R., Weischedel, R.M.: An algorithm that learns what's in
a name. Machine Learning 34(1-3) (1999) pp. 211–231
17. Muggleton, S., ed.: Inductive Logic Programming. Academic Press, New York,
NY (1992)
18. Freitag, D.: Toward general-purpose learning for information extraction. In: Pro-
ceedings of the 17th international conference on Computational linguistics, Mor-
ristown, NJ, USA, Association for Computational Linguistics (1998) pp. 404–408
19. Polkowski, L., Skowron, A.: Rough mereology: A new paradigm for approximate
reasoning. Int. J. Approx. Reasoning 15(4) (1996) pp. 333–365
20. Fanizzi, N., d’Amato, C., Esposito, F.: Evolutionary conceptual clustering based
on induced pseudo-metrics. International Journal on Semantic Web & Information
Systems 4(3) (2008) pp. 44–67
21. Mobasher, B., Dai, H., Luo, T., Nakagawa, M.: Discovery and evaluation of aggre-
gate usage profiles for web personalization. Data Min. Knowl. Discov. 6(1) (2002)
pp. 61–82
22. Holi, M., Hyvönen, E.: Modeling uncertainty in semantic web taxonomies.
Springer-Verlag, Berlin (2006)
23. Ding, Z., Peng, Y.: A probabilistic extension to ontology language OWL. In: HICSS
’04: Proceedings of the 37th Annual Hawaii International Conference
on System Sciences (HICSS’04) - Track 4, Washington, DC, USA, IEEE
Computer Society (2004) p. 40111.1
24. W3C Incubator Group Report: Uncertainty Reasoning for the World Wide Web.
http://www.w3.org/2005/Incubator/urw3/XGR-urw3/ (31 March 2008) Accessed
10 January 2010.