2. Tudhope
Semantically Indexed Hypermedia: Linking Information Disciplines
reasoning. Different types of indexing system are possible. It is useful to categorise indexing systems according to
three dimensions [van Rijsbergen 1979]:
1. whether index terms are automatically derived or manually assigned.
2. whether index terms belong to a controlled vocabulary or are uncontrolled ('free').
3. whether terms can be combined as ordered strings representing a single concept when indexing (pre-coordinated terms), e.g. "Association for Computing Machinery", or must be post-coordinated on retrieval. The latter allows the possibility of 'false positives', where items are returned although the query terms have no connection with each other in the source string.
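The difference between pre- and post-coordination in the third dimension can be sketched as follows. This is a minimal illustration only: the two documents, their index terms and the function names are invented for the example and do not come from any system discussed here.

```python
# Hypothetical mini-collection: each document is indexed in two ways.
docs = {
    "d1": {
        "precoordinated": {"computing machinery"},
        "single_terms": {"computing", "machinery"},
    },
    "d2": {
        # About farm machinery and, separately, cloud computing.
        "precoordinated": {"farm machinery", "cloud computing"},
        "single_terms": {"farm", "machinery", "cloud", "computing"},
    },
}

def post_coordinated_search(query_terms):
    """Return documents whose single terms include every query term (Boolean AND)."""
    wanted = set(query_terms)
    return sorted(d for d, idx in docs.items() if wanted <= idx["single_terms"])

def pre_coordinated_search(phrase):
    """Return documents indexed with the exact ordered string."""
    return sorted(d for d, idx in docs.items() if phrase in idx["precoordinated"])

print(post_coordinated_search(["computing", "machinery"]))  # ['d1', 'd2']
print(pre_coordinated_search("computing machinery"))        # ['d1']
```

Here d2 is a false positive for the post-coordinated query: both terms occur in its index, but they were never connected in the source.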
Information Retrieval (IR) has tended towards automatically generated free text index terms (post-coordinated),
weighted by statistical frequency of terms in documents and collections. On the other hand, distinguishing features
of a semantic index are that semantic relationships exist between controlled index terms, usually (but not
necessarily) the result of manual cataloguing. Semantically indexed hypermedia links are, by definition, computed,
corresponding to Intensional-Retrieval links [DeRose 1989]. This allows the possibility of flexible query-based
navigation tools.
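For contrast with a semantic index, the statistical weighting typical of IR can be sketched with a basic tf-idf calculation. This is one common formulation among several, not a scheme prescribed by the text; the three toy documents and the function name are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection.
DOCS = {
    "d1": "semantic index of museum hypermedia",
    "d2": "statistical retrieval of text documents",
    "d3": "hypermedia navigation and retrieval",
}

def tf_idf(docs):
    """Weight each free-text term by its frequency in the document and its rarity in the collection."""
    tokenised = {d: text.split() for d, text in docs.items()}
    n = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in tokenised.values() for t in set(tokens))
    weights = {}
    for d, tokens in tokenised.items():
        tf = Counter(tokens)
        weights[d] = {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
    return weights

weights = tf_idf(DOCS)
```

In this collection "museum" occurs in one document and "hypermedia" in two, so "museum" receives the higher weight in d1. No relationship between terms is recorded, which is the gap a semantic index addresses.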
2 Thesauri and Classification Systems
The semantic index approach employs a set of semantic relationships between index terms, following the well
established thesaurus tradition in information science (ISO 2788, ISO 5964). A large number of thesauri exist,
covering a variety of subject domains, for example the Medical Subject Headings [MeSH 1999] and the Art and
Architecture Thesaurus [AAT 1999]. Classification systems, such as Dewey Decimal or Library of Congress, focus
on hierarchical relationships. These controlled vocabularies are part of standard cataloguing practice in libraries and
museums and are now being applied to digital hypertexts via thematic keywords in metadata resource descriptors.
For example, the Dublin Core [DC 1999] standard metadata set includes elements for Title, Creator, Date, Format,
etc. in addition to the more complex notion of the Subject (or theme) of a resource. Guidelines recommend that,
where possible, the Subject element be taken from a relevant controlled vocabulary. Links between concepts in the
subject domain can be expressed by the semantic relationships in a thesaurus. The three main thesaurus relationships
are Equivalence (equivalent terms), Hierarchical (broader/narrower terms), and Associative (more loosely Related
Terms). Sometimes specialisations of the three main relationships are included (for example distinguishing
taxonomic and instance hierarchical relationships). Following a minimalist approach to semantic modelling by
restricting the set of relationships permits interoperability of cataloguing/retrieval tools and techniques. It also
facilitates automated reasoning over this core set of relationships.
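The three main thesaurus relationships could be held in a structure along the following lines. This is a sketch under assumptions: the class and method names are hypothetical, the sample terms are invented rather than taken from MeSH or the AAT, and real thesaurus software will differ.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    label: str
    use_for: set = field(default_factory=set)   # Equivalence: non-preferred synonyms
    broader: set = field(default_factory=set)   # Hierarchical: broader terms
    narrower: set = field(default_factory=set)  # Hierarchical: narrower terms
    related: set = field(default_factory=set)   # Associative: related terms

class Thesaurus:
    def __init__(self):
        self.terms = {}

    def term(self, label):
        """Fetch a term, creating it on first use."""
        return self.terms.setdefault(label, Term(label))

    def add_equivalent(self, preferred, synonym):
        self.term(preferred).use_for.add(synonym)

    def add_hierarchy(self, broader, narrower):
        self.term(broader).narrower.add(narrower)
        self.term(narrower).broader.add(broader)

    def add_related(self, a, b):
        self.term(a).related.add(b)
        self.term(b).related.add(a)

    def ancestors(self, label):
        """All broader terms reachable by following the hierarchy transitively."""
        seen, stack = set(), [label]
        while stack:
            for b in self.term(stack.pop()).broader:
                if b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

t = Thesaurus()
t.add_hierarchy("furniture", "seating")
t.add_hierarchy("seating", "chairs")
t.add_equivalent("chairs", "seats")
t.add_related("chairs", "tables")
print(sorted(t.ancestors("chairs")))  # ['furniture', 'seating']
```

Keeping to this small core of relationship types is what makes a generic traversal such as `ancestors` possible without knowledge of the subject domain.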
3 Using semantic index links
Navigation is provided indirectly by queries to the semantic index space, as opposed to directly following explicit
links between information items. The queries can be simple or complex. The conventional hypermedia navigation
techniques may be implemented by relatively simple queries [Tudhope 1994], although there would be no particular
reason to use a semantic index to achieve that functionality. One additional possibility provided by a semantic index
space is an organised set of browsable concept descriptors, as a means of comprehending the associated layer of
media items [Bruza 1990], [Pollard 1993]. The user can browse the index space, 'beam down' to view media items of
interest, and conversely 'beam up' to the index space from media items. Additionally, when index terms are
combined, the user may browse around each term, broadening and narrowing the specificity of description and
seeing the effect on likely 'hits' [Pollitt 1997]. Alternatively, the combined terms can be considered as locating a
position in a 'hyperindex', permitting a string of terms to be broadened or narrowed in one navigation action [Bruza
1990]. If a user enters a set of query terms as opposed to browsing the index space, equivalence relationships permit
a broad entry vocabulary of synonyms to be tied together for retrieval purposes, without the user having to specify
ACM Computing Surveys, Vol. 31, Number 4es, December 1999
the exact term employed for indexing. As a simple example, this document is indexed by a set of controlled
vocabulary terms from the ACM Computing Classification [ACM 1998] (see Categories and Subject Descriptors
above). In the ACM Digital Library pages, explicit hypertext links can be navigated. In addition, controlled
vocabulary index terms can be combined with free text terms when searching the library and the hypertext version
of the classification can be browsed as a subject index in order to select terms for searching.
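The role of equivalence relationships as an entry vocabulary can be illustrated with a short sketch. The synonym table, index contents and function names below are hypothetical, chosen only to show query terms being resolved to preferred terms before matching.

```python
# Hypothetical entry vocabulary: non-preferred term -> preferred index term.
ENTRY_VOCABULARY = {
    "hypertext": "hypermedia",
    "hyper-media": "hypermedia",
    "thesauri": "thesaurus",
}

# Hypothetical index: preferred term -> identifiers of indexed items.
INDEX = {
    "hypermedia": {"item1", "item3"},
    "thesaurus": {"item2", "item3"},
}

def normalise(term):
    """Map a user's term to the preferred term, if the entry vocabulary knows it."""
    t = term.strip().lower()
    return ENTRY_VOCABULARY.get(t, t)

def search(query_terms):
    """Items indexed by every query term after synonym resolution."""
    postings = [INDEX.get(normalise(t), set()) for t in query_terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search(["Hypertext", "thesauri"]))  # ['item3']
```

The user never needs to know that the indexer chose "hypermedia" and "thesaurus" as the preferred forms.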
Beyond this, the inclusion of semantic information in the index space provides the opportunity for knowledge-based
hypermedia systems that provide intelligent navigation support and retrieval, with the system taking a more active
role in the navigation process than relying on manual browsing alone. For example, rules governing permitted
combinations of terms can filter a user's possible navigation options [Arents 1993], [Rada 1993]. Work at the
University of Glamorgan explores the potential of reasoning over the semantic relationships in the index space.
Traversal of transitive relationships makes possible imprecise matching between query and media item, or between
two media items, rather than relying on an exact match of controlled vocabulary terms [Tudhope 1997]. Expanding
terms offers an augmented browsing capacity based on measures of distance in the semantic index space. Results
can be post-processed for expression in a particular retrieval tool. Various possibilities exist for indirect computed
links with such hybrid query/navigation tools [Cunliffe 1997]. For example, information items with semantically
close terms can be ranked in the result or destination set, or the system might automatically suggest terms to be
considered for inclusion in a query. If facets exist for time and place in the index space, then a result set can be
returned as a dynamic guided tour based on temporal or spatial relationships (or indeed other orderings).
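Term expansion by distance in the index space might be sketched as a shortest-path traversal over the thesaurus relationships. The sketch below is not the measure used in the cited work: the traversal costs, the fragment of the index and the item data are all assumed values for illustration.

```python
import heapq

# Hypothetical traversal costs per relationship type (assumed values).
COST = {"narrower": 1.0, "broader": 1.5, "related": 2.0}

# Hypothetical fragment of a semantic index: term -> [(relationship, term)].
LINKS = {
    "chairs": [("broader", "seating"), ("narrower", "armchairs"), ("related", "tables")],
    "seating": [("narrower", "chairs"), ("narrower", "benches")],
    "armchairs": [("broader", "chairs")],
    "benches": [("broader", "seating")],
    "tables": [("related", "chairs")],
}

# Hypothetical media items and their index terms.
ITEMS = {"photo1": {"armchairs"}, "photo2": {"benches"}, "photo3": {"lamps"}}

def expand(term, max_distance):
    """Return {term: distance} for every term within max_distance of `term`."""
    best = {term: 0.0}
    queue = [(0.0, term)]
    while queue:
        dist, current = heapq.heappop(queue)
        if dist > best.get(current, float("inf")):
            continue  # stale queue entry
        for rel, neighbour in LINKS.get(current, []):
            new = dist + COST[rel]
            if new <= max_distance and new < best.get(neighbour, float("inf")):
                best[neighbour] = new
                heapq.heappush(queue, (new, neighbour))
    return best

def rank(term, max_distance):
    """Rank items by the distance of their closest index term to the query term."""
    near = expand(term, max_distance)
    scored = []
    for item, terms in ITEMS.items():
        dists = [near[t] for t in terms if t in near]
        if dists:
            scored.append((min(dists), item))
    return [item for _, item in sorted(scored)]

print(rank("chairs", 2.5))  # ['photo1', 'photo2']
```

Neither photo is indexed with "chairs" itself, yet both are returned and ordered by semantic closeness; raising or lowering `max_distance` broadens or narrows the match.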
Alternatively, the focus of a user's navigation can remain in the document (media) space, typically requiring less
cognitive overhead than constructing a formal query [Marchionini 1995]. In this case, having found an information
item of interest, the navigation action consists of requesting "More items like this one", with the system responsible
for a (best-match) similarity measure of the item's index terms. At the cost of greater cognitive demand on the user,
the source context for the navigation may be modified and particular media items or terms (de)emphasised (cf.
relevance feedback techniques in IR).
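A "More items like this one" action could rest on a best-match comparison of index term sets, roughly as below. The closeness values, hierarchy and items are hypothetical, and the averaging scheme is only one of many possible similarity measures.

```python
# Hypothetical hierarchy: term -> its broader term.
BROADER = {"armchairs": "chairs", "chairs": "seating", "benches": "seating"}

# Hypothetical media items and their index terms.
ITEMS = {
    "a": {"chairs", "oak"},
    "b": {"armchairs", "oak"},
    "c": {"benches", "pine"},
    "d": {"chairs", "oak", "carving"},
}

def closeness(a, b):
    """1.0 for identical terms, 0.5 for a direct broader/narrower pair, else 0.0 (assumed values)."""
    if a == b:
        return 1.0
    if BROADER.get(a) == b or BROADER.get(b) == a:
        return 0.5
    return 0.0

def similarity(source_terms, target_terms):
    """Average, over the source terms, of the best match among the target terms."""
    if not source_terms or not target_terms:
        return 0.0
    total = sum(max(closeness(s, t) for t in target_terms) for s in source_terms)
    return total / len(source_terms)

def more_like_this(item_id, limit=3):
    """Other items ranked by similarity to the chosen item; zero scores are dropped."""
    source = ITEMS[item_id]
    scored = [(similarity(source, terms), other)
              for other, terms in ITEMS.items() if other != item_id]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in scored if score > 0][:limit]

print(more_like_this("a"))  # ['d', 'b']
```

Emphasising or de-emphasising particular terms, in the manner of relevance feedback, would amount to weighting the individual contributions inside `similarity`.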
4 Key application to RDF and the WWW
Semantically based retrieval underpins diverse efforts to provide access to distributed multimedia resources, such as
the many projects involving SGML (XML) and Z39.50 for networked access to cross-platform information. Major
efforts are underway to create subject-based gateways to Internet resources, sometimes combining manually indexed
and robot harvested metadata. The W3C Recommendation for a 'machine-understandable' Resource Description
Framework supports the thrust of this research [Lassila 1999]. An RDF descriptor might include the Dublin Core
element, Subject, specifying a classification or thesaurus to which keywords belong. Precise semantic index retrieval
tools will be required to provide a manageable set of results to requests that may span several collections [Doerr
1997], and may involve networked terminology servers and more than one thesaurus or classification. One point
worth emphasising is the social dimension to access and the link with existing cataloguing practice. Controlled
vocabularies are often the result of standards efforts in subject domains, continue to evolve, and are part of a
network of practice and education/training in the information science community. They have the potential to act as a
bridge between information provider and seeker, "a semantic road map for searchers and indexers" [Soergel 1995],
if tools can be devised that visualise their structure and how they may be used.
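An RDF descriptor carrying Dublin Core Subject values together with the vocabulary they come from might be generated as in the sketch below. Several points are assumptions: the Dublin Core namespace shown is the later 1.1 element namespace, the `ex:scheme` property and its namespace are hypothetical stand-ins for whatever mechanism a given application uses to name the vocabulary, and the resource URL, title, creator and classification codes are purely illustrative.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"   # assumption: Dublin Core 1.1 element namespace
EX = "http://example.org/terms#"          # hypothetical namespace for the scheme property

for prefix, uri in (("rdf", RDF), ("dc", DC), ("ex", EX)):
    ET.register_namespace(prefix, uri)

def describe(resource, title, creator, scheme, subjects):
    """Build an RDF/XML descriptor whose Subject values name their vocabulary."""
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": resource})
    ET.SubElement(desc, f"{{{DC}}}title").text = title
    ET.SubElement(desc, f"{{{DC}}}creator").text = creator
    for code in subjects:
        # Each subject is a structured value: the keyword plus its scheme.
        subject = ET.SubElement(desc, f"{{{DC}}}subject")
        node = ET.SubElement(subject, f"{{{RDF}}}Description")
        ET.SubElement(node, f"{{{RDF}}}value").text = code
        ET.SubElement(node, f"{{{EX}}}scheme").text = scheme
    return ET.tostring(root, encoding="unicode")

xml_text = describe(
    "http://example.org/doc",            # illustrative resource
    "An example document",
    "A. Author",
    "ACM Computing Classification",
    ["H.3.1", "H.5.4"],                  # illustrative classification codes
)
print(xml_text)
```

A retrieval tool reading such descriptors can tell which thesaurus or classification a keyword belongs to before attempting any semantic matching across collections.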
5 Research issues
A number of key issues for research remain if the potential of significant gains in precision of information access is
to be realised.
• An advantage of building query functionality into hypertext navigation is a smooth transition between
querying and browsing. Can we identify the appropriate extent of cognitive effort demanded by interfaces
to navigation tools? How far should the internal workings of matching functions or the detail of the
underlying semantic network be brought to the surface?
• Some applications may lend themselves to the specialisation of the standard thesaurus relationships into
richer sets, particularly the associative relationship. For example, in some situations it may be useful to
distinguish various kinds of causal relationships from the generic associative relationship.
• The problem of expressing similarity between pre-coordinated strings of semantic index terms needs
further investigation. How much should be pre-computed and what can be left to dynamic computation?
How best can we express syntax or structure in such strings? This effort converges with work on
description logic ontologies [Bullock 1998], [Weinstein 1998].
• Various efforts attempt to combine statistical IR and semantic controlled vocabulary approaches. For
example, Agosti et al. [Agosti 1995] propose a three-layer architecture for Hypermedia IR systems
combining a statistical index layer and a semantic (thesaurus) layer (see also [Aslandogan 1997],
[Chiaramella 1996]). Studies of online searching behaviour have investigated conditions influencing choice
of free text or controlled vocabulary terms (e.g. [Fidel 1991]). How should the two approaches best be
integrated: should they be seen as different components of a toolkit, or should a matching function
incorporate both statistical weighting and semantic measures? In addition, indirect semantic links and
explicit authored links will soon be combined in link/search engines. What principles should guide this
integration?
• The semantic interoperability of overlapping but different thesauri is an important issue for remote access
to distributed sets of resources employing controlled vocabularies in metadata. A concept may exist in one
vocabulary but not another, or may map (partially) to various concepts.
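The interoperability problem in the last issue can be made concrete with a small sketch of translating query terms from one thesaurus to another. The mapping table, match types and weights are all hypothetical; they illustrate only that a concept may map exactly, map partially to several concepts, or have no counterpart.

```python
# Hypothetical mapping table from thesaurus A to thesaurus B.
# Each source concept maps to zero or more target concepts with a match type.
MAPPINGS = {
    "settees": [("sofas", "exact")],
    "seating": [("chairs", "narrower"), ("benches", "narrower")],  # B lacks the general term
    "hypocausts": [],                                              # no counterpart in B
}

# Assumed weights expressing how far each match type is trusted.
WEIGHT = {"exact": 1.0, "broader": 0.7, "narrower": 0.7, "related": 0.4}

def translate(query_terms):
    """Translate thesaurus-A terms into weighted thesaurus-B terms and report gaps."""
    translated, unmapped = {}, []
    for term in query_terms:
        targets = MAPPINGS.get(term, [])
        if not targets:
            unmapped.append(term)
        for target, kind in targets:
            # Keep the strongest evidence if a target is reached more than once.
            translated[target] = max(translated.get(target, 0.0), WEIGHT[kind])
    return translated, unmapped

print(translate(["settees", "seating", "hypocausts"]))
# ({'sofas': 1.0, 'chairs': 0.7, 'benches': 0.7}, ['hypocausts'])
```

Reporting the unmapped terms, rather than silently dropping them, lets a retrieval tool tell the user where a remote collection's vocabulary cannot express the request.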
References
[AAT 1999] Art and Architecture Thesaurus Browser, [Online: http://shiva.pub.getty.edu/aat_browser/], 1999.
[ACM 1998] ACM Computing Classification, [Online: http://www.acm.org/class/1998/], 1998.
[Agosti 1995] Maristella Agosti, Massimo Melucci, and Fabio Crestani. "Automatic Authoring and Construction of
Hypermedia for Information Retrieval" in ACM Multimedia Systems, 3(1), 15-24, 1995.
[Arents 1993] Hans C. Arents and Walter F. L. Bogaerts. "Navigation without Links and Nodes without Contents:
Intensional Navigation in a Third-Order Hypermedia System" in Hypermedia, 5(3), 187-204, 1993.
[Aslandogan 1997] Y. Alp Aslandogan, Chuck Thier, Clement T. Yu, Jon Zou, and Naphtali Rishe. "Using
Semantic Contents and WordNet in Image Retrieval" in Proceedings of ACM SIGIR '97, 286-295, 1997.
[Berners-Lee 1998a] Tim Berners-Lee. World Wide Web Design Issues: A Roadmap to the Semantic Web,
[Online: http://www.w3.org/DesignIssues/Semantic.html], 1998.
[Bruza 1990] Peter Bruza. "Hyperindices: A Novel Aid for Searching in Hypermedia" in Proceedings of the ACM
European Conference on Hypertext '90 (ECHT '90), Versailles, France, 109-122, November 1990.
[Bullock 1998] Joseph Bullock and Carole Goble. "TourisT: The Application of a Description Logic based
Semantic Hypermedia System for Tourism" in Proceedings of ACM Hypertext '98, Pittsburgh PA, 132-141, June
1998.
[Chiaramella 1996] Yves Chiaramella and Ammar Kheirbek. "An Integrated Model for Hypermedia and
Information Retrieval" in Information Retrieval and Hypertext, Maristella Agosti and Alan Smeaton (editors),
Kluwer, 139-178, 1996.
[Collier 1987] George Collier. "Thoth-II: Hypertext with Explicit Semantics" in Proceedings of ACM Hypertext
'87, Chapel Hill, NC, 269-289, November 1987.
[Cunliffe 1997] Daniel Cunliffe, Carl Taylor, and Douglas Tudhope. "Query-based Navigation in Semantically
Indexed Hypermedia" in Proceedings of ACM Hypertext '97, Southampton, UK, 87-95, April 1997.
[DC 1999] Dublin Core. [Online: http://purl.org/metadata/dublin_core], 1999.
[DeRose 1989] Steven J. DeRose. "Expanding the Notion of Links" in Proceedings of ACM Hypertext '89,
Pittsburgh, PA, 249-257, November 1989.
[Doerr 1997] Martin Doerr, Irene Fundulaki and Vassilis Christophidis. "The Specialist Seeks Expert Views:
Managing Digital Folders in the AQUARELLE Project" in Proceedings of Museums and the Web, David Bearman
and Jennifer Trant (editors), 261-270, 1997.
[Fidel 1991] Raya Fidel. "Searchers' Selection of Search Keys (I-III)" in Journal of the American Society for
Information Science, 42(7), 490-527, 1991.
[Frisse 1989] Mark E. Frisse and Steven B. Cousins. "Information retrieval from hypertext: Update on the Dynamic
Medical Handbook" in Proceedings of ACM Hypertext '89, Pittsburgh, PA, 199-211, November 1989.
[Lassila 1999] Ora Lassila and Ralph Swick (editors), "Resource Description Framework (RDF) Model and Syntax
Specification" World Wide Web Consortium Recommendation, [Online: http://www.w3.org/TR/REC-rdf-syntax/],
February 22 1999.
[Marchionini 1995] Gary Marchionini. Information Seeking in Electronic Environments. Cambridge University
Press, 1995.
[MeSH 1999] Medical Subject Headings homepage, [Online: http://www.nlm.nih.gov/mesh/meshhome.html], 1999.
[Nanard 1991] Jocelyne Nanard and Marc Nanard. "Using structured types to incorporate knowledge in hypertext"
in Proceedings of ACM Hypertext '91, San Antonio, TX, 329-344, December 1991.
[Pollard 1993] Richard Pollard. "A hypertext-based thesaurus as a subject browsing aid for bibliographic databases"
in Information Processing and Management, 29(3), 345-357, 1993.
[Pollitt 1997] Steven Pollitt, Martin P. Smith, and Patrick A. J. Braekevelt. "View-based Searching Systems" in
Proceedings of Joint Workshop of BCS IR and HCI Specialist Groups, (Johnson and Dunlop eds.), 73-77, 1997.
[Rada 1993] Roy Rada, Weigang Wang, Alex Birchall. "Retrieval hierarchies in hypertext" in Information
Processing and Management 29(3), 359-371, 1993.
[Schnase 1993] John L. Schnase, John J. Leggett, David L. Hicks, and Ron L. Szabo. "Semantic Data Modeling of
Hypermedia Associations" in ACM Transactions on Information Systems (TOIS), 11(1), 27-49, January 1993.
[Soergel 1995] Dagobert Soergel. "The Art and Architecture Thesaurus (AAT): a critical appraisal" in Visual
Resources, 10(4), 369-400, 1995.
[Trigg 1986] Randall H. Trigg and Mark Weiser. "Textnet: A Network-based Approach to Text Handling" in ACM
Transactions on Office Information Systems (TOIS), 4(1), 1-23, January 1986.
[Tudhope 1994] Douglas Tudhope, Paul Beynon-Davies, Carl Taylor, and Chris B. Jones. "Virtual Architecture
Based on a Binary Relational Model: A Museum Hypermedia Application" in Hypermedia, 6(3), 174-192, 1994.
[Tudhope 1997] Douglas Tudhope and Carl Taylor. "Navigation via Similarity: Automatic Linking Based on
Semantic Closeness" in Information Processing and Management, 33(2), 233-242, 1997.
[van Rijsbergen 1979] C. J. "Keith" van Rijsbergen. Information Retrieval. Butterworth, 1979.
[Weinstein 1998] Peter C. Weinstein. "Ontology-based metadata: transforming the MARC legacy" in Proceedings
of ACM Digital Libraries '98, 254-263, 1998.