This document summarizes the Knowledge Engineering efforts for the TELDAP digital library project. It discusses (1) developing metadata models for different types of digital objects, including a union catalog model and models for websites and documents; (2) establishing hyperlinks between objects and keywords to connect related resources; and (3) constructing ontologies and thesauri like Getty AAT and Chinese WordNet to link keywords and establish implicit relationships between objects. The goal is to optimize access, retrieval and understanding of the large and growing collection of digital content.
1. Knowledge Engineering for
TELDAP
Keh-Jiann Chen
Principal Investigator
Core Platforms for Digital Contents Project, TELDAP
Research Fellow
Research Center for Information Technology Innovation &
Institute of Information Science, Academia Sinica
2. Outline
Introduction
Union catalog
Databases and metadata for digital
contents and websites
Knowledge engineering
Future perspective
3. Introduction
The integration and management of digital
contents has become an important issue as
the amount of digital contents produced from
different projects and institutions increases
rapidly.
Our project goal is to achieve optimized
preservation, retrieval, and presentation of
digital collections.
5. What is the union catalog?
It is a catalog and portal for all digital collections of
TELDAP.
It is an integrated platform for browsing and searching
entire digital contents of TELDAP.
Metadata provides core descriptions and licensing
information of each digital collection.
8. Metadata models for different types of
objects
Archived digital items
Union catalog metadata model: Dublin Core+
Web sites
DCCAP (Dublin Core Collections Application Profile)
Fields for internal use only
Unique Identifier, Format, Evaluation, Cataloging History
Documents
Document metadata: Dublin Core
9. Metadata for digital items
Over 2 million digital items and still increasing
Element               Definition
Title                 A name given to the resource
Creator               An entity primarily responsible for making the content of the resource
Subject and Keywords  The topic of the content of the resource
Description           An account of the content of the resource
Publisher             An entity responsible for making the resource available
Contributor           An entity responsible for making contributions to the content of the resource
Date                  A date associated with an event in the life cycle of the resource
Resource Type         The nature or genre of the content of the resource
Format                The physical or digital manifestation of the resource
Resource Identifier   An unambiguous reference to the resource within a given context
Source                A reference to a resource from which the present resource is derived
Language              A language of the intellectual content of the resource
Relation              A reference to a related resource
Coverage              The extent or scope of the content of the resource
Rights Management     Information about rights held in and over the resource
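As a rough illustration of how a record built on the fifteen Dublin Core elements listed above might be represented, here is a minimal Python sketch. The `make_record` helper and the sample values are hypothetical and are not taken from the TELDAP union catalog; only the element names come from Dublin Core.

```python
# The fifteen Dublin Core elements, written as lowercase element names.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def make_record(**fields):
    """Build a record, rejecting keys that are not Dublin Core elements.

    Hypothetical helper: every element is treated as optional and
    repeatable, so each value is stored as a list.
    """
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return {k: v if isinstance(v, list) else [v] for k, v in fields.items()}


# Sample values are invented for illustration only.
record = make_record(
    title="Example painting",
    creator="Unknown",
    subject=["painting", "landscape"],
    identifier="example-0001",
)
```

Storing every value as a list keeps single-valued and multi-valued elements uniform, which simplifies later keyword extraction over fields such as subject.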
11. Metadata for websites
Over 200 websites and still increasing
Metadata
DCCAP (Dublin Core Collections Application
Profile)
Combines the standard with our requirements:
19 data fields
12. Metadata for websites
The Website Homepage Picture
URL, Project Information
Type, Name, Author, Subject,
Description, Language,
Item Type, Target
Archived Information:
URL, time, authorization
Copyright, Purpose, Other Information
Figure: http://digitalarchives.tw
13. Dynamic categorization
User-oriented categorization
General, elementary school students, high school
students, researchers, etc.
Topical-based categorization
Archaeology, painting, animal, plant, document,
etc.
Functional-based categorization
Research, education, business, technology,…
Categorization based on institutions
Academia Sinica, Taiwan U., Palace museum,…
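The four categorizations above can be read as facets over the same set of website records. The following is a small sketch under that assumption; the field names (`audience`, `topic`, `function`, `institution`) and the sample sites are hypothetical and do not reflect the actual TELDAP schema.

```python
from collections import defaultdict

# Hypothetical website records; one field per categorization facet.
websites = [
    {"name": "Site A", "audience": ["researchers"], "topic": ["archaeology"],
     "function": ["research"], "institution": "Academia Sinica"},
    {"name": "Site B", "audience": ["elementary school students", "general"],
     "topic": ["animal", "plant"], "function": ["education"],
     "institution": "Taiwan U."},
    {"name": "Site C", "audience": ["researchers"], "topic": ["painting"],
     "function": ["research", "education"], "institution": "Palace museum"},
]


def categorize(sites, facet):
    """Group site names by the values of one facet, computed on demand."""
    groups = defaultdict(list)
    for site in sites:
        values = site.get(facet, [])
        if isinstance(values, str):  # single-valued facet such as institution
            values = [values]
        for value in values:
            groups[value].append(site["name"])
    return dict(groups)


by_audience = categorize(websites, "audience")
by_institution = categorize(websites, "institution")
```

Because the groups are derived from metadata at request time, a new categorization only requires a new facet field rather than a new fixed directory tree.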
14. Figure: http://digitalarchives.tw
Purpose: Education
Target: Elementary school student,
Junior high school student,
Teacher…
Select Items:
According to 40 evaluation
indicators, select top 5 websites
Purpose: Creative applications
Select Items:
According to 40 evaluation
indicators, select top 5 websites
Purpose: Academic research
Subject: Animal, Archaeology,
Anthropology…
Select Items:
According to 40 evaluation
indicators, select top 3 websites
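The selection rule on this slide (filter by purpose, target, or subject, then keep the best-rated sites) could be sketched as follows. The slide does not say how the 40 indicators are combined, so summing them is an assumption, and the records and field names are hypothetical.

```python
# Hypothetical records. "indicators" holds one score per evaluation
# indicator (the slide mentions 40; three are shown to keep this short).
SITES = [
    {"name": "Site A", "purpose": ["education"], "target": ["teacher"],
     "indicators": [5, 4, 3]},
    {"name": "Site B", "purpose": ["education", "academic research"],
     "target": ["elementary school student"], "indicators": [5, 5, 5]},
    {"name": "Site C", "purpose": ["academic research"],
     "subject": ["archaeology"], "indicators": [2, 4, 4]},
    {"name": "Site D", "purpose": ["education"], "target": ["teacher"],
     "indicators": [4, 4, 5]},
]


def select_top(sites, n, **criteria):
    """Return the n best-scoring sites that match every given criterion.

    Assumption: the overall score is the plain sum of indicator scores.
    """
    matches = [
        site for site in sites
        if all(value in site.get(field, []) for field, value in criteria.items())
    ]
    return sorted(matches, key=lambda s: sum(s["indicators"]), reverse=True)[:n]


top_education = [s["name"] for s in select_top(SITES, 2, purpose="education")]
top_research = [
    s["name"]
    for s in select_top(SITES, 3, purpose="academic research", subject="archaeology")
]
```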
15. Metadata for project documents
Over 5000 documents and still increasing
Metadata: Dublin Core
Construct Teldapwiki: a Wikipedia for TELDAP
http://wiki.teldap.tw/
17. Plans for building knowledge structures
for TELDAP
Construct metadata models for different objects.
Establish hyperlinks between contexts and
objects.
Develop keyword extraction tools.
Design automatic tagging tools.
Construct Teldap ontology and thesaurus
Art & Architecture Thesaurus by Getty
Chinese WordNet
18. (1) Metadata models for different objects
Digital collections
Union catalog metadata model: Dublin Core+
Web sites
DCCAP (Dublin Core Collections Application Profile)
Public fields
Private fields
Unique Identifier, Format, Evaluation, Cataloging History
Documents
Document metadata: Dublin Core
19. (2) Establish hyperlinks between contents and
objects
Identify keywords in contents
Tag keywords with related object hyperlinks
20. Develop hyperlink tagging tools
Word segmentation tools
Resolve word segmentation ambiguities and
identify keywords.
CKIP word segmentation system:
http://ckipsvr.iis.sinica.edu.tw/
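To show why segmentation is needed before keywords can be identified in Chinese text, here is a sketch of forward maximum matching, a simple dictionary-based baseline. It is not the method used by the CKIP system, which also resolves ambiguities that this greedy approach cannot; the dictionary below is a hypothetical toy example.

```python
def segment(text, dictionary):
    """Greedy forward maximum matching against a word dictionary.

    At each position the longest dictionary word is taken; characters
    that start no dictionary word are emitted as single-character tokens.
    """
    max_len = max(map(len, dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary, for illustration only.
toy_dictionary = {"數位", "典藏", "數位典藏", "計畫"}
tokens = segment("數位典藏計畫", toy_dictionary)
```

With this dictionary the longest entry wins, so the compound 數位典藏 is kept whole instead of being split into 數位 and 典藏.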
21. Develop hyperlink tagging tools
TELDAP keyword dictionary
Extract keywords from metadata and establish
object-keyword relations.
Extract text from XML data for each object
The text is classified by topic, title,
description, author, location, era, etc.
From each class of text, extract keywords
by automatic word segmentation and keyword
extraction techniques.
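The steps above (extract text per field from each object's XML, then record which keywords occur in which objects) can be sketched as below. The XML layout, element names, and keyword set are hypothetical, and whitespace tokenization stands in for real word segmentation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical XML export; the real TELDAP schema is not shown on the slides.
SAMPLE = """<objects>
  <object id="obj-1">
    <title>Bronze vessel</title>
    <description>A ritual bronze vessel from the Shang era</description>
    <era>Shang</era>
  </object>
  <object id="obj-2">
    <title>Landscape painting</title>
    <description>Ink painting of a mountain landscape</description>
    <era>Ming</era>
  </object>
</objects>"""

# Hypothetical keyword dictionary.
KEYWORDS = {"bronze", "vessel", "painting", "landscape", "shang", "ming"}


def build_relations(xml_text, keywords):
    """Map (field class, keyword) to the set of object ids containing it."""
    relations = defaultdict(set)
    for obj in ET.fromstring(xml_text).iter("object"):
        for field in obj:  # each child element is one class of text
            for token in (field.text or "").lower().split():
                if token in keywords:
                    relations[(field.tag, token)].add(obj.get("id"))
    return relations


relations = build_relations(SAMPLE, KEYWORDS)
```

Keeping the field class in the key preserves the distinction between, for example, a keyword found in a title and the same keyword found in a description.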
22. Prototype system for hyperlink tagger
Identify and select keywords from the input text
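A hyperlink tagger of this kind could, in its simplest form, look up dictionary keywords in the input text and wrap each match in a link. The sketch below assumes a plain keyword-to-URL mapping and prefers the longest keyword at each position; the keywords and example.org URLs are placeholders, not TELDAP data, and this is not the prototype's actual implementation.

```python
import re
from html import escape


def tag_hyperlinks(text, keyword_urls):
    """Wrap every known keyword in text with an HTML link.

    Longer keywords are listed first in the pattern, so "bronze vessel"
    wins over "bronze" when both could match at the same position.
    """
    if not keyword_urls:
        return escape(text)
    ordered = sorted(keyword_urls, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered))
    parts, last = [], 0
    for match in pattern.finditer(text):
        parts.append(escape(text[last:match.start()]))
        url = escape(keyword_urls[match.group()], quote=True)
        parts.append(f'<a href="{url}">{escape(match.group())}</a>')
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


# Placeholder mapping for illustration.
links = {
    "bronze": "https://example.org/k/bronze",
    "bronze vessel": "https://example.org/k/bronze-vessel",
}
tagged = tag_hyperlinks("A bronze vessel", links)
```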
26. (3) Construct Teldap ontology and thesaurus
Establish association links between
Chinese keywords and Getty AAT.
Merge Chinese WordNet with English
WordNet
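One way association links could create implicit relationships between objects is by mapping keywords in both languages to a shared concept, so that objects tagged in different languages become related. The following is only a sketch of that idea: the concept identifiers are placeholders, not real Getty AAT identifiers, and the object data is invented.

```python
from collections import defaultdict

# Placeholder concept ids; real links would point to thesaurus records.
KEYWORD_TO_CONCEPT = {
    "青銅器": "concept:bronzes",
    "bronzes": "concept:bronzes",
    "繪畫": "concept:paintings",
    "paintings": "concept:paintings",
}

# Hypothetical object-keyword relations.
OBJECT_KEYWORDS = {
    "obj-1": ["青銅器"],
    "obj-2": ["bronzes"],
    "obj-3": ["繪畫"],
}


def related_objects(object_id):
    """Return other objects that share at least one concept with object_id."""
    by_concept = defaultdict(set)
    for oid, keywords in OBJECT_KEYWORDS.items():
        for keyword in keywords:
            concept = KEYWORD_TO_CONCEPT.get(keyword)
            if concept is not None:
                by_concept[concept].add(oid)
    related = set()
    for members in by_concept.values():
        if object_id in members:
            related |= members
    related.discard(object_id)
    return related


related_to_obj1 = related_objects("obj-1")
```

Here obj-1 and obj-2 share no keyword string, yet they are linked because both keywords resolve to the same concept.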
27. Future Perspectives
Technology development
Construct multilingual thesauri: Getty AAT
Maintain the TELDAP keyword and object relation
database
Construct name authority files, gazetteers, and
universal calendars
Design hyperlink taggers and keyword extension tools
Design an authoring tool that automatically provides
hyperlinks to keyword-related digital contents
Design knowledge-based content retrieval system
28. Future Perspectives
Content enrichment
Within TELDAP:
Standardize object metadata model and data format
All TELDAP objects should have metadata
Write scripts and stories for different topics with
a wiki-like knowledge structure
Enrich the digital collections
Establish hyperlinks between text books and
TELDAP collections
Extend the knowledge sources, e.g. Wikipedia