Vinod Chachra discussed improving discovery systems through post-processing harvested data. He outlined key players like data providers, service providers, and users. The harvesting, enrichment, and indexing processes were described. Facets, knowledge bases, and branding were discussed as ways to enhance discovery. Chachra concluded that progress has been made but more work is needed, and data and service providers should collaborate on standards.
Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"
1. Improving Discovery Systems Through
Post Processing of Harvested Data.
NISO Discovery Tools Forum
March 26-27, Chapel Hill, NC
Vinod Chachra, PhD
CEO, VTLS Inc.
2. About VTLS
VTLS is the first spin-off corporation from Virginia Tech
VTLS has been in this industry for over 30 years
VTLS has agents and offices in countries around the world. VTLS
does business in over 40 countries
VTLS Products -- four major product lines
Virtua -- Integrated Library System
VITAL -- Fedora-based Institutional Repository, developed partly
in partnership with the Australian ARROW project
VTRAX -- RFID-based tracking & security systems for libraries
Visualizer -- Discovery Tool and Discovery Service
4. Opening Keynote
Richard Ackerman's presentation -- two examples
Unmanned vehicles navigating urban environments
Virginia Tech finished 3rd in that contest this year, behind CMU and
Stanford
Uppsala Project (400th anniversary)
Uppsala University in Sweden
VTLS user for more than 15 years
7. Environment Assumptions
For the purposes of this presentation we will assume that the
discovery environment we work in is:
Global
Distributed
Multi-lingual
Contains different types of systems -- OPACs, repositories,
streaming media, etc.
8. A good example: NDLTD
NDLTD
Networked Digital Library of Theses and Dissertations
www.ndltd.org
Data for the Discovery system was harvested
From 6 continents
In 30 languages
From about 70 institutions
Using several primary and secondary aggregators
harvesting about 380,000 items (records)
10. Processes
Harvesting the data
Duplicate Control
Aggregation of data
Profiling of systems
Enhancing the metadata with external content
Creating a service (Discovery Service)
Updating the data (Back to Harvesting)
11. Definitions you already know (1 of 2)
The following definitions come from the OAI-PMH web site:
Harvesting (From: http://www.openarchives.org)
In the OAI context, harvesting refers specifically to the gathering together of
metadata from a number of distributed repositories into a combined data store.
Data Provider (From: http://www.openarchives.org)
A Data Provider maintains one or more repositories (web servers) that support the
OAI-PMH as a means of exposing metadata.
(OAI definition quoted from FAQ on OAI Web site)
Service Provider (From: http://www.openarchives.org)
A Service Provider issues OAI-PMH requests to data providers and uses the
metadata as a basis for building value-added services.
A Service Provider in this manner is "harvesting" the metadata exposed by Data
Providers
12. Definitions you already know (2 of 2)
Aggregator (From: http://www.openarchives.org)
An OAI aggregator is both a Service Provider and a Data Provider. It is a service
that gathers metadata records from multiple Data Providers and then makes those
records available for gathering by others using the OAI-PMH.
Enrichment Information Provider
A service provider that supplies additional descriptive information,
images, or streaming media related to any displayed result set,
making it more attractive or useful to the user.
Knowledge Base Provider
More on this later
User
YOU, ME
13. On Harvesting Economies
During the March 6, 2008 meeting of the DLF group in
Berkeley, CA it was reaffirmed that the OAI-PMH protocol
was the preferred method of acquiring data for discovery
systems.
Frequent harvesting of the entire content of data providers
should be discouraged. It is preferred that selective or
incremental harvesting be used.
Aggregators should be used where possible to decrease the
overall system load
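The harvesting economies above can be sketched in code. A minimal sketch, assuming a placeholder base URL and a hand-made sample response (no live harvest): the `from` argument supports incremental harvesting, and `resumptionToken` continues a partial response instead of re-harvesting everything.

```python
# Sketch of selective/incremental OAI-PMH harvesting; the base URL is a
# placeholder and the XML below is a hand-made sample, not a live response.
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build a ListRecords request URL. 'from' limits the harvest to
    records changed since the last run; a resumptionToken request
    carries no other arguments per the protocol."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
    return base_url + "?" + urlencode(params)

def parse_response(xml_text):
    """Return (record identifiers, resumptionToken or None)."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(
        ".//oai:record/oai:header/oai:identifier", OAI_NS)]
    tok = root.find(".//oai:resumptionToken", OAI_NS)
    return ids, (tok.text if tok is not None and tok.text else None)

SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <record><header><identifier>oai:example:2</identifier></header></record>
    <resumptionToken>batch-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("http://example.org/oai", from_date="2008-03-01")
ids, token = parse_response(SAMPLE)
print(url)
print(ids, token)
```

A returned token signals that the harvester should issue a follow-up token request rather than restart, which is exactly the load-reduction behavior the slide recommends.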
15. Discovery System Characteristics: Example 1
Example 1: http://rogers.vtls.com:6080/visualizer
System characteristics--
Fast
Support distributed content
"Push" information to user
No initial search required
Re-compute all facets for each result set
Act like a union catalog
Allow individual library branding
Support communities and collections
Support profiling
Enhance discovery through "Knowledge Base"
Support drill down capability
21. Discovery System Characteristics: Example 2
Example 2: http://rogers.vtls.com:7080/visualizer
System characteristics
Fast
Support distributed content
"Push" information to user
No initial search required
Re-compute all facets for each result set
Act like a union catalog
Allow individual library branding
Support communities and collections
Support profiling
Enhance discovery through "Knowledge Base"
Support drill down capability
26. Coded Data
Code Lists
Language
Content
Format
Target Audience
Etc.
String Lists
Circa; A.D., B.C.
Abbreviations -- MeSH, LCSH, ALA
Associations
If 2nd indicator has a value "x" then it means "Y".
If 2nd indicator has a value "z" then look at $2
27. Coded Data
Different sources do not use the same code lists
2-character or 3-character language abbreviations
Differences in coding between MARC 21 and UNIMARC
Codes are not known to most users
Requires a knowledge base to decode.
Coded data does reduce the size of the data during
harvesting but requires complex and sometimes impossible
decoding at the service provider end.
28. Who should decode the data?
Data Provider?
It appears that the data provider is best equipped to do
so. It is their data, and therefore they have the greatest
potential of doing it correctly. But there are issues:
Language: Example -- target audience "juvenile" in 30
different languages poses more problems for the service
providers than simply decoding the data upon receipt.
Timeliness (takes too long to co-ordinate lots of data
providers)
Understanding -- Service provider has a better
understanding of the global picture and the needs of the
service. It is easier to support multilingual searching and
multilingual display if the data is coded for use by the
search engine.
29. Who should decode the data?
Another view of coding data
Codes for language = French are
Sometimes "fr"
Sometimes "fre"
Why not make the code a string like "French"?
Think about it!
In my opinion:
In today's environment -- it is better for the service
provider (or the aggregator as a service provider) to decode
the data. It provides better consistency.
In the long run -- better for the data provider to do so.
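The "fr" versus "fre" problem above is a table lookup once someone owns it. A minimal sketch, assuming a small illustrative code table (not a complete ISO 639 list):

```python
# Normalize variant language codes ("fr", "fre", "fra") to one display
# string; the table here is a tiny illustrative subset, not a full list.
LANGUAGE_NAMES = {
    "fr": "French", "fre": "French", "fra": "French",
    "en": "English", "eng": "English",
    "sv": "Swedish", "swe": "Swedish",
}

def language_string(code):
    """Map a 2- or 3-character code to a language string; keep the raw
    code visible when it is unknown rather than guessing."""
    return LANGUAGE_NAMES.get(code.strip().lower(), "Unknown (%s)" % code)

print(language_string("fre"))  # French
print(language_string("FR"))   # French
```

Whether this lookup runs at the data provider or the service provider is exactly the question the slide poses; the code is the same either way, but the service provider can apply one table consistently across all sources.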
30. Codes in Classification Systems
Dewey Decimal system
LC Classification system
Do we really need to show the classification system to the
users? Why not display only the strings?
Easier for users
Easier for switching languages
Allows for meaningful drill down capabilities.
See example
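Displaying strings instead of classification codes can be sketched as a hierarchical lookup. A minimal sketch, assuming a tiny hypothetical subset of Dewey labels (not the full schedule):

```python
# Decode Dewey class numbers to display strings for a drill-down facet.
# The label table is a small illustrative subset, not the full schedule.
DEWEY_LABELS = {
    "500": "Science",
    "510": "Mathematics",
    "516": "Geometry",
}

def dewey_path(class_number):
    """Return the human-readable drill-down path for a Dewey number,
    walking from the hundreds division down to the full class."""
    path = []
    for prefix in (class_number[0] + "00",
                   class_number[:2] + "0",
                   class_number[:3]):
        label = DEWEY_LABELS.get(prefix)
        if label and label not in path:
            path.append(label)
    return path

print(dewey_path("516"))  # ['Science', 'Mathematics', 'Geometry']
```

The user sees "Science > Mathematics > Geometry" rather than "516", which also makes switching display languages a matter of swapping the label table.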
31. Characteristics of Facets
Modern discovery tools are facet based
Entries in Facets can be
One to One : Each metadata record has exactly one entry in
the facet (like date of publication; language; location)
One to many: Each metadata record can have more than
one entry in the facet table (like subjects, authors, etc)
Drill down: A hierarchically arranged facet that the user can
drill down on.
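The one-to-one versus one-to-many distinction above shows up directly when counting facet entries. A minimal sketch with invented records:

```python
# Build one-to-one and one-to-many facets over a result set.
from collections import Counter

records = [
    {"language": "French", "subjects": ["Algebra", "Geometry"]},
    {"language": "English", "subjects": ["Algebra"]},
    {"language": "French", "subjects": []},
]

# One-to-one: each record contributes exactly one facet entry.
language_facet = Counter(r["language"] for r in records)

# One-to-many: a record may contribute several entries, or none.
subject_facet = Counter(s for r in records for s in r["subjects"])

print(language_facet)
print(subject_facet)
```

Note that the one-to-one facet counts sum to the result-set size, while the one-to-many counts need not; that difference drives the consistency checks discussed later.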
32. Characteristics of Facets
Examples:
One to One, Drill Down Facets: Date; LC Class; Dewey; Place of Publication
One to One, No Drill Down Facets: Language; Content; Target Audience
One to Many, Drill Down Facets: LC Subject; Subject subdivisions
One to Many, No Drill Down Facets: MIME Type in complex objects
33. Consistency Considerations
The number in the result set MUST equal the sum of hits in
all one-to-one facets.
Need a method to handle "missing data" or "not available".
Added benefit of identifying the records with "missing" data.
Also applies to one-to-one drill down facet data.
See example
When the scope of the discovery system includes data
from sources that use multiple classification systems
then how do we aggregate this information? Or do we
simply let the chips fall where they may?
See example
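The invariant above can be enforced mechanically. A minimal sketch with invented records, using a "Not available" bucket so no record is silently dropped:

```python
# Consistency check for one-to-one facets: counts must sum to the
# result-set size, with missing values routed to a visible bucket.
from collections import Counter

def one_to_one_facet(records, field, missing="Not available"):
    """Count facet values, replacing absent/empty values with a bucket
    that both preserves the invariant and flags records needing repair."""
    return Counter(r.get(field) or missing for r in records)

results = [{"date": "2004"}, {"date": "2007"}, {"date": None}, {}]
facet = one_to_one_facet(results, "date")

# The invariant from the slide: facet totals equal the result-set size.
assert sum(facet.values()) == len(results)
print(facet)
```

The "Not available" bucket doubles as the record-quality report the slide mentions: clicking it surfaces exactly the records with missing data.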
34. Normalization of Data
In support of duplicate control
Capitalization
Handling of Diacritics
Ending punctuation
Embedded punctuation (ISBD punctuation)
Misspelled terms
Standard forms based on Authority Control systems
Normalization for sorting purposes
(US sort versus European sort or Swedish sort)
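The normalization steps listed above (capitalization, diacritics, punctuation) can be combined into one matching key. A minimal sketch using only the standard library:

```python
# Normalization for duplicate control: produce a key under which
# trivially different copies of the same record compare equal.
import string
import unicodedata

def normalize(text):
    """Case-fold, strip diacritics (NFKD decomposition with combining
    marks dropped), remove punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    no_punct = no_marks.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

print(normalize("Les Misérables :"))  # les miserables
print(normalize("LES MISERABLES"))    # les miserables
```

Locale-aware sorting (US versus Swedish collation, for instance) needs a separate collation step; stripping diacritics is right for matching but wrong for Swedish sorting, where å, ä, ö are distinct letters at the end of the alphabet.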
35. Harvested Data Formats
MARC21
UniMarc
Dublin Core
Qualified Dublin Core
MODS
METS
EAD
LAP (Local Application Profile)
Numerous crosswalks will be needed between them
Unless each data provider can support the harvesting of
data in several agreed upon formats.
Crosswalks by data provider (for some subset of formats)
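A crosswalk between formats is, at its simplest, a tag-to-element mapping. A minimal sketch from MARC-tagged fields to simple Dublin Core; the tag table is a simplified subset (real crosswalks handle many more tags, indicators, and subfields):

```python
# Minimal crosswalk sketch from MARC-tagged fields to simple Dublin
# Core; the mapping here is an illustrative subset only.
MARC_TO_DC = {
    "100": "creator",
    "245": "title",
    "260": "publisher",
    "650": "subject",
    "041": "language",
}

def crosswalk(marc_fields):
    """marc_fields: list of (tag, value) pairs -> dict of DC element
    name to list of values. Unmapped tags are dropped in this sketch."""
    dc = {}
    for tag, value in marc_fields:
        element = MARC_TO_DC.get(tag)
        if element:
            dc.setdefault(element, []).append(value)
    return dc

record = [("245", "Theses on discovery"), ("100", "Chachra, Vinod"),
          ("650", "Discovery systems"), ("650", "Metadata harvesting")]
dc = crosswalk(record)
print(dc)
```

The lossiness is visible in the code: unmapped tags vanish, which is why the next slide argues for also exposing a richer published local format alongside the lowest-common-denominator one.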
36. Harvested Data Formats
Data providers should provide data to service providers
through OAI-PMH in (at a minimum)
Some internationally acceptable format
In a published local format
To convey the richness of the content
To permit deep linking for objects that are complex
objects in the collection
37. Enrichment Process
A well-understood process
Enrichment providers supply additional useful information
in support of the selection process
Images -- including book covers; streaming video;
TOC; author notes, etc.
Several suppliers of this information
Syndetics
Video Detective
Google
Amazon
OCLC
More suppliers will come
38. Use of Knowledge Base
Knowledge base may be
Global (applies to all data providers in a class)
All data providers using MARC 21 based metadata
All data providers using DC data
All data providers using XMLAP (XML based
application profile) or LAP (Local Application Profile)
Local (applies to specific data providers)
Institutions
Locations within institutions
Cantons and other location categories.
39. Use of Knowledge Base
Examples of Global Knowledge Base
LC Classification (conversion of codes to strings)
Dewey Classification (conversion of codes to strings)
State/Country >> Country >> Continent aggregation
Format (Data from 006 and 008)
Language code to text (008 positions 35-37 to text)
Target Audience (008 position 22 to text)
Contents (008 value to text)
40. Use of Knowledge Base
Local Knowledge Base
Location names and codes
Institutions names and codes
Main Locations and Sub Locations
Source of content
From Library OPAC
From Institutional Repositories
From Government Documents
From State Legal Documents
External Link to Content (enrichment process)
44. FRBR â UCL Catalog Examples
Owned by multiple institutions
45. FRBR
There are two approaches to the implementation of FRBR:
Store the internal data in a hierarchic linked-record format.
FRBRize records upon display, keeping the storage system
like a traditional flat catalog.
Since records are cataloged only once and displayed many
times, it is better to use the first method.
When harvesting FRBR records, do we unFRBRize them
and then harvest? Or do we harvest them and then do
some post-processing?
The issue remains unresolved.
46. Duplicate Control -- De-Duping FRBR
For a discovery tool used in a multi-institutional
environment, should a record representing the same
manifestation (work + expression + manifestation in FRBR
terminology) be repeated if it comes from multiple
institutions, or should it be de-duped?
If it is de-duped, then what do we do if one of the
institutions has the record as a FRBR record and the other
does not?
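One common answer to the first question is to collapse duplicates while remembering every holding institution. A minimal sketch, grouping by a simplified title+author+edition key (a stand-in for real manifestation matching, not full FRBR logic):

```python
# De-duping across institutions: group records by a simplified
# manifestation key and keep one entry per manifestation that
# remembers every holding institution. Records are invented.
def manifestation_key(rec):
    """Simplified stand-in for a manifestation match key."""
    return (rec["title"].casefold().strip(),
            rec["author"].casefold().strip(),
            rec.get("edition", "").casefold().strip())

def dedupe(records):
    merged = {}
    for rec in records:
        key = manifestation_key(rec)
        entry = merged.setdefault(key, {**rec, "institutions": []})
        entry["institutions"].append(rec["institution"])
    return list(merged.values())

records = [
    {"title": "Hamlet", "author": "Shakespeare", "institution": "A"},
    {"title": "HAMLET ", "author": "shakespeare", "institution": "B"},
    {"title": "Macbeth", "author": "Shakespeare", "institution": "C"},
]
deduped = dedupe(records)
print(len(deduped))  # 2: the two Hamlet records collapse into one
```

The slide's harder question remains: when one institution's copy is a FRBR record and another's is flat, the two will not even produce comparable keys without first normalizing one form into the other.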
47. Handling of Names and Subjects
Developing solutions using URIs
SKOS -- Simple Knowledge Organization System
According to Rob Styles and others, in their article
presently in draft form, called "Semantic MARC, MARC21
and the Semantic Web":
LC is working towards releasing their subject
headings on the web using SKOS.
If this happens, it will make possible several
advances in creating web-based linked authorities.
The possibilities are immense, but what do we do now
for name and subject authorities in discovery systems?
See example
52. How Many Facets?
Basic facets and Extended facets
Basic facets -- minimal set for every implementation
Extended facets -- additional facets for special use
How many facets?
Too few facets are ineffective
Too many facets are not user friendly
How to identify
One to one facets
One to many facets
Drill down facets
53. How does it work?
1. Harvest data: OAI-PMH used for harvesting the metadata
2. Create KB: Apply the "Knowledge Base" to the Metadata
3. Profile the system for proper facets
Facets on the raw data
Facets on the derived data from knowledge base
4. Create standardized input for indexing
5. Apply indexing for use by search engine
6. Throw away the harvested metadata but retain the index
7. Discover
8. Hyperlink to the source for display of content
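The eight steps above can be strung together end to end. A minimal sketch with invented records and a two-entry knowledge base, standing in for a real harvest:

```python
# End-to-end sketch of steps 1-8: harvested metadata is decoded via a
# knowledge base, faceted, indexed, then discarded; only the index and
# a hyperlink back to the source are retained. All data is invented.
from collections import Counter, defaultdict

harvested = [  # step 1 (stand-in for an OAI-PMH harvest)
    {"id": "oai:a:1", "title": "Set theory", "lang_code": "eng",
     "url": "http://example.org/1"},
    {"id": "oai:b:2", "title": "Théorie des ensembles", "lang_code": "fre",
     "url": "http://example.org/2"},
]

KNOWLEDGE_BASE = {"eng": "English", "fre": "French"}  # step 2

index = defaultdict(set)   # steps 4-5: term -> record ids
facets = Counter()         # step 3: facet on derived (decoded) data
links = {}                 # step 8: id -> source hyperlink for display

for rec in harvested:
    language = KNOWLEDGE_BASE.get(rec["lang_code"], "Unknown")
    facets[language] += 1
    for term in rec["title"].lower().split():
        index[term].add(rec["id"])
    links[rec["id"]] = rec["url"]

harvested = None  # step 6: discard the metadata, retain index + links

hits = index["theory"]  # step 7: discover
print(sorted(links[i] for i in hits))
```

Keeping only the index and the hyperlinks is what lets the discovery service stay small while the full content remains at the data provider, reached via the link at display time.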
54. Visualizer -- Expanding the Architecture
The problem is massive.
We can admit defeat or build systems to cope with it.
These systems must take advantage of the capabilities of
the computer and combine them with the knowledge,
expertise, and inference ability of humans.
Two Questions
How do you organize the world's information?
How do we visualize the nature and depth of content?
VTLS Visualizer OPAC -- Facet-Based Searching
[Architecture diagram: data sources (ILS 1, 2, 3; Repository 1, 2; any
system) feed mapping routines (MARC 21, DC, XML, or direct) that produce
standardized input for the facet search engine. The engine, supported by
a knowledge base plus profiling and query interfaces, delivers
standardized output to display management.]
55. Conclusions
1. A lot of progress has been made in discovery systems and
they have paved the way to a new future.
2. Much remains to be done.
3. Work by DLF under the leadership of John Mark Ockerbloom
will set the direction for a considered approach in the
development and deployment of discovery systems.
4. In the meantime, Data providers and Service providers need
to get together and agree on
Who does what?
How can we make the systems consistent?
How can we do this in a multilingual environment?
How can we bring together different sources of
information under a single discovery system; in other
words, how can we best help the users?