Improving Discovery Systems Through
Post Processing of Harvested Data.
NISO Discovery Tools Forum
March 26-27, Chapel Hill, NC
Vinod Chachra, PhD
CEO, VTLS Inc.
About VTLS
VTLS is the first spin-off corporation from Virginia Tech
VTLS has been in this industry for over 30 years
VTLS has agents and offices in countries around the world. VTLS
does business in over 40 countries
VTLS Products -- four major product lines
Virtua – Integrated Library System
VITAL – Fedora based Institutional Repository – developed partly
in partnership with the Australian ARROW project.
VTRAX – RFID based tracking & security systems for libraries
Visualizer – Discovery Tool and Discovery Service
VTLS HQ in Blacksburg, VA, USA
Opening Keynote
Richard Ackerman’s presentation – two examples
Unmanned vehicles navigating urban environments
Virginia Tech placed 3rd in that contest this year, behind CMU and
Stanford
Uppsala Project (400th anniversary)
Uppsala University in Sweden
VTLS user for more than 15 years
Harvesting
Discovery through Probing
Environment Assumptions
For the purposes of this presentation we will assume that the
discovery environment we work in is:
Global
Distributed
Multi-lingual
Made up of different types of systems -- OPACs, repositories,
streaming media

A good example: NDLTD
NDLTD
Networked Digital Library of Theses and Dissertations
www.ndltd.org
Data for the Discovery system was harvested
From 6 continents
In 30 languages
From about 70 institutions
Using several primary and secondary aggregators
harvesting about 380,000 items (records)
Players
Data Providers
Service Providers (Discovery Service Provider)
Aggregators
Enrichment Information Providers
Knowledge Base Providers
Users
Processes
Harvesting the data
Duplicate Control
Aggregation of data
Profiling of systems
Enhancing the metadata with external content
Creating a service (Discovery Service)
Updating the data (Back to Harvesting)
Definitions you already know (1 of 2)
The following definitions come from the OAI-PMH web site:
Harvesting (From: http://www.openarchives.org)
In the OAI context, harvesting refers specifically to the gathering together of
metadata from a number of distributed repositories into a combined data store.
Data Provider (From: http://www.openarchives.org)
A Data Provider maintains one or more repositories (web servers) that support the
OAI-PMH as a means of exposing metadata.
(OAI definition quoted from FAQ on OAI Web site)
Service Provider (From: http://www.openarchives.org)
A Service Provider issues OAI-PMH requests to data providers and uses the
metadata as a basis for building value-added services.
A Service Provider in this manner is "harvesting" the metadata exposed by Data
Providers
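As a rough illustration of what a Service Provider does with a harvested reply, the sketch below parses a small `oai_dc` ListRecords response. The sample XML, the identifier and the function name `extract_records` are assumptions made for this example; the namespaces are the ones defined by OAI-PMH 2.0 and Dublin Core.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical reply from a data provider (trimmed to one record).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:etd-1</identifier>
        <datestamp>2008-03-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Thesis</dc:title>
          <dc:language>fre</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_records(xml_text):
    """Return (identifier, title, language) tuples from a ListRecords reply."""
    root = ET.fromstring(xml_text)
    rows = []
    for rec in root.iterfind(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        title = rec.findtext(".//dc:title", default="", namespaces=NS)
        lang = rec.findtext(".//dc:language", default="", namespaces=NS)
        rows.append((ident, title, lang))
    return rows
```

Note that the language arrives as a code (`fre`), which is the kind of value the later slides propose decoding at the service provider end.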
Definitions you already know (2 of 2)
Aggregator (From: http://www.openarchives.org)
An OAI aggregator is both a Service Provider and a Data Provider. It is a service
that gathers metadata records from multiple Data Providers and then makes those
records available for gathering by others using the OAI-PMH.
Enrichment Information Provider
A service provider that supplies additional descriptive information, images or
streaming media related to any displayed result set, making it more attractive or
useful to the user.
Knowledge Base Provider
More on this later
User
YOU, ME
On Harvesting Economies
During the March 6, 2008 meeting of the DLF group in
Berkeley, CA it was reaffirmed that the OAI-PMH protocol
was the preferred method of acquiring data for discovery
systems.
Frequent harvesting of the entire content of data providers
should be discouraged. It is preferred that selective or
incremental harvesting be used.
Aggregators should be used where possible to decrease the
overall system load
Example: NDLTD [www.ndltd.org]
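Selective and incremental harvesting map onto the optional `set` and `from` arguments of the OAI-PMH `ListRecords` verb. The following is a minimal sketch of building such a request; the function name, the base URL and the set name in the usage note are placeholders, not real endpoints.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords URL for selective or incremental harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # incremental: only records changed since this date
    if set_spec:
        params["set"] = set_spec    # selective: only records in this set
    return base_url + "?" + urlencode(params)
```

For example, `list_records_url("http://example.org/oai", from_date="2008-03-01", set_spec="etd")` asks only for records in the assumed `etd` set that changed on or after 1 March 2008, rather than the provider's entire content.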
Discovery System Characteristics: Example 1
Example 1: http://rogers.vtls.com:6080/visualizer
System characteristics--
Fast
Support distributed content
“Push” information to user
No initial search required
Re-compute all facets for each result set
Act like a union catalog
Allow individual library branding
Support communities and collections
Support profiling
Enhance discovery through “Knowledge Base”
Support drill down capability
www.geonames.org
Discovery System Characteristics: Example 2
Example 2: http://rogers.vtls.com:7080/visualizer
System characteristics
Fast
Support distributed content
“Push” information to user
No initial search required
Re-compute all facets for each result set
Act like a union catalog
Allow individual library branding
Support communities and collections
Support profiling
Enhance discovery through “Knowledge Base”
Support drill down capability
Communities
Classifications
Branding
Branding
Drill Down
Coded Data
Code Lists
Language
Content
Format
Target Audience
Etc.
String Lists
Circa; A.D., B.C.
Abbreviations – MESH, LCSH, ALA
Associations
If 2nd indicator has a value “x” then it means “Y”.
If 2nd indicator has a value “z” then look at $2
Coded Data
Different sources do not use the same codes for the same values
2 character or 3 character language abbreviations
Difference in coding between Marc21 and UniMarc
Codes are not known to most users
Requires a knowledge base to decode.
Coded data does reduce the size of the data during
harvesting but requires complex and sometimes impossible
decoding at the service provider end.
Who should decode the data?
Data Provider?
It appears that the data provider is best equipped to do
so. It is their data and therefore they have the greatest
potential of doing it correctly. But there are issues:
Language : Example – target audience “juvenile” in 30
different languages poses more problems for the service
providers than simply decoding the data upon receipt.
Timeliness (takes too long to co-ordinate lots of data
providers)
Understanding -- Service provider has a better
understanding of the global picture and the needs of the
service. It is easier to support multilingual searching and
multilingual display if the data is coded for use by the
search engine.
Who should decode the data?
Another view of coding data
Codes for language = French are
Sometimes “fr”
Sometimes “fre”
Why not make the code a string like “French”?
Think about it !
In my opinion:
In today’s environment -- it is better for the service
provider (or the aggregator as a service provider) to decode
the data. It provides for better consistency.
In the long run -- better for the data provider to do so.
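One way a service provider might decode language codes consistently is to map every code variant to a single concept and only then choose a display string for the interface language. This is a sketch under that assumption; the table contents and the function name `decode_language` are illustrative sample data, not part of any VTLS product.

```python
# Illustrative knowledge-base fragment: every code variant points to one concept.
LANGUAGE_CODES = {
    "fr": "french", "fre": "french", "fra": "french",
    "en": "english", "eng": "english",
    "sv": "swedish", "swe": "swedish",
}

# Display strings per interface language (assumed sample data).
LABELS = {
    "french": {"en": "French", "fr": "français"},
    "english": {"en": "English", "fr": "anglais"},
    "swedish": {"en": "Swedish", "fr": "suédois"},
}


def decode_language(code, ui_lang="en"):
    """Turn a 2- or 3-character language code into a display string."""
    concept = LANGUAGE_CODES.get(code.strip().lower())
    if concept is None:
        return "Not available"  # unknown codes stay visible instead of vanishing
    labels = LABELS[concept]
    return labels.get(ui_lang, labels["en"])
```

Because "fr" and "fre" resolve to the same concept, both land in one facet entry, and switching the interface language only changes the label lookup.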
Codes in Classification Systems
Dewey Decimal system
LC Classification system
Do we really need to show the classification system to the
users? Why not display only the strings?
Easier for users
Easier for switching languages
Allows for meaningful drill down capabilities.
See example
Characteristics of Facets
Modern discovery tools are facet based
Entries in Facets can be
One to One : Each metadata record has exactly one entry in
the facet (like date of publication; language; location)
One to many: Each metadata record can have more than
one entry in the facet table (like subjects, authors, etc)
Drill down: A hierarchically arranged facet that the user can
drill down on.
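The three kinds of facet can be sketched as three small counting functions. The record layout, field names and sample values below are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical records: one language each, several subjects each,
# and a hierarchical place stored as a (continent, country, city) path.
RECORDS = [
    {"language": "English", "subjects": ["Robotics", "Navigation"],
     "place": ("North America", "United States", "Virginia")},
    {"language": "Swedish", "subjects": ["History"],
     "place": ("Europe", "Sweden", "Uppsala")},
    {"language": "English", "subjects": ["Robotics"],
     "place": ("Europe", "Sweden", "Stockholm")},
]


def one_to_one_facet(records, field):
    """Exactly one entry per record, so the counts sum to len(records)."""
    return Counter(rec[field] for rec in records)


def one_to_many_facet(records, field):
    """Several entries per record, so the counts may exceed len(records)."""
    return Counter(value for rec in records for value in rec[field])


def drill_down_facet(records, field, path=()):
    """Count the values one level below `path` in a hierarchical facet."""
    depth = len(path)
    return Counter(
        rec[field][depth]
        for rec in records
        if rec[field][:depth] == tuple(path) and len(rec[field]) > depth
    )
```

Calling `drill_down_facet(RECORDS, "place")` lists continents; passing `path=("Europe",)` narrows to the countries below Europe, and so on down the hierarchy.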
Characteristics of Facets
Examples:
One to Many, Drill Down Facets: LC Subject; Subject subdivisions
One to Many, No Drill Down Facets: Mime Type in complex objects
One to One, Drill Down Facets: Date; LC Class; Dewey; Place of Publication
One to One, No Drill Down Facets: Language; Content; Target audience
Consistency Considerations
The number of records in the result set MUST equal the sum of
hits within each one-to-one facet.
Need a method to handle “missing data” or “not available”.
Added benefit of identifying the records with “missing” data.
Also applies to one-to-one drill down facet data.
See example
When the scope of the discovery system includes data
from sources that use multiple classification systems
then how do we aggregate this information? Or do we
simply let the chips fall where they may?
See example
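The consistency rule for one-to-one facets can be checked mechanically. The sketch below assumes records are dictionaries with an `id` key and uses a "Not available" bucket for missing values; both choices are illustrative.

```python
from collections import Counter

MISSING = "Not available"


def one_to_one_counts(result_set, field, keep_missing=True):
    """Count one facet value per record, optionally bucketing absent values."""
    counts = Counter()
    for rec in result_set:
        value = rec.get(field)
        if value:
            counts[value] += 1
        elif keep_missing:
            counts[MISSING] += 1
    return counts


def facet_is_consistent(result_set, counts):
    """True when the facet hits add up to the size of the result set."""
    return sum(counts.values()) == len(result_set)


def records_missing(result_set, field):
    """Identify the records that fell into the missing bucket."""
    return [rec["id"] for rec in result_set if not rec.get(field)]
```

Without the missing bucket, a record that lacks a language silently drops out of the facet and the totals no longer match; with it, the totals agree and `records_missing` points at the records that need attention.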
Normalization of Data
In support of duplication control
Capitalization
Handling of Diacritics
Ending punctuation
Embedded punctuation (ISBD punctuation)
Misspelled terms
Standard forms based on Authority Control systems
Normalization for sorting purposes
(US sort versus European sort or Swedish sort)
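A minimal sketch of normalization for duplicate control, covering capitalization, diacritics and punctuation. The function name `match_key` is an assumption, and misspellings and authority-controlled forms are outside what this sketch attempts.

```python
import re
import unicodedata


def match_key(text):
    """Normalize a title or name into a key for duplicate detection."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case, then replace punctuation (including ISBD marks) with spaces.
    folded = stripped.casefold()
    no_punct = re.sub(r"[^\w\s]", " ", folded)
    # Collapse runs of whitespace left behind by the removed punctuation.
    return " ".join(no_punct.split())
```

A key like this suits matching only. Sorting needs locale rules instead: Swedish, for instance, sorts å, ä and ö as separate letters after z, so stripping their diacritics would give the wrong order.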
Harvested Data Formats
MARC21
UniMarc
Dublin Core
Qualified Dublin Core
MODS
METS
EAD
LAP (Local Application Profile)
Numerous crosswalks will be needed between them
Unless each data provider can support the harvesting of
data in several agreed upon formats.
Crosswalks by data provider (for some subset of formats)
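At its simplest, a crosswalk is a lookup from source fields to target elements. The mapping below is a deliberately tiny, illustrative assumption and not a published MARC-to-Dublin-Core crosswalk; the input layout of (tag, subfield, value) triples is also hypothetical.

```python
# Illustrative mapping only: keys are (MARC tag, subfield code).
MARC_TO_DC = {
    ("245", "a"): "title",
    ("100", "a"): "creator",
    ("650", "a"): "subject",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
}


def crosswalk(marc_fields):
    """Convert [(tag, subfield, value), ...] into a Dublin Core style dict."""
    dc = {}
    for tag, code, value in marc_fields:
        element = MARC_TO_DC.get((tag, code))
        if element:  # unmapped fields are dropped, which is where richness is lost
            dc.setdefault(element, []).append(value)
    return dc
```

Each pair of formats needs its own table of this kind, which is why the number of crosswalks grows quickly unless data providers agree on a few common formats.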
Harvested Data Formats
Data providers should provide data to service providers
through OAI-PMH in (at a minimum)
Some internationally acceptable format
In a published local format
To convey the richness of the content
To permit deep linking for objects that are complex
objects in the collection
Enrichment Process
Well understood process
They supply additional useful information in support of
the selection process
Images – including book covers; streaming video;
TOC; Author notes etc.
Several suppliers of this information
Syndetics
Video Detective
Google
Amazon
OCLC
More suppliers will come
Use of Knowledge Base
Knowledge base may be
Global (applies to all data providers in a class)
All data providers using MARC 21 based metadata
All data providers using DC data
All data providers using XMLAP (XML based
application profile) or LAP (Local Application Profile)
Local (applies to specific data providers)
Institutions
Locations within institutions
Cantons and other location categories.
Use of Knowledge Base
Examples of Global Knowledge Base
LC Classification (conversion of codes to strings)
Dewey Classification (conversion of codes to strings)
State/Country >> Country >> Continent Aggregation
Format (Data from 006 and 008)
Language code to text (008 positions 35 – 37 to text)
Target Audience (008 position 22 to text)
Contents (008 value to text)
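The 008 conversions above amount to slicing fixed positions out of the field and looking the codes up. This sketch uses short excerpts of the MARC 21 code lists; the function name and the sample field are assumptions, and position 22 carries target audience only for books and some other material types.

```python
# Small excerpts of the MARC 21 code lists, for illustration.
LANGUAGES = {"eng": "English", "fre": "French", "ger": "German", "swe": "Swedish"}
AUDIENCES = {"d": "Adolescent", "e": "Adult", "g": "General", "j": "Juvenile"}


def decode_008(field_008):
    """Decode language (positions 35-37) and target audience (position 22)."""
    lang_code = field_008[35:38]
    audience_code = field_008[22:23]
    return {
        "language": LANGUAGES.get(lang_code, "Not available"),
        "target_audience": AUDIENCES.get(audience_code, "Not available"),
    }


# Hypothetical 40-character 008 field with only the two positions filled in.
SAMPLE_008 = " " * 22 + "j" + " " * 12 + "fre" + " " * 2
```

Codes that are blank or not in the table fall through to "Not available", so they still appear in the facet instead of disappearing.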
Use of Knowledge Base
Local Knowledge Base
Location names and codes
Institutions names and codes
Main Locations and Sub Locations
Source of content
From Library OPAC
From Institutional Repositories
From Government Documents
From State Legal Documents
External Link to Content (enrichment process)
FRBR – UCL Catalog
FRBR – UCL Catalog Examples
FRBR – UCL Catalog Examples
Owned By Field
FRBR – UCL Catalog Examples
Owned By
Multiple
Institutions
FRBR
There are two approaches to the implementation of FRBR
Store the internal data in a hierarchic linked record format.
FRBRize records upon display keeping the storage system
like a traditional flat catalog
Since records are cataloged only once and displayed many
many times it is better to use the first method.
When harvesting FRBR records do we unFRBRize them
and then harvest? Or do we harvest them and then do
some post processing?
Issue remains unresolved.
Duplicate Control -- De-Duping FRBR
For a discovery tool used in a multi-institutional
environment should a record representing the same
manifestation (work + expression + manifestation in FRBR
terminology) be repeated if it comes from multiple
institutions or should it be de-duped?
If it is de-duped then what do we do if one of the
institutions has the record as a FRBR record and the other
does not?
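For the simpler case where both institutions supply flat records, de-duping can be sketched as grouping on a match key and pooling the owners into one "Owned By" list. The record fields and the title-plus-year key are assumptions for illustration; a real manifestation match would need more evidence, and the mixed FRBR and non-FRBR case raised above is not addressed here.

```python
def dedupe(records):
    """Merge records that share a match key, pooling their owners."""
    merged = {}
    for rec in records:
        # Crude match key: case-folded title without edge punctuation, plus year.
        key = (rec["title"].casefold().strip(" ./"), rec["year"])
        entry = merged.setdefault(
            key, {"title": rec["title"], "year": rec["year"], "owned_by": []}
        )
        if rec["institution"] not in entry["owned_by"]:
            entry["owned_by"].append(rec["institution"])
    return list(merged.values())
```

The merged entry keeps the first title form it saw and lists every contributing institution, so the discovery tool shows one hit owned by several institutions.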
Handling of Names and Subjects
Developing solutions using URIs
SKOS – Simple Knowledge Organization System
According to Rob Styles and others, in their article
presently in draft form, called “Semantic MARC, MARC21
and the Semantic Web”
LC is working towards releasing their subject
headings on the web using SKOS.
If this happens, it will make possible several
advances in creating web based linked authorities.
The possibilities are immense but what do we do now
for name and subject authorities in discovery systems?
See example
Branding (1 of 4)
Branding (2 of 4)
Branding and Drill Down (3 of 4)
Branding and Expanded Search (4 of 4)
How Many Facets?
Basic facets and Extended facets
Basic facets -- minimal set for every implementation
Extended facets – additional facets for special use
How many facets?
Too few facets are ineffective
Too many facets are not user friendly
How to identify
One to one facets
One to many facets
Drill down facets
How does it work?
1. Harvest data: OAI-PMH used for harvesting the metadata
2. Create KB: Apply the “Knowledge Base” to the Metadata
3. Profile the system for proper facets
Facets on the raw data
Facets on the derived data from knowledge base
4. Create standardized input for indexing
5. Apply indexing for use by search engine
6. Throw away the harvested metadata but retain index
7. Discover
8. Hyperlink to the source for display of content
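The steps above can be sketched end to end in a few lines. This is only an illustration of the flow, not the Visualizer implementation: the record fields, the `decode` callable standing in for the knowledge base, and both function names are assumptions.

```python
from collections import defaultdict


def build_index(harvested, decode):
    """Steps 2-6: decode fields, derive facets, index, and drop the metadata."""
    index = defaultdict(set)  # (facet, value) -> record identifiers
    links = {}                # identifier -> source URL, kept for step 8
    for rec in harvested:
        links[rec["id"]] = rec["url"]
        index[("language", decode(rec["lang"]))].add(rec["id"])     # derived facet
        index[("institution", rec["institution"])].add(rec["id"])   # raw facet
    return index, links       # the harvested records themselves are not retained


def discover(index, links, **selected):
    """Step 7: intersect the chosen facet values; step 8: return source links."""
    hits = None
    for facet, value in selected.items():
        ids = index.get((facet, value), set())
        hits = ids if hits is None else hits & ids
    return sorted(links[i] for i in (hits or set()))
```

Only the index and the links back to the source survive; the content itself is displayed by following the hyperlink to the data provider.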
Visualizer -- Expanding the Architecture
The problem is massive.
We can admit defeat or build systems to cope with it.
These systems must take advantage of the capabilities of
the computer and combine it with the knowledge,
expertise and inference ability of humans.
Two Questions
How do you organize the world’s information?
How do we visualize the nature and depth of content?
VTLS Visualizer OPAC -- Facet Based Searching
[Architecture diagram]
Sources: ILS 1,2,3; Repository 1,2; ANY SYSTEM
Inputs to the MAPPING ROUTINES: MARC 21; D.C.; XML; Direct
Layers: MAPPING ROUTINES; STANDARDIZED INPUT; FACET SEARCH ENGINE; STANDARDIZED OUTPUT; DISPLAY MANAGEMENT
Supporting components: Profiling Interface; Query Interface; Knowledge Base
Conclusions
1. A lot of progress has been made in discovery systems and
they have paved the way to a new future.
2. Much remains to be done.
3. Work by DLF under the leadership of John Mark Ockerbloom
will set the direction for a considered approach in the
development and deployment of discovery systems.
4. In the meantime, Data providers and Service providers need
to get together and agree on
Who does what?
How can we make the systems consistent?
How can we do this in a multilingual environment?
How can we bring together different sources of
information under a single discovery system; in other
words, how can we best help the users?
Questions

Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"

  • 1. Improving Discovery Systems Through Post Processing of Harvested Data. NISO Discovery Tools Forum, March 26-27, Chapel Hill, NC. Vinod Chachra, PhD, CEO, VTLS Inc.
  • 2. About VTLS VTLS is the first spin-off corporation from Virginia Tech VTLS has been in this industry for over 30 years VTLS has agents and offices in countries around the world. VTLS does business in over 40 countries VTLS Products -- four major product lines Virtua – Integrated Library System VITAL – Fedora based Institutional Repository – developed partly in partnership with the Australian ARROW project. VTRAX – RFID based tracking & security systems for libraries Visualizer – Discovery Tool and Discovery Service
  • 3. VTLS HQ in Blacksburg, VA, USA
  • 4. Opening Keynote Richard Ackerman’s presentation – two examples Unmanned vehicles navigating urban environments Virginia Tech placed 3rd in that contest this year behind CMU and Stanford Uppsala Project (400th anniversary) Uppsala University in Sweden VTLS user for more than 15 years
  • 7. Environment Assumptions For the purposes of this presentation we will assume that the discovery environment we work in is: Global Distributed Multi-lingual Contains different types of systems -- OPACS, repositories, streaming media

  • 8. A good example: NDLTD NDLTD Networked Digital Library of Theses and Dissertations www.ndltd.org Data for the Discovery system was harvested From 6 continents In 30 languages From about 70 institutions Using several primary and secondary aggregators harvesting about 380,000 items (records)
  • 9. Players Data Providers Service Providers (Discovery Service Provider) Aggregators Enrichment Information Providers Knowledge Base Providers Users
  • 10. Processes Harvesting the data Duplicate Control Aggregation of data Profiling of systems Enhancing the metadata with external content Creating a service (Discovery Service) Updating the data (Back to Harvesting)
  • 11. Definitions you already know (1 of 2) The following definitions come from OAI-PMH web site: Harvesting (From: http://www.openarchives.org) In the OAI context, harvesting refers specifically to the gathering together of metadata from a number of distributed repositories into a combined data store. Data Provider (From: http://www.openarchives.org) A Data Provider maintains one or more repositories (web servers) that support the OAI-PMH as a means of exposing metadata. (OAI definition quoted from FAQ on OAI Web site) Service Provider (From: http://www.openarchives.org) A Service Provider issues OAI-PMH requests to data providers and uses the metadata as a basis for building value-added services. A Service Provider in this manner is "harvesting" the metadata exposed by Data Providers
  • 12. Definitions you already know (2 of 2) Aggregator (From: http://www.openarchives.org) An OAI aggregator is both a Service Provider and a Data Provider. It is a service that gathers metadata records from multiple Data Providers and then makes those records available for gathering by others using the OAI-PMH. Enrichment Information Provider A service provider that supplies additional descriptive information, images or streaming media related to any displayed result set, making it more attractive or useful to the user. Knowledge Base Provider More on this later User YOU, ME
  • 13. On Harvesting Economies During the March 6, 2008 meeting of the DLF group in Berkeley, CA it was reaffirmed that the OAI-PMH protocol was the preferred method of acquiring data for discovery systems. Frequent harvesting of the entire content of data providers should be discouraged. It is preferred that selective or incremental harvesting be used. Aggregators should be used where possible to decrease the overall system load
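A minimal sketch of what selective or incremental harvesting looks like at the request level. The `ListRecords` verb and the `metadataPrefix`, `from` and `set` arguments are part of OAI-PMH; the function name and the repository URL are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    from_date limits the harvest to records changed since that date
    (incremental harvesting); set_spec limits it to one set (selective).
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint: ask only for records changed since the last harvest.
url = list_records_url("https://repository.example.org/oai", from_date="2008-03-01")
```

Omitting `from_date` requests the entire repository, which is the full harvest the slide recommends avoiding when it is done frequently.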
  • 15. Discovery System Characteristics: Example 1 Example 1: http://rogers.vtls.com:6080/visualizer System characteristics-- Fast Support distributed content “Push” information to user No initial search required Re-compute all facets for each result set Act like a union catalog Allow individual library branding Support communities and collections Support profiling Enhance discovery through “Knowledge Base” Support drill down capability
  • 16.
  • 17.
  • 18.
  • 20.
  • 21. Discovery System Characteristics: Example 2 Example 2: http://rogers.vtls.com:7080/visualizer System characteristics Fast Support distributed content “Push” information to user No initial search required Re-compute all facets for each result set Act like a union catalog Allow individual library branding Support communities and collections Support profiling Enhance discovery through “Knowledge Base” Support drill down capability
  • 26. Coded Data Code Lists Language Content Format Target Audience Etc. String Lists Circa; A.D., B.C. Abbreviations – MESH, LCSH, ALA Associations If 2nd indicator has a value “x” then it means “Y”. If 2nd indicator has a value “z” then look at $2
  • 27. Coded Data Code Lists do not use the same code for different sources 2 character or 3 character language abbreviations Difference in coding between Marc21 and UniMarc Codes are not known to most users Requires a knowledge base to decode. Coded data does reduce the size of the data during harvesting but requires complex and sometimes impossible decoding at the service provider end.
  • 28. Who should decode the data? Data Provider? It appears that the data provider is best equipped to do so. It is their data and therefore they have the greatest potential of doing it correctly. But there are issues Language : Example – target audience “juvenile” in 30 different languages poses more problems for the service providers than simply decoding the data upon receipt. Timeliness (takes too long to co-ordinate lots of data providers) Understanding -- Service provider has a better understanding of the global picture and the needs of the service. It is easier to support multilingual searching and multilingual display if the data is coded for use by the search engine.
  • 29. Who should decode the data? Another view of coding data Codes for language = French are Sometimes “fr” Sometimes “fre” Why not make the code a string like “French”. Think about it ! In my opinion: In today’s environment -- it is better for the service provider (or the aggregator as a service provider) to decode the data. It provides for better consistency. In the long run -- better for the data provider to do so.
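One way a service provider might decode inconsistent language codes on receipt is a small lookup table that maps every known variant to a single display string. The table contents and the names `LANGUAGE_KB` and `decode_language` are assumptions made for this sketch, not part of any VTLS product.

```python
# Hypothetical knowledge-base fragment: 2- and 3-character codes for the
# same language all map to one display string.
LANGUAGE_KB = {
    "fr": "French", "fre": "French", "fra": "French",
    "en": "English", "eng": "English",
    "de": "German", "ger": "German", "deu": "German",
}


def decode_language(code):
    """Return the display string for a language code, or a fallback label."""
    key = code.strip().lower()
    return LANGUAGE_KB.get(key, "Unknown language")
```

Because decoding happens in one place, the display strings could later be swapped for another language without touching the harvested codes.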
  • 30. Codes in Classification Systems Dewey Decimal system LC Classification system Do we really need to show the classification system to the users? Why not display only the strings? Easier for users Easier for switching languages Allows for meaningful drill down capabilities. See example
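A sketch of how class numbers could be turned into display strings that also form a drill-down path. Only three Dewey captions are included, and the table and function names are hypothetical; a real knowledge base would hold the full schedule.

```python
# Tiny illustrative excerpt of Dewey captions (hundreds, tens, units level).
DEWEY_KB = {"500": "Science", "510": "Mathematics", "516": "Geometry"}


def dewey_path(class_number):
    """Return the caption path, broadest first, for a Dewey class number."""
    digits = class_number.split(".")[0].zfill(3)[:3]
    levels = [digits[0] + "00", digits[:2] + "0", digits]
    path = []
    for level in levels:
        label = DEWEY_KB.get(level)
        # Skip unknown levels and repeats (e.g. "510" appears twice for 510).
        if label and label not in path:
            path.append(label)
    return path
```

With this excerpt, `dewey_path("516.35")` yields Science, then Mathematics, then Geometry, so the user drills down through strings and never sees the number.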
  • 31. Characteristics of Facets Modern discovery tools are facet based Entries in Facets can be One to One : Each metadata record has exactly one entry in the facet (like date of publication; language; location) One to many: Each metadata record can have more than one entry in the facet table (like subjects, authors, etc) Drill down: A hierarchically arranged facet that the user can drill down on.
  • 32. Characteristics of Facets LC Subject; Subject subdivisions Mime Type in complex objects One to Many Date; LC Class; Dewey; Place of Publication Language; Content; target audience One to One Drill Down Facets No Drill Down Facets Examples
  • 33. Consistency Considerations Number in the result set MUST equal the sum of hits in all one-to-one facets. Need a method to handle “missing data” or “not available”. Added benefit of identifying the records with “missing” data. Also applies to one-to-one drill down facet data. See example When the scope of the discovery system includes data from sources that use multiple classification systems then how do we aggregate this information? Or do we simply let the chips fall where they may? See example
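The consistency rule for one-to-one facets can be sketched as follows: every record contributes exactly one entry, and records with no value fall into an explicit "Not available" bucket so the counts still add up. The record layout and names here are assumptions for illustration.

```python
from collections import Counter

MISSING = "Not available"


def one_to_one_facet(records, field):
    """Count one facet entry per record, bucketing missing values."""
    return Counter(rec.get(field) or MISSING for rec in records)


# Hypothetical result set; record "C" has no language.
records = [
    {"title": "A", "language": "French"},
    {"title": "B", "language": "English"},
    {"title": "C"},
    {"title": "D", "language": "French"},
]
facet = one_to_one_facet(records, "language")
```

The size of the `MISSING` bucket doubles as a list of how many records need their metadata completed.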
  • 34. Normalization of Data In support of duplication control Capitalization Handling of Diacritics Ending punctuation Embedded punctuation (ISBD punctuation) Misspelled terms Standard forms based on Authority Control systems Normalization for sorting purposes (US sort versus European sort or Swedish sort)
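A minimal sketch of a match key for duplicate control, covering only capitalization, diacritics and ASCII punctuation. Misspellings, authority-controlled forms and locale-specific sort rules are not handled, and the function name is hypothetical.

```python
import string
import unicodedata


def match_key(text):
    """Build a comparison key: no diacritics, no ASCII punctuation, case-folded."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
    no_punct = no_marks.translate(to_space)
    # Collapse runs of whitespace left behind by removed punctuation.
    return " ".join(no_punct.casefold().split())
```

Two titles that differ only in case, accents or trailing ISBD punctuation produce the same key and can be compared directly.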
  • 35. Harvested Data Formats MARC21 UniMarc Dublin Core Qualified Dublin Core MODS METS EAD LAP (Local Application Profile) Numerous crosswalks will be needed between them Unless each data provider can support the harvesting of data in several agreed upon formats. Crosswalks by data provider (for some subset of formats)
  • 36. Harvested Data Formats Data providers should provide data to service providers through OAI-PMH in (at a minimum) Some internationally acceptable format In a published local format To convey the richness of the content To permit deep linking for objects that are complex objects in the collection
  • 37. Enrichment Process Well understood process They supply additional useful information in support of the selection process Images – including book covers; streaming video; TOC; Author notes etc. Several suppliers of this information Syndetics Video Detective Google Amazon OCLC More suppliers will come
  • 38. Use of Knowledge Base Knowledge base may be Global (applies to all data providers in a class) All data providers using MARC 21 based metadata All data providers using DC data All data providers using XMLAP (XML based application profile) or LAP (Local Application Profile) Local (applies to specific data providers) Institutions Locations within institutions Cantons and other location categories.
  • 39. Use of Knowledge Base Examples of Global Knowledge Base LC Classification (conversion of codes to strings) Dewey Classification (conversion of codes to strings) State/Country>> Country>> Continent Aggregation Format (Data from 006 and 008) Language code to text (008 positions 35 – 37 to text) Target Audience (008 position 22 to text) Contents (008 value to text)
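The positional lookups listed above can be sketched for a MARC 21 book record, where field 008 is 40 characters long, position 22 holds the target audience and positions 35 – 37 hold the language code. The lookup tables are small hypothetical excerpts, and the sample field is blank except for those two values.

```python
# Hypothetical excerpts of a global knowledge base for MARC 21 field 008.
AUDIENCE_KB = {"j": "Juvenile", "e": "Adult", "g": "General"}
MARC_LANGUAGE_KB = {"fre": "French", "eng": "English", "swe": "Swedish"}


def decode_008(field_008):
    """Decode target audience and language from a 40-character 008 field."""
    if len(field_008) != 40:
        raise ValueError("MARC 21 field 008 must be 40 characters long")
    return {
        "target_audience": AUDIENCE_KB.get(field_008[22], "Not available"),
        "language": MARC_LANGUAGE_KB.get(field_008[35:38], "Not available"),
    }


# Build a sample field that is blank apart from the two positions of interest.
chars = list(" " * 40)
chars[22] = "j"
chars[35:38] = "fre"
sample_008 = "".join(chars)
decoded = decode_008(sample_008)
```

A UniMarc source would need its own table and positions, which is the reason the knowledge base is keyed by the class of data provider.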
  • 40. Use of Knowledge Base Local Knowledge Base Location names and codes Institutions names and codes Main Locations and Sub Locations Source of content From Library OPAC From Institutional Repositories From Government Documents From State Legal Documents External Link to Content (enrichment process)
  • 41. FRBR – UCL Catalog
  • 42. FRBR – UCL Catalog Examples
  • 43. FRBR – UCL Catalog Examples Owned By Field
  • 44. FRBR – UCL Catalog Examples Owned By Multiple Institutions
  • 45. FRBR There are two approaches to the implementation of FRBR Store the internal data in a hierarchic linked record format. FRBRize records upon display keeping the storage system like a traditional flat catalog Since records are cataloged only once and displayed many many times it is better to use the first method. When harvesting FRBR records do we unFRBRize them and then harvest? Or do we harvest them and then do some post processing? Issue remains unresolved.
  • 46. Duplicate Control -- De-Duping FRBR For a discovery tool used in a multi-institutional environment should a record representing the same manifestation (work + expression + manifestation in FRBR terminology) be repeated if it comes from multiple institutions or should it be de-duped? If it is de-duped then what do we do if one of the institutions has the record as a FRBR record and the other does not?
  • 47. Handling of Names and Subjects Developing solutions using URIs SKOS – Simple Knowledge Organization System According to Rob Styles and others, in their article presently in draft form, called “Semantic MARC, MARC21 and the Semantic Web” LC is working towards releasing their subject headings on the web using SKOS. If this happens, it will make possible several advances in creating web based linked authorities. The possibilities are immense but what do we do now for name and subject authorities in discovery systems? See example
  • 50. Branding and Drill Down (3 of 4)
  • 51. Branding and Expanded Search (4 of 4)
  • 52. How Many Facets? Basic facets and Extended facets Basic facets -- minimal set for every implementation Extended facets – additional facets for special use How many facets? Too few facets are ineffective Too many facets are not user friendly How to identify One to one facets One to many facets Drill down facets
  • 53. How does it work? 1. Harvest data: OAI-PMH used for harvesting the metadata 2. Create KB: Apply the “Knowledge Base” to the Metadata 3. Profile the system for proper facets Facets on the raw data Facets on the derived data from knowledge base 4. Create standardized input for indexing 5. Apply indexing for use by search engine 6. Throw away the harvested metadata but retain index 7. Discover 8. Hyperlink to the source for display of content
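The eight steps above can be sketched end to end in a few lines. This is an illustration under stated assumptions, not the Visualizer implementation: the record layout, the knowledge base contents, the URLs and both function names are hypothetical, and harvesting itself (step 1) is represented by an in-memory list.

```python
from collections import defaultdict


def build_index(harvested, knowledge_base):
    """Steps 2-5: apply the knowledge base and index the standardized values."""
    index = defaultdict(set)
    links = {}
    for rec in harvested:
        rec_id = rec["id"]
        links[rec_id] = rec["url"]  # kept so step 8 can link to the source
        language = knowledge_base.get(rec.get("lang_code"), "Not available")
        index[("language", language)].add(rec_id)
        for word in rec.get("title", "").casefold().split():
            index[("title", word)].add(rec_id)
    return dict(index), links


def discover(index, links, facet, value):
    """Steps 7-8: look up a facet value and return links to the sources."""
    return sorted(links[rec_id] for rec_id in index.get((facet, value), ()))


# Step 1 stand-in: two hypothetical harvested records.
harvested = [
    {"id": "r1", "url": "https://repo.example.org/r1", "lang_code": "fre", "title": "Les Misérables"},
    {"id": "r2", "url": "https://repo.example.org/r2", "lang_code": "eng", "title": "Hard Times"},
]
kb = {"fre": "French", "eng": "English"}
index, links = build_index(harvested, kb)
del harvested  # step 6: discard the harvested metadata, keep only the index
french_hits = discover(index, links, "language", "French")
```

Only the index and the source links survive, so display always goes back to the data provider's own copy of the content.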
  • 54. Visualizer -- Expanding the Architecture The problem is massive. We can admit defeat or build systems to cope with it. These systems must take advantage of the capabilities of the computer and combine it with the knowledge, expertise and inference ability of humans. Two Questions How do you organize the world’s information? How do we visualize the nature and depth of content? [Architecture diagram: VTLS Visualizer OPAC -- Facet Based Searching. Labels: ILS 1,2,3; Repository 1,2; Any System; Mapping Routines (MARC 21, D.C., XML, Direct); Standardized Input; Facet Search Engine; Standardized Output; Display Management; Profiling Interface; Query Interface; Knowledge Base]
  • 55. Conclusions 1. A lot of progress has been made in discovery systems and they have paved the way to a new future. 2. Much remains to be done. 3. Work by DLF under the leadership of John Mark Ockerbloom will set the direction for a considered approach in the development and deployment of discovery systems. 4. In the meantime, Data providers and Service providers need to get together and agree on Who does what? How can we make the systems consistent? How can we do this in a multilingual environment? How can we bring together different sources of information under a single discovery system; in other words how can we best help the users.
  • 56. Questions NISO Discovery Tools Forum, March 26-27, Chapel Hill, NC