This presentation mainly intends to explain how to gather the author name, title, submission year, keywords, subjects, department, university, language, and the number of pages of every thesis document in Theseus and then reuse the gathered data for building a Web portal.
3. Outline
Goal and Motivation
Theseus.fi
DSpace
Getting Data out from DSpace
DSpace OAI-PMH as a Data provider for Theseus
Request Types (Verbs)
Flow Control
Harvesting Data from Theseus’s Data provider
Project Result
Final thoughts
4. Goal
Harvest metadata of thesis documents from Theseus: author name, title, keywords, submission year, ...
Store the harvested data in a separate MySQL database.
Build a Web portal from the stored data.
Goal and Motivation
Why conduct this project?
Thesis data analysis and visualization of overall statistical facts.
Compare thesis documents
Compare universities and departments
Analyse trending keywords used by students every year
5. Theseus.fi
Digital libraries are now commonly used by academic institutions worldwide.
Theseus provides online access to theses and publications from Finnish universities of applied sciences.
End users can search, browse and upload thesis documents to Theseus.
6. ...
Theseus also has an API that third-party organizations can use to work with thesis data.
Theseus is powered by DSpace, a pioneering open source digital asset management system.
The functionality and features of Theseus are inherited from DSpace.
7. DSpace
DSpace is an open source software platform that provides stable, long-term storage, commonly for digital intellectual materials.
Many academic institutions worldwide use DSpace to offer their users easy access to their digital resources.
DSpace can be freely downloaded, used, and even modified to store digital materials.
9. Getting Data out from DSpace
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is used to programmatically access data from DSpace.
OAI-PMH is an HTTP-based protocol that defines how metadata from repositories such as DSpace is shared, published and archived.
10. DSpace OAI-PMH as a Data provider for Theseus
DSpace repositories have an 'OAI Base URL' in addition to the URL for human users.
OAI Base URL: http://publications.theseus.fi/oai/request?
URL for human users: https://www.theseus.fi/
The OAI Base URL is used in machine-to-machine communication between data providers and data harvesters.
When a harvesting request is made using the OAI Base URL, Theseus’s data provider returns XML-formatted metadata of thesis documents.
11. …
Theseus OAI-PMH exposes thesis documents in twelve unique metadata formats.
KansalliKirjasto format:
<kk:field schema="dc" element="contributor" qualifier="author" language="none" value=" Denut, Nicolae "/>
OAI Dublin Core format:
<dc:creator> Denut, Nicolae </dc:creator>
Any of these metadata formats can be requested when querying Theseus’s data provider.
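As a rough illustration of how the same author value is carried in the two formats, the Python sketch below reads it from each snippet using only the standard library. The `kk` namespace URI and the wrapper elements (`kk:metadata`, `record`) are placeholder assumptions made so the snippets parse on their own; the real ones are declared in Theseus's responses. The `dc` URI is the standard Dublin Core elements namespace.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace for the KansalliKirjasto format (assumption, not the real URI).
KK_NS = "urn:example:kk"
# Standard Dublin Core elements namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Sample snippets wrapped in hypothetical root elements so they are well-formed.
kk_xml = (
    f'<kk:metadata xmlns:kk="{KK_NS}">'
    '<kk:field schema="dc" element="contributor" qualifier="author" '
    'language="none" value=" Denut, Nicolae "/>'
    "</kk:metadata>"
)
dc_xml = (
    f'<record xmlns:dc="{DC_NS}">'
    "<dc:creator> Denut, Nicolae </dc:creator>"
    "</record>"
)


def authors_from_kk(xml_text):
    """Authors are stored in the 'value' attribute of contributor/author fields."""
    root = ET.fromstring(xml_text)
    return [
        field.get("value", "").strip()
        for field in root.iter(f"{{{KK_NS}}}field")
        if field.get("element") == "contributor"
        and field.get("qualifier") == "author"
    ]


def authors_from_dc(xml_text):
    """Authors are the text content of dc:creator elements."""
    root = ET.fromstring(xml_text)
    return [(el.text or "").strip() for el in root.iter(f"{{{DC_NS}}}creator")]
```

In the first format the value sits in an attribute, in the second it is element text, so a harvester needs one small extractor per format it supports.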
12. Request Types (Verbs)
OAI-PMH defines six request types (verbs) that can be appended to an OAI base URL to access different repository contents.
Theseus implements all six request types to provide thesis metadata to harvesters.
1. Identify: fetches information about the Theseus data provider itself
2. ListMetadataFormats: returns the list of metadata formats supported by the Theseus data provider
3. ListIdentifiers: lists thesis record identifiers
13. …
4. ListSets: retrieves the set structure (list of universities and departments)
5. ListRecords: gets the complete metadata of a list of thesis documents from Theseus
6. GetRecord: retrieves the metadata of an individual thesis document
By attaching any one of these request types to Theseus’s OAI base URL, a request URL can be formed.
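A minimal Python sketch of how such request URLs can be assembled is shown here. The helper name `build_request_url` is hypothetical, `oai_dc` is the Dublin Core prefix that OAI-PMH requires every repository to support, and the record identifier in the usage note is made up for illustration.

```python
from urllib.parse import urlencode

# OAI base URL as given in the slides (without the trailing '?').
BASE_URL = "http://publications.theseus.fi/oai/request"

# The six request types (verbs) defined by OAI-PMH.
VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListIdentifiers",
    "ListSets",
    "ListRecords",
    "GetRecord",
}


def build_request_url(verb, **arguments):
    """Attach a verb and its arguments to the base URL as a query string."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb, **arguments}
    return f"{BASE_URL}?{urlencode(params)}"
```

For example, `build_request_url("ListRecords", metadataPrefix="oai_dc")` forms a ListRecords request, and `build_request_url("GetRecord", identifier="oai:example.org:123", metadataPrefix="oai_dc")` forms a request for a single (hypothetical) record.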
16. Flow Control
The three request types ListIdentifiers, ListSets and ListRecords can return large lists from Theseus.
In such cases, it is practical to partition the list across a series of requests and responses.
Resumption tokens are the OAI-PMH mechanism that allows data providers to split long list responses into chunks; the harvester sends the token from each incomplete response to request the next chunk.
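The loop a harvester runs to follow resumption tokens can be sketched in Python as follows. This is an illustration under assumptions, not the project's code: `harvest_identifiers` and the `fetch` callable are hypothetical names, and `fetch` stands for any function that takes the query parameters and returns the XML response body as a string.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response elements, in ElementTree's {uri} form.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(fetch, metadata_prefix="oai_dc"):
    """Yield every record identifier, following resumption tokens.

    `fetch` takes a dict of query parameters and returns the response
    body as a string (for example a thin wrapper around an HTTP client).
    """
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(f"{OAI_NS}header"):
            identifier = header.find(f"{OAI_NS}identifier")
            if identifier is not None and identifier.text:
                yield identifier.text.strip()
        token = root.find(f".//{OAI_NS}resumptionToken")
        # A missing or empty token marks the final chunk of the list.
        if token is None or not (token.text or "").strip():
            break
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListIdentifiers", "resumptionToken": token.text.strip()}
```

Passing the fetch function in keeps the paging logic separate from the HTTP layer, so the loop can be exercised with canned responses before it is pointed at a live repository.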
18. Harvesting Data from Theseus’s Data provider
Simple HTML DOM Parser is an open source parser library written in PHP to read, modify, and return structured content from external data sources.
This parser library can create a Document Object Model by loading structured data from a URL.
To get nodes of the DOM object, this library provides a method called “find()”.
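The project itself used the PHP library described above. As a stand-in for the same idea, the Python sketch below uses the standard library's ElementTree, whose `find()` and `findall()` methods play a similar node-lookup role, to pull set identifiers and names out of a ListSets response. The sample response, including its set names and setSpec values, is made up, and `parse_sets` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Prefix mapping for the OAI-PMH 2.0 response namespace.
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical ListSets response; the names and specs are invented.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>com_1</setSpec><setName>Example University</setName></set>
    <set><setSpec>col_1_2</setSpec><setName>Example Department</setName></set>
  </ListSets>
</OAI-PMH>"""


def parse_sets(xml_text):
    """Return one dict per <set> node with its setSpec and setName."""
    root = ET.fromstring(xml_text)
    sets = []
    for node in root.findall(".//oai:set", NS):
        spec = node.find("oai:setSpec", NS)
        name = node.find("oai:setName", NS)
        sets.append({
            "setSpec": (spec.text or "").strip() if spec is not None else "",
            "setName": (name.text or "").strip() if name is not None else "",
        })
    return sets
```

Calling `parse_sets(SAMPLE)` yields a list of dictionaries, one per set, which maps directly onto the identifier and name columns gathered for universities and departments.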
19. Summary of gathered theses metadata
Universities: identifier (setSpec), university name, ListSets request URLs, total number of papers
Departments: identifier (setSpec), department name, ListSets request URLs, total number of papers, university identifiers
Thesis documents: thesis identifier, author names, titles, GetRecord request URLs, department identifiers, university identifiers, keywords, subjects (official keywords), number of pages, year, language
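One way the three groups of gathered metadata could be laid out in a relational database is sketched below. The project stored its data in MySQL; this sketch uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a server. All table names, column names and the `papers_per_university` helper are assumptions, and keywords and subjects are left out to keep the example short.

```python
import sqlite3

# Hypothetical schema mirroring the three groups of gathered metadata.
SCHEMA = """
CREATE TABLE universities (
    set_spec TEXT PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE departments (
    set_spec   TEXT PRIMARY KEY,
    name       TEXT NOT NULL,
    university TEXT NOT NULL REFERENCES universities(set_spec)
);
CREATE TABLE theses (
    identifier TEXT PRIMARY KEY,
    title      TEXT,
    author     TEXT,
    year       INTEGER,
    language   TEXT,
    pages      INTEGER,
    department TEXT REFERENCES departments(set_spec),
    university TEXT REFERENCES universities(set_spec)
);
"""


def papers_per_university(conn):
    """Return (university name, number of theses) pairs, largest first."""
    return conn.execute(
        "SELECT u.name, COUNT(t.identifier) "
        "FROM universities u "
        "LEFT JOIN theses t ON t.university = u.set_spec "
        "GROUP BY u.set_spec "
        "ORDER BY COUNT(t.identifier) DESC, u.name"
    ).fetchall()
```

With the schema loaded through `conn.executescript(SCHEMA)` on a `sqlite3.connect(":memory:")` connection, a portal page showing each school's number of papers reduces to a single aggregate query.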
21. Project Result
• How many thesis documents are in Theseus?
• How many papers does each school have in Theseus?
• How many papers is each school publishing every year?
• What departments are there in each school?
• How many papers belong to each department?
• How many pages does each paper have?
• In what language is each paper written?
• How many times has each paper been downloaded by Theseus visitors?
• What are the keywords of each thesis document?
22. The front page of the Web portal aims to give better insight into each school's contribution to Theseus.