Co-presented with Petr Knoth http://www.slideshare.net/petrknoth/ at the "Mining Repositories: How to assist the research and academic community on their text and data mining needs" workshop, which took place at the 11th International Conference on Open Repositories, Monday 13 June 2016.
How can repositories support the text-mining of their content and why?
1. How can repositories support the
text-mining of their content and
why?
@openminted_eu
Dr. Petr Knoth and Dr. Nancy Pontika
Knowledge Media Institute, The Open University
United Kingdom
Twitter: @oacore
6. TDM & Repository
Managers
• Establish and maintain a close collaboration with
researchers
• Extensive experience in advocacy, i.e. open access
• Knowledgeable about the repository’s collection
• Participate in the Academic Institution’s Research
Committees
• Familiarity with Copyright issues and Creative Commons
Licenses
7. How can repositories support
TDM?
TDM is all about processing text and data at
scale. The role of repositories is to facilitate the
aggregation of research papers at a full-text level
(and beyond) effectively enabling TDM services
to operate seamlessly on all available research
content.
8. What is the problem?
• A small study (Knoth, 2013)
• 83 repositories - mainly Eprints with PDF research
outputs
• 1,461,016 metadata records
                     Metadata linked   Content        Content
                     to content        downloadable   machine readable
Mean                 54.1%             34.4%          27.6%
Median               39.5%             16.7%          13.0%
Standard deviation   39.2%             34.2%          31.0%
9. How is content aggregated
today?
• DC over OAI-PMH: vast majority of repositories, never
intended to support content harvesting. The main problem:
linking metadata with content.
“The nature of a resource identifier is outside the scope of the OAI-
PMH. To facilitate access to the resource associated with harvested
metadata, repositories should use an element in metadata records
to establish a linkage between the record (and the identifier of its
item) and the identifier (URL, URN, DOI, etc.) of the associated
resource. The mandatory Dublin Core format provides the identifier
element that should be used for this purpose.”
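As a rough illustration of the linkage the quoted passage describes, the sketch below reads the dc:identifier values out of one harvested Dublin Core record. The sample record, the example.org URLs and the function name resource_identifiers are hypothetical and not taken from the slides; only the XML namespaces are the standard OAI-PMH and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH records carrying Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical record, made up for illustration only.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:repository.example.org:1234</identifier>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example paper</dc:title>
      <dc:identifier>https://repository.example.org/1234/1/paper.pdf</dc:identifier>
      <dc:identifier>doi:10.0000/example</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>"""


def resource_identifiers(record_xml):
    """Return the dc:identifier values of one record, in document order."""
    root = ET.fromstring(record_xml)
    return [
        element.text.strip()
        for element in root.findall(".//dc:identifier", NS)
        if element.text
    ]


if __name__ == "__main__":
    for value in resource_identifiers(SAMPLE_RECORD):
        print(value)
```

The OAI item identifier in the record header lives in a different namespace, so it is not returned; only the identifiers that are meant to point at the resource are.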
10. How is content aggregated
today?
• RIOXX: Just one identifier, recommends the identifier
points to the actual resource being described.
• OpenAIRE Guidelines: identifier links to either the
resource or a jump-off page. Does allow multiple
identifiers.
• ResourceSync
• CrossRef: commercial publishers/journals
12. Principle 1: content
referencing
Repositories should always establish a link from
the metadata record to the item the metadata
record describes, using a dereferenceable identifier
pointing to the version held locally in the
repository. The dereferenceable identifier should
be provided in the appropriate metadata element
of the metadata format in use (e.g. dc:identifier in
the case of Dublin Core). If multiple identifiers are
used, it is recommended to list the local
dereferenceable identifier first.
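One possible reading of the "local identifier first" recommendation is sketched below. It assumes that "local" can be approximated by an http(s) URL whose host equals the repository's own host name; the function names and the example.org/example.com values are hypothetical.

```python
from urllib.parse import urlparse


def is_local_dereferenceable(identifier, repository_host):
    """True for an http(s) URL served from the repository's own host.

    Assumption: an exact host match stands in for "held locally".
    """
    parts = urlparse(identifier)
    return parts.scheme in ("http", "https") and parts.hostname == repository_host


def order_identifiers(identifiers, repository_host):
    """Move local dereferenceable identifiers to the front.

    sorted() is stable, so the relative order within each group is kept.
    """
    return sorted(
        identifiers,
        key=lambda value: not is_local_dereferenceable(value, repository_host),
    )


if __name__ == "__main__":
    identifiers = [
        "doi:10.0000/example",
        "https://repository.example.org/1234/1/paper.pdf",
        "https://publisher.example.com/paper",
    ]
    print(order_identifiers(identifiers, "repository.example.org"))
```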
14. Principle 2: Content
accessibility to machines
Repositories must provide universal access to
machines with the same level of access as
humans have. It is the role of repositories to
allow aggregators to harvest the entire content of
the repository in a reasonable time, so that they
can acquire and maintain up-to-date information
about the repository content.
15. What can repositories do?
• Ensure correct referencing of content from metadata:
• Dereferenceable link which resolves to content
• Locally held (content under its control)
• Using a standard repository platform can help
• Check robots.txt
• Register your repository
• Advocate for good PDF (media) quality of deposited content
• Use monitoring tools
• CORE Repository Dashboard
• OpenAIRE Repository Manager Dashboard
• Machine readable licensing
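For the "Check robots.txt" item above, a repository manager could test what a harvester is allowed to do with Python's standard urllib.robotparser. This is only a sketch: the robots.txt content, the user agent name ExampleHarvester and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed from a string so no network access is needed.
ROBOTS_TXT = """User-agent: *
Crawl-delay: 15
Disallow: /cgi/
"""


def check_access(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay) for one user agent and one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


if __name__ == "__main__":
    for url in (
        "https://repository.example.org/1234/1/paper.pdf",
        "https://repository.example.org/cgi/search",
    ):
        print(url, check_access(ROBOTS_TXT, "ExampleHarvester", url))
```

A full text that is blocked here, or only reachable with a long crawl delay, is effectively unavailable to aggregators even if humans can download it.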
17. Interested in how to TDM
research papers?
We have 3 more
talks tomorrow!
Developer track 1, 11:00
Mining Open Access
publications with CORE
18. Interested in how to TDM
research papers?
We have 3 more
talks tomorrow!
Developer track 1, 11:20
Oxford vs Cambridge
Contest: Collecting Open
Research Evaluation
Metrics for University
Ranking
19. Interested in how to TDM
research papers?
We have 3 more
talks tomorrow!
Papers 4, 4:00
Exploring
Semantometrics:
full text-based
research
evaluation for
open repositories
20. Thank you
Dr. Petr Knoth, Research Fellow
petr.knoth@open.ac.uk
Dr. Nancy Pontika, Open Access Aggregation
Officer
nancy.pontika@open.ac.uk
Editor's notes
Mining individual repositories is not interesting. TDM is about processing at scale. The role of repositories is: …
So why am I talking about what the role of the repositories is? Well I think we have a slight problem here … We have done a study to …
The main problem: linking metadata with content.
OpenAIRE guidelines: https://guidelines.openaire.eu/en/latest/literature/field_resourceidentifier.html
The ideal use of this element is to use a direct link or a link to a jump-off page (persistent URL) from dc:identifier in the metadata record to the digital resource or a jump-off page.
<dc:identifier> field: The aim of the Dublin Core metadata tags is to ensure online interoperability of metadata standards. The <dc:identifier> tag matters because it identifies the resource of the harvested output. CORE expects to find the direct URL of the PDF in this field. When the information in this field is not presented properly, the CORE crawler needs to crawl for the PDF, and success in finding it cannot be guaranteed. This also costs additional server processing time and bandwidth, both for the harvester and for the hosting institution. Three additional points need to be considered with regard to <dc:identifier>: a) the field should give an absolute path to the file, b) it should contain an appropriate file name extension, for example “.pdf”, and c) the full-text items should be stored under the same repository domain.
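The three points a) to c) can be turned into a simple check. The sketch below is not CORE's actual validation; the function name identifier_problems, the problem messages and the example.org URLs are hypothetical, and "same repository domain" is assumed to mean an exact host match.

```python
import posixpath
from urllib.parse import urlparse


def identifier_problems(identifier, repository_host):
    """List which of the three points a dc:identifier value fails."""
    problems = []
    parts = urlparse(identifier)
    # a) absolute path to the file: a full http(s) URL including a host
    if parts.scheme not in ("http", "https") or not parts.netloc:
        problems.append("not an absolute URL")
    # b) an appropriate file name extension such as ".pdf"
    if not posixpath.splitext(parts.path)[1]:
        problems.append("no file name extension")
    # c) stored under the repository's own domain (assumed: exact host match)
    if parts.hostname != repository_host:
        problems.append("not hosted under the repository domain")
    return problems


if __name__ == "__main__":
    host = "repository.example.org"
    for value in (
        "https://repository.example.org/1234/1/paper.pdf",
        "/1234/1/paper.pdf",
        "https://repository.example.org/1234",
    ):
        print(value, identifier_problems(value, host))
```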
The problem is not multiple metadata formats, but the fact that none of them is good enough! Thinking that by supporting the guidelines you allow content aggregation is an issue.
Locally means within the repository's control.
arXiv now has a slightly nicer robots.txt, where anyone is allowed access with a 15 s delay. Still not doable …
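A back-of-the-envelope calculation shows why a 15 s delay is a problem at scale. The collection size of 1,000,000 documents below is an assumed round number for illustration, not a figure from the slides, and the function name harvest_days is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def harvest_days(n_documents, delay_seconds):
    """Days needed to fetch every document once, one request per delay interval."""
    return n_documents * delay_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed collection size; only the 15 s delay comes from the note above.
    print(round(harvest_days(1_000_000, 15), 1), "days")
```

Under these assumptions a single full pass takes roughly half a year, before any re-harvesting to keep the copy up to date.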
Platform: For those who haven’t deployed a repository yet, it is highly advisable not to build the repository platform in house but to choose one of the industry-standard platforms. The benefit of choosing an existing platform is that it provides frequent updates and constant support, and extends repository functionality through plug-ins.
Our ultimate goal is to put in place infrastructure that will enable anyone to make sense of large volumes of scientific data.
The infrastructure is open and transparent.
If you are interested in how we make sense of the large volumes of scientific content.