Persistent Identifiers (PiDs) for research – why we have them, why there are so many PiD systems, how they work looking at a few examples (Handles, DOIs, ORCIDs), how to choose one, can PiD systems fail and what’s happening in the international PiD community
3. What are Persistent Identifiers (PiDs)?
A persistent identifier is a long–lasting reference
to a digital resource
Photo attribution: Jan Hettenhausen - j.hettenhausen@griffith.edu.au (reproduced with permission)
4. Use PiDs to connect…
Researchers
Publications
Data
Software
Methods
Equipment
???
Why use PiDs?
PiDs play a key role in the discoverability,
accessibility and reproducibility of research.
5. Why are there so many PiDs?
Marked by differences in:
• Purpose
• Scope
• Underlying technology
• Governance and social infrastructure
• Metadata collected
• Cost
• Extent of use
ARK
PURL
NLA party ID
6. Example: The Handle System
• Run by CNRI
• Robust system
• Widely used in publication repositories
• Used to identify research datasets
7. How do Handles work?
Example: http://hdl.handle.net/11343/130078
http://hdl.handle.net = resolver service
/
11343 = prefix identifying assigning body (Uni Melb)
/
130078 = suffix identifying resource (Melb Uni report)
8. Example: Digital Object Identifiers (DOIs)
• Run by the International DOI Foundation
• Robust – built on the Handle System
• Origins in publishing industry
• Used to identify and cite publications and
research datasets
• The most widely used PiD for research data
9. How do DOIs work?
This is an example from Griffith University:
http://doi.org = resolver service
/
10.4225 = prefix identifying the assigning body (ANDS)
/
01 = Suffix 1 – the institution identifier (Griffith University)
/
4F3DB08617645 = Suffix 2 – the resource item or collection
identifier (a dataset held in the Griffith data repository)
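The breakdown on this slide can be sketched in a few lines of Python. This is only an illustration: the function name split_doi is hypothetical, and the list of accepted written forms (bare, "doi:" prefix, resolver URL) is an assumption about how DOIs commonly appear. Splitting the suffix into an institution part and an item part follows the convention shown on this slide; to the DOI system itself the suffix is a single opaque string.

```python
import re

# Forms a DOI is commonly written in: bare, "doi:"-prefixed, or as a resolver URL.
# (Assumed list of forms, for illustration only.)
_LEADING = re.compile(r"^(?:doi:\s*)?(?:https?://(?:dx\.)?doi\.org/)?", re.IGNORECASE)


def split_doi(text: str) -> tuple[str, str]:
    """Return (prefix, suffix) for a DOI written bare, with 'doi:', or as a URL."""
    doi = _LEADING.sub("", text.strip(), count=1)
    # Split on the first slash only: the suffix may contain further slashes.
    prefix, sep, suffix = doi.partition("/")
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError(f"not a DOI: {text!r}")
    return prefix, suffix


prefix, suffix = split_doi("http://doi.org/10.4225/01/4F3DB08617645")
# Slide convention: first suffix segment = institution, remainder = item.
institution, _, item = suffix.partition("/")
print(prefix, institution, item)
```

The same function accepts the citation form used later in these slides, e.g. "DOI:http://doi.org/10.5334/dsj-2017-009".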
10. More about DOIs
• Metadata required! Example: DataCite Metadata Schema
https://schema.datacite.org/
• DOI search services e.g. DataCite
https://search.datacite.org/
• Cost involved but some agencies like ANDS offer a free
service
• To get a DOI through the ANDS service: m2m or manual
minting
11. Example: ORCIDs
• Run by ORCID organisation
• Identifier for people (researchers)
• Links people with their research ‘works’
• Widely used internationally
• Australian research sector-wide endorsement
• Embedded in scholarly workflows
12. How do ORCIDs work?
https://orcid.org/0000-0003-0635-1998
• 16 digit identifier based on ISNI block
• Prototype: Thomson Reuters ResearcherID
• Most metadata fields are optional
• Free for researchers, fee for members
(organisations)
• Public API (free) and premium API
(members)
• Transparent governance and development
process
13. The power of linking PiDs
• International efforts to link ORCIDs
(researchers) with DOIs (publications and
data)
• The Scholix initiative:
• a global framework to improve the links
between publications and data
• beneficial for all, especially publishers
(display this link in journals) and
repositories (link back to data held in
repositories)
14. Which PiD to choose?
Evaluate the PiD service:
• Purpose
• Scope
• Underlying technology
• Governance and social infrastructure
• Metadata collected
• Cost
• Extent of use
• Trustworthiness?
Choose the best fit PiD for the type of resource and its point in the
research lifecycle
Better to choose one than none!
15. PiDs sound great - but hang on….?
Erm…
• Recent PiD crises: PURL, LSID
• “Zombie PiDs”?
Remember:
• PiDs are both social and technical systems
• Governance/organisations can be the Achilles heel of PiD
systems
See: Klump, J. & Huber, R., (2017). 20 Years of Persistent Identifiers
– Which Systems are Here to Stay?. Data Science Journal. 16, p.9.
DOI:http://doi.org/10.5334/dsj-2017-009
Have PiD systems ever failed? What’s the
guarantee they will stay “long lasting”?
17. Summary
• PiDs play a key role in the discovery, accessibility and
reproducibility of research.
• There are many PiD systems which vary in purpose, scope,
underlying technology, governance and social infrastructure,
metadata collected, cost, extent of use.
• When evaluating which PiD to assign to a resource, consider:
• The differences above and importantly, trustworthiness
• Better to assign one or more PiDs than no PiD at all
• Remember that PiDs are about social as well as technical
infrastructure. It is the responsibility of the PiD owner (e.g. a
university) to update the PiD if the resource location changes.
• PiDs are evolving so get your geek on and join in the discussions!
18. Further resources
• ANDS website for PiD Guides, DOI service, Handle
service:
http://www.ands.org.au/
• ANDS PiDs short bites webinar series:
https://www.youtube.com/user/andsdata (persistent
identifiers playlist) - more to come in this series!
• THOR Project: https://project-thor.eu/ and webinar series:
https://project-thor.eu/2017/05/05/webinar-series-pids-what-why-how/
• ICSU/CODATA Data Science Journal special issue: 20
years of Persistent Identifiers
http://datascience.codata.org/collections/special/20-years-of-persistent-identifiers-applications-and-future-directions/
19. With the exception of logos, third party images or where otherwise indicated, this
work is licensed under the Creative Commons Australia Attribution 3.0 Licence.
ANDS is supported by the Australian
Government through the National Collaborative
Research Infrastructure Strategy Program.
Monash University leads the partnership with
the Australian National University and CSIRO.
Natasha Simons
natasha.simons@ands.org.au
Tw: @n_simons
ORCID: https://orcid.org/0000-0003-0635-1998
Editor's Notes
Thank you to Jaye and the team for inviting me to speak at the inaugural VALA Tech Camp
I’ll start this talk with a confession:
I’m Natasha from ANDS and I’ve been a PiD nerd for about 7 years now
It all started when I was working at the National Library and became the Business Analyst on the very attractively named “Party Infrastructure Project” – an ANDS funded project to develop identifiers for people and organisations in Trove
I went on to work as an IT project manager in eResearch Services at Griffith University where I minted the first DOI in Australia for a dataset using the ANDS DOI minting service
There were a lot of learnings from this and I wrote about it in journal articles and blogs
Then I joined ANDS and worked on the national ORCID Working Group to develop a sector-wide approach to ORCID and helped shepherd in the Australian ORCID Consortium with 40 institutional members
I am also an ORCID Ambassador
So I’ve done a lot of work in the area of PiDs but I still feel far from an expert on the topic
Today I’m going to share what I know about PiDs for research – why we have them, how they work, how to choose one and what’s happening in the international PiD community
[hands up: who has heard of ORCID? Who has an ORCID? Who has heard of DOIs? Handles? How about IGSN?]
First of all, what’s the problem that persistent identifiers are trying to address?
Everyone will be familiar with this – clicking on a web link that takes you either to a ‘page not found’ error page like this one or to content that is actually unrelated to the link you clicked. Both usually happen because a web resource has been moved to another location and you have the old link.
A ‘page not found’ error is frustrating and in the context of research, it is disastrous. It means a scholarly resource, which may have been cited, cannot be found, verified, potentially cited again and so on.
This is the problem that persistent identifiers are there to address.
A persistent identifier is simply a long–lasting reference to a digital resource.
Even if the resource moves location on the web, the persistent identifier is there to make sure the link always resolves.
So if a PiD is used as a citation link in scholarly literature, it will always resolve to information about the resource (either a descriptive metadata page, the resource itself, or information about the removal of the resource from the web).
PiDs are key to facilitating the discovery of scholarly resources like journal articles and research data. They also play a role in linking scholarly resources (e.g. publications and data) as well as tracking the impact of these resources. It’s important to note that PiDs do not guarantee a link will never be broken, but they create a technical and social framework that makes broken links far less likely.
PiDs play a key role in the discoverability, accessibility and reproducibility of research
How do they do this?
Provide social and technical infrastructure to identify a research output over time
Enable machine readability
Apply to a variety of research objects and related “things” – researchers, institutions, outputs
Enable research objects and things to be labeled uniquely and disambiguated from one another
Facilitate the linking of research objects, related people and things so a reader may discover a publication, its related dataset, software, methods etc.
PiDs are an integral part of the semantic web
So why are there so many PiD systems? Well, each PiD system is different from another. They vary in:
Purpose – for example they can be general (covering all scholarly resource types, e.g. DOIs) or discipline-specific (e.g. Life Sciences ID)
Underlying technology (more on this shortly)
Governance e.g. non-profit, cross-sector collaboration effort or company-driven
Metadata collected – some require more than others
Cost – some are free, some not
Extent of use – PiDs vary in uptake
Most PiDs for research work by separating the identity of a scholarly object from its location on the web
Let’s look at the Handle System as an example.
Handles are run by the Corporation for National Research Initiatives (CNRI) in the USA
CNRI is a not-for-profit organization formed in 1986 to undertake, foster, and promote research in the public interest.
The Handle system is very robust and is widely used internationally among repositories. It also provides the underlying infrastructure for Digital Object Identifiers (DOIs).
Characteristics:
Central handle registry where handle identifiers are recorded
Distributed computer system including handle proxy servers
Model: assign one Handle per resource
Minimal cost (and this is usually borne by the Handle issuer such as an institution running a Handle proxy server)
Unique, global, scalable, reliable
Note: PiDs are both technical AND social infrastructure, so if the URL of a resource changes then the owner must update the URL in the Handle system
The Handle identifier is made up of:
a prefix that identifies the “naming authority”
a suffix that identifies the “local name” of the resource
a resolver service: http://hdl.handle.net
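As a rough sketch, the three parts listed above can be pulled out of the Handle URL from the slide with a little Python. The class and function names here are hypothetical and chosen only for illustration; nothing in this snippet contacts the Handle System.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class Handle(NamedTuple):
    resolver: str  # proxy service that redirects to the current location
    prefix: str    # the "naming authority"
    suffix: str    # the "local name" of the resource


def parse_handle_url(url: str) -> Handle:
    """Split a Handle URL into resolver, prefix and suffix."""
    parts = urlparse(url)
    # The path looks like "/<prefix>/<suffix>". Split on the first slash
    # only, so that any further slashes stay inside the suffix.
    prefix, sep, suffix = parts.path.lstrip("/").partition("/")
    if not (parts.netloc and prefix and sep and suffix):
        raise ValueError(f"not a Handle URL: {url!r}")
    return Handle(f"{parts.scheme}://{parts.netloc}", prefix, suffix)


handle = parse_handle_url("http://hdl.handle.net/11343/130078")
print(handle.resolver, handle.prefix, handle.suffix)
```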
Let’s look at another example of persistent identifiers: DOIs
These came from the scholarly publishing industry. DOIs are routinely assigned by publishers to identify journal articles and other published works. There is a great deal of technical and social infrastructure invested in DOIs and according to recent research by the THOR project they are by far the most widely used persistent identifier for research objects including research data.
DOIs are:
An implementation of The Handle System
Applicable to a variety of digital objects e.g. in research: publications, data, software, methods, “grey literature”, theses etc.
Governed by the International DOI Foundation which is a not-for-profit organisation
DOIs are issued by DOI Registration Agencies or their agents
CrossRef: scholarly publications
DataCite: datasets, software, “grey literature”
Agent examples: EZID CDL, BL, ANDS
Unique, global, scalable, reliable
Like the Handle system it is built on, DOIs have:
a prefix that identifies the “naming authority”
a suffix that identifies the “local name” of the resource
a resolver service
More metadata is required to mint a DOI than a Handle.
For Handles - you can get away with pretty much just the URL and Title of the resource.
For DOIs – much more is required and there are many optional and recommended metadata elements as well
DataCite schema example – 6 mandatory, 6 recommended, and 6 optional fields
Because more metadata is collected, DataCite also offer a search service – all datasets, software, grey literature etc minted with a DOI in the one search portal
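To make the "mandatory fields" point concrete, here is a minimal sketch of a check an institution might run before minting. The six property names are the mandatory properties of the DataCite Metadata Schema 4.x as I understand it; they are not listed in this talk, so check them against the current schema. The flat dictionary layout, the function name missing_mandatory and the placeholder values are all hypothetical.

```python
# Mandatory properties of the DataCite Metadata Schema 4.x (assumption:
# verify against the current schema version before relying on this list).
MANDATORY = (
    "Identifier",
    "Creator",
    "Title",
    "Publisher",
    "PublicationYear",
    "ResourceType",
)


def missing_mandatory(record: dict) -> list[str]:
    """Return the mandatory properties that are absent or empty in record."""
    return [name for name in MANDATORY if not record.get(name)]


# Hypothetical draft record with placeholder values.
draft = {
    "Identifier": "10.4225/01/4F3DB08617645",
    "Title": "Placeholder title",
    "Publisher": "Placeholder publisher",
}
print(missing_mandatory(draft))
```

A real DataCite record is nested (for example, a creator has a name and optional identifiers), so this flat check is only a starting point.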
Cost – minimal but may be covered by the DOI agent e.g. ANDS
For the ANDS service: accessed by institutions not individuals – m2m and manual options for minting and managing DOIs
Similar to Handles, DOIs require a commitment from the owner of the resource to manage updates to the location of the resource within the DOI infrastructure
e.g. if it moves location, update DOI. If it is removed, update DOI with location of your tombstone record
You can see from this that persistent identifiers do not guarantee the long life of the resource itself, they work to guarantee the long life of access to information about the resource
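The idea of separating identity from location, and the owner's duty to keep the location current, can be shown with a toy in-memory resolver. Everything here is hypothetical: the ToyResolver class, the identifier "99999/demo-1" and the example.org URLs are made up for illustration, and real Handle or DOI services are updated through their own tools rather than anything like this.

```python
class ToyResolver:
    """In-memory stand-in for a PiD registry: identifier -> current location."""

    def __init__(self) -> None:
        self._locations: dict[str, str] = {}

    def mint(self, pid: str, url: str) -> None:
        """Register a new identifier; identifiers are never reused."""
        if pid in self._locations:
            raise ValueError(f"{pid} already exists")
        self._locations[pid] = url

    def update(self, pid: str, url: str) -> None:
        """Point an existing identifier at a new location."""
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = url

    def resolve(self, pid: str) -> str:
        """Return the location currently recorded for the identifier."""
        return self._locations[pid]


resolver = ToyResolver()
pid = "99999/demo-1"  # made-up identifier

resolver.mint(pid, "https://repo.example.org/datasets/42")
# The repository is migrated: the owner updates the record, the PiD stays the same.
resolver.update(pid, "https://data.example.org/items/42")
# The dataset is withdrawn: the PiD now points at a tombstone page.
resolver.update(pid, "https://data.example.org/tombstones/42")
print(resolver.resolve(pid))
```

The citation in the literature never changes; only the record behind it does, and only if the owner remembers to update it.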
An example of a different type of identifier is ORCID.
ORCID is:
An identifier for people
Enables researchers to uniquely and unambiguously distinguish themselves from other researchers with the same name AND link all of their scholarly works in the one record regardless of the work type (important for credit and attribution)
a not-for-profit organisation supported by members
unique, global, scalable, reliable
collect metadata – majority optional
have Australian research sector-wide endorsement (plus a consortium)
Have fast become the international standard for research identifiers – embedded into scholarly publishing workflows, endorsed and supported by every stakeholder in the research sector
16 digit identifier based on ISNI block
Prototype: Thomson Reuters ResearcherID
Most metadata optional:
Some synched to record from systems like CrossRef and DataCite
Some manual input
Free for researchers, fee for members (organisations)
Public API (free) and premium API (members)
Transparent governance and development process (see public Trello boards)
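One detail not covered in the slides: ORCID's public documentation describes the last of the 16 characters as a check character calculated with ISO 7064 MOD 11-2, which is why some ORCID iDs end in "X". A minimal sketch of that calculation follows, using the ORCID iD from these slides. The function names are my own, and the check only tests that an iD is well formed, not that it is registered.

```python
import re


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid: str) -> bool:
    """Check the shape and check character of an ORCID iD (bare or as a URL)."""
    # Keep only the part after the last slash, then drop the hyphens.
    compact = orcid.rsplit("/", 1)[-1].replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_character(compact[:15]) == compact[15]


print(is_well_formed_orcid("https://orcid.org/0000-0003-0635-1998"))
```

A check like this catches most typing errors before an iD is stored in a repository record.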
Linking persistent identifiers plays a key role in research reproducibility, discovery and accessibility
That’s why there are international efforts to do this
Two I will mention here:
THOR project based in Europe has undertaken efforts to link ORCIDs with DOIs
The Scholix initiative is new and comes out of the ICSU World Data System and the Research Data Alliance: it involves publishers and infrastructure providers, and provides a global framework to improve links between publications and data
beneficial for all, especially publishers (display this link in journals) and repositories (link back to data held in repositories)
More info on ANDS website and recent webinar on this topic
Evaluate the PiD service:
Purpose
Scope
Underlying technology
Governance and social infrastructure
Metadata collected
Cost
Extent of use
Trustworthiness?
Choose the best fit PiD for the type of resource and its point in the research lifecycle
Better to choose one than none! A resource – over time – may even get 2 PiDs and in the future these PiDs may be linked via a provenance trail e.g. this dataset had a handle and now it has a DOI.
When minting the PiD, include as much metadata as you can
More metadata helps linkages, attribution and discovery
PiD crisis:
PURL – introduced by OCLC and there are over 16,000 PURLs in Google Scholar. Around 2015 OCLC lost interest, tech freeze about 18 months, eventually Internet Archive took over and has brought PURL back from the brink
LSID – strongly supported by the biodiversity informatics community’s standardisation authority, but in recent years the technology was the topic of hot debate and the system came into crisis. Maintenance was terminated and a resolver made available in the interim as discussions continue. Meanwhile about 14,000 LSIDs are listed in Google Scholar and their future is in doubt
Many PID systems were developed by various communities and, for different reasons, have failed to withstand the test of time, eventually sliding into paralysis and what Jens Klump from CSIRO calls a ‘zombie’ stage where identifiers continue to exist but the PID system loses its resolution service.
Klump and his colleague Huber suggest PiD governance is the key – they suggest that PiD systems have exit plans and that universal evaluation criteria be developed for assessing PiD systems.
There is a cool and groovy international PiD community and a lot is happening:
PiDapalooza – 2 day “festival” (conference) of everything PiD related – was in Iceland last year and on again next January in California
THOR project funded by European Commission
Goal: every researcher has seamless access to persistent identifiers and works will be uniquely attributed to them
Nice work done on PiD usage statistics and targeted uptake of PiDs for research
Fantastic webinar series
CrossRef, DataCite, ORCID collaboration project on persistent identifiers for organisations and other PiD related collaborations
International RDA PiD Interest Group
New PiD on the Block in Australia: RAID – Research Activity Identifier built on the ANDS Handle service to identify activity as it happens at different points in the research lifecycle. First customer will be University of Queensland.
If you’ve found this talk exciting, come join me and be a PiD nerd too!
Here are some resources in the slides that you can access to get you started
Even if you don’t want to join the legions of PiD nerds, I hope you have all learned something from this talk, thanks for listening