The document summarizes system updates made by Crossref in 2014, including infrastructure improvements to hardware, networking (managed DNS that reduced name resolution latency), resiliency and production systems. Core system changes improved performance and added call-back notifications. Features were added for books, standards, metadata queries and the deposit schema. Planned future updates involve integrating ORCIDs, cleaning article titles, modeling relations, redesigning stored queries and adding new content types.
6. System update 2014
core system changes
performance
call-back notification
conflicts
7. October deposit – example of a heavy month
94,349 deposits from 214 depositors took < 60 minutes
247,802 deposits from 1613 depositors took < 10 minutes
System update 2014
Hardware:
Put into service this past year (we’re pretty much a Dell shop). Dell 720xd: Oracle database server,
new virtual host server (cr14), new storage array in Denver.
We also replaced our front-end Internet handoff routers to provide better failover capability.
Network:
We’ve switched to managed DNS services provided by Dyn.
This provides reduced name resolution latency, a more robust global DNS infrastructure
and the ability to load balance and direct traffic based on location.
The servers behind the REST API (api.crossref.org) will probably be the first to take advantage
of this capability.
Resiliency:
While a lot of what we do every day falls under this topic, specific actions taken involve
bolstering our ability to operate through database interruptions. As some of you may know,
we still use Oracle as the main datastore. Oracle does of course offer very sophisticated
enterprise solutions; however, they come at a steep price. Therefore we’ve implemented our
own solutions, which address continuous real-time replication to our disaster center and automatic
failover for read-only operations to a local backup database.
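The read-only failover idea can be sketched in a few lines. This is only an illustration of the general pattern, not the solution described above: the backend names, the `DatabaseUnavailable` exception and the `query` callable are all hypothetical stand-ins for real database connections.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class DatabaseUnavailable(Exception):
    """Raised by a backend that cannot serve the request (hypothetical)."""


def read_with_failover(query: Callable[[str], T], backends: Sequence[str]) -> T:
    """Try each backend in order and return the first successful read."""
    last_error: Optional[Exception] = None
    for name in backends:
        try:
            return query(name)
        except DatabaseUnavailable as exc:
            last_error = exc  # remember the failure and try the next backend
    raise DatabaseUnavailable("no backend could serve the read") from last_error


# Usage with a fake query function whose primary database is down.
def fake_query(backend: str) -> str:
    if backend == "primary":
        raise DatabaseUnavailable("primary is down")
    return f"row from {backend}"


result = read_with_failover(fake_query, ["primary", "local-backup"])
```

Only reads are routed this way in the sketch; writes would still need the primary, which matches the read-only scope mentioned above.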
Production:
Over the past few months we’ve started moving some of what are called Labs projects into a
production environment. This means they operate out of our main datacenter and, perhaps
more importantly, additional staff are becoming familiar with their configuration, deployment and operation.
Performance:
Deposit performance is generally not an issue. Only when large updates occur does
the queue get backed up. Monthly average wait times are about 30-60 minutes, but by count the
vast majority of deposits get processed in a few minutes.
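A small numeric sketch shows how an average of 30-60 minutes and "most deposits take a few minutes" can both be true: a handful of long waits during a large update dominate the mean. The wait times below are invented for illustration and are not measured data.

```python
from statistics import mean, median

# Hypothetical month: 95 deposits wait 2 minutes, 5 wait 800 minutes
# because they arrive while a large update has backed up the queue.
waits_minutes = [2] * 95 + [800] * 5

average_wait = mean(waits_minutes)    # pulled up by the few long waits
typical_wait = median(waits_minutes)  # what most depositors experience
share_fast = sum(w <= 5 for w in waits_minutes) / len(waits_minutes)

print(f"mean {average_wait:.1f} min, median {typical_wait:.1f} min, "
      f"{share_fast:.0%} processed within 5 min")
```

With these made-up numbers the mean lands inside the 30-60 minute range even though 95% of deposits finish within minutes.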
Query performance has improved significantly with a number of implementation changes.
Through configuration changes we’ve increased the throughput for metadata distribution.
In October we had 112 million DOI queries, up from a monthly average of 34 million in 2013.
Callback notifications:
We’ve finally implemented a call-back alternative to receiving deposit log files by email. We’ve always
had a polling alternative to email, where you poll a deposit job looking for a completion status and then
retrieve the log via a specific API. With call-backs you implement an end-point that will receive a
completion notice when the deposit is done. The notice contains details on how to retrieve the
deposit log results, namely a URL from which to retrieve the data. Call-backs also work for batch query
jobs and for cited-by link alerts (which can be large). Of course, we’ve been aggregating cited-by
link alerts for a few years now, which has dramatically reduced the email problems we’ve had.
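To make the call-back flow concrete, here is a minimal sketch of a receiving end-point using Python’s standard library. The notice format is an assumption: the sketch pretends the notice is a JSON body with a `retrieve_url` field, which is a hypothetical name. The real notice layout should be taken from the Crossref documentation.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def extract_retrieve_url(body: bytes) -> str:
    """Pull the log-retrieval URL out of a notice (JSON layout is assumed)."""
    notice = json.loads(body.decode("utf-8"))
    return notice["retrieve_url"]  # hypothetical field name


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            url = extract_retrieve_url(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            self.send_response(400)  # notice was not in the expected shape
            self.end_headers()
            return
        # A real receiver would now fetch the deposit log from this URL.
        print("job finished; results available at", url)
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the end-point until interrupted (port is an arbitrary choice)."""
    HTTPServer(("", port), CallbackHandler).serve_forever()
```

Compared with polling, the receiver does no work until a job completes, and the notice stays small because the (possibly large) results are fetched separately from the URL.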
Conflicts:
This has been a challenging topic for staff and members to stay on top of. The original implementation
(still in operation) could create many more conflicts than necessary. We’re currently completing a project
to clean up many of the outstanding unresolved conflicts.
1.35 million DOIs have been in a conflict at some point; 473K remain in a conflict that has not been addressed.
Books:
We’re nearing the end of implementing a change to books that will ease the process of
assigning DOIs to book content that is hosted in several locations. Coding is done and
testing is under way. We’ll be looking to pilot with a few publishers at the start of the year.
Standards:
A working group consisting of members who deposit DOIs for standards has been focused on improving
the overall treatment of standards DOIs at Crossref. The major outcome to date has been a revision of the
deposit schema now placing the main emphasis on designator. In December the group has an in-person
meeting scheduled to finalize deposits and to address the query processes to maximize discoverability.
Meta-data query:
Changed the response when conflicts are present. This is when two (or more) DOIs have the exact same
metadata, which used to always result in no DOI being returned to the caller. Now we pick a DOI based on the
most recent deposit or on ownership, where we select the DOI that is owned by the same member who owns the title.
This solves a common conflict problem where a new publisher acquires a journal and deposits new DOIs for
articles to which the former owner had already assigned a DOI.
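The selection rule can be sketched as follows. The order in which the two criteria are applied is not stated above, so this sketch assumes ownership is checked first and the most recent deposit breaks ties or serves as the fallback. The `Candidate` type, the member names and the DOIs are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    doi: str
    owner: str        # member that owns this DOI
    deposited: date   # date of the most recent deposit for this DOI


def pick_doi(candidates: Sequence[Candidate], title_owner: str) -> str:
    """Prefer DOIs owned by the title's owner, then the latest deposit."""
    owned = [c for c in candidates if c.owner == title_owner]
    pool = owned or list(candidates)  # fall back to all candidates
    return max(pool, key=lambda c: c.deposited).doi


# Two DOIs with identical metadata after a journal changed hands.
candidates = [
    Candidate("10.5555/old.1", "former-publisher", date(2014, 9, 1)),
    Candidate("10.5555/new.1", "new-publisher", date(2014, 3, 1)),
]
chosen = pick_doi(candidates, title_owner="new-publisher")
fallback = pick_doi(candidates, title_owner="unrelated-member")
```

Here `chosen` is the new publisher’s DOI even though the other DOI was deposited more recently, which is the journal-acquisition case described above. An empty candidate list raises `ValueError` in this sketch.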
Schema:
Allow for abstracts; as of Nov 5 we have 53,915 deposits.
Licensing information in support of text & data mining.
And recently a beta version of a relations sub-schema that allows for establishing relationships between things
assigned a Crossref DOI and other items, which may or may not have a DOI, may have a non-Crossref DOI, or may be
identified using some other scheme like PubMed IDs or URIs.
ORCID:
We’ve recently achieved a common understanding with ORCID on a workflow where
Crossref will post to authors via ORCID for them to accept articles into their profile.
Article title cleanup:
Over the years, with various processing bugs at times (on both sides,
Crossref and the depositing members), a number of article titles have gotten
mangled with respect to non-7-bit-ASCII characters, sometimes called special
characters. We’ll have a cleanup project to make corrections where possible.
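One well-known way such titles get mangled is UTF-8 text being decoded as Latin-1 somewhere along the way. The sketch below shows how that particular case can be detected and reversed. It is an assumption that this class of error is among those affecting the titles, and the function names are hypothetical.

```python
def has_non_ascii(title: str) -> bool:
    """True if the title contains characters outside 7-bit ASCII."""
    return any(ord(ch) > 127 for ch in title)


def repair_mojibake(title: str) -> str:
    """Undo one round of 'UTF-8 bytes read as Latin-1', if that applies."""
    try:
        return title.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return title  # not this kind of damage; leave the title alone


mangled = "\u00c3\u00a9tude"   # e-acute stored as two Latin-1 characters
correct = "\u00e9tude"         # the intended title, with a real e-acute

repaired = repair_mojibake(mangled)
untouched = repair_mojibake(correct)
```

A correctly encoded title fails the round trip and is returned unchanged, so the function is safe to run over titles that were never damaged. Other kinds of damage, such as characters replaced by question marks, cannot be recovered this way.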
Relations:
Version 4.3.5 of the deposit schema now includes a beta version of a relations
sub-schema (the technique now being used to expand the schema). This provides for
the creation of relations between Crossref DOIs and items identified with a DOI or
some other identification scheme.
Stored queries:
We have 252,930,184 unresolved stored queries (29,512,735 resolved).
Right now we run about 1 million a day.
We recently discovered a few flaws in the cyclic processing logic which introduced
unacceptable latency in processing all queries. We’re in the process of fixing these now
and are looking for ways to improve the process.
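The figures above give a rough sense of the latency involved. The calculation below assumes that each daily run works through unresolved queries only and that the rate stays at about 1 million a day; both are simplifications.

```python
import math

unresolved = 252_930_184
resolved = 29_512_735
per_day = 1_000_000  # "about 1 million a day"

# Days needed to try every unresolved query once at the current rate.
days_per_pass = math.ceil(unresolved / per_day)

# Share of all stored queries that have been resolved so far.
resolved_share = resolved / (unresolved + resolved)

print(f"{days_per_pass} days per full pass, {resolved_share:.1%} resolved")
```

Under these assumptions a single pass over the backlog takes 253 days, so any flaw that makes the cycle skip or repeat queries adds months of delay.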
New content types:
For a while now members have been depositing DOIs for content that does not exactly fit
the genres defined in our current deposit schema. Most often this is done by
using the database genre as a sort of catch-all. Having recently been approached with two
more situations, we’re now exploring the possibility of adding a new general-purpose content
type and additional dedicated content types.
The goal here is to more accurately represent the type of content.