Integration – the heart of researcher-centric research data management systems – Steve Mackey, Arkivum
1. Integration – the heart of researcher-centric research data management systems
Steve Mackey
15 January 2015
2. Agenda
• Who we are, what we do
• How it works
• RDM systems, where it fits
• Workflows
• Integrations
21 October 2014 2
3. Archive storage with a difference
• Flagship Arkivum100 service with a 100% data integrity guarantee
• Worldwide professional indemnity insurance – Arkivum100
• Long-term contracts for enterprise data archiving
• Fully automated and managed solution
• Audited and certified to ISO 27001
• Data escrow, exit plan, no lock-in
4. Keeping Data Alive for 25+ Years – a timeline of interventions:
• Adding media – an effectively continual process
• Monthly checks and maintenance updates
• Annual data retrieval and integrity checks
• Hardware refresh
• Software migration
• Hardware migration
• Tape format migration – LTO n to LTO n+2
• Support and admin staff migration
• Change of supplier of products and services
• 3–5 year obsolescence of servers, operating systems and software
5. Arkivum Appliance
• CIFS/NFS presentation (integrates easily with local file systems)
• Simple administration of user access permissions and storage allocations
• Robust REST API for application integration
• GUI for file ingest status, recovery pre-staging, security
• Ingest triggered by: timeout, checksum exchange, manifest (bulk)
• Checksum/fixity chain of custody from ingest through replication
• Immutable (WORM)
• Regular (six-monthly) data copy read verify
• Offline escrow data copy (open source, self-describing)
• Data encryption throughout; keys only held by the customer
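The manifest (bulk) ingest trigger implies that an integrating application supplies a list of files together with their checksums. A minimal sketch of producing such a manifest is below; the CSV layout and the `build_manifest` helper are illustrative assumptions, not Arkivum's actual manifest format.

```python
import csv
import hashlib
from pathlib import Path

def build_manifest(folder, manifest_path):
    """Write a CSV manifest of relative path + SHA-256 digest for every
    file under `folder`, suitable as the basis of a bulk-ingest trigger."""
    with open(manifest_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in sorted(Path(folder).rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                writer.writerow([path.relative_to(folder), digest])
```

Writing the manifest outside the folder being ingested avoids the manifest listing itself.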
17. Workflows
• RDM Workflow – the sequence of repeatable processes (steps) through which Research Data passes during its lifecycle, including the steps involved in its creation, curation, preservation, access and eventual disposal.
18. RDM Workflows Report
• JISC Research Data Spring
• A Consortial Approach to Building an Integrated RDM System – “Small and Specialist”
• http://dx.doi.org/10.6084/m9.figshare.1476832
20. [Workflow diagram, flattened from the original slide] Numbered data flows between the Researcher (via a web browser), Local Research Data, the HR system, Figshare (Amazon), DataCite (BL), a Journal, the CRIS (Elements), the Repository (DSpace) and the Archive (Arkivum):
1. Researcher details
2. Data files
3. Data Description
4. Mint DOI
5. Data DOI
6. Data DOI
7. Article
8. Data DOI
9. Article and Article DOI
10. Article and Article DOI
11. Article DOI
12. Dataset Description and Data DOI
13. Dataset Description and Data DOI
14. Data files
15. Data is safe
16. Data is safe
21. Why integrate?
• Simpler and easier RDM processes from a Researcher perspective, which both
encourages adoption and lowers the cost of institutional support to the research
base.
• Clear and repeatable RDM processes that help ensure higher levels of quality and
consistency in RDM across the research base.
• Ability to deploy RDM as community-driven shared service(s) so that smaller
institutions can ‘join forces’ to benefit from having access to a common RDM
infrastructure.
• Scaling RDM up across a large research base using automation and ‘factory’ type
approaches to achieve ‘economies of scale’ and move away from RDM being a
manual and labour intensive endeavour.
• Specifically for the Archive storage layer, this may include:
– Confirmation of integrity of received files via checksums/fixity
– File archive status reporting
– Trigger for original file deletion
– File location, data pool management
– File recovery staging
– Encryption key management
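Two of the integration points above – file archive status reporting and a trigger for original file deletion – can be combined in a simple rule: only delete originals once the archive reports the guaranteed state. A rough sketch, where the status values follow the Red/Amber/Green scheme used later in these notes and the function and its input mapping are hypothetical (an integrating application might populate the mapping from the appliance's REST API):

```python
def files_safe_to_delete(status_report):
    """Given a mapping of file path -> archive status, return the files
    whose originals may now be deleted: only 'Green' files, i.e. those
    covered by the 100% guarantee, qualify."""
    return sorted(path for path, status in status_report.items()
                  if status == "Green")
```

For example, `files_safe_to_delete({"a.dat": "Green", "b.dat": "Amber"})` returns only `["a.dat"]`.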
These are just some of the things that will happen over 25 years of trying to retain data.
In the diagram, a change from blue to yellow is when something happens that has to be managed. In a growing archive, adding or replacing media, e.g. tapes or discs, can be a daily process, so it is effectively continual. The archive system needs regular monitoring and maintenance, which might mean monthly checks and updates. Data integrity needs to be actively verified, for example through annual retrievals and integrity tests. Then comes obsolescence of hardware and software, meaning refreshes or upgrades typically every 3–5 years, for example of servers, operating systems and application software. The format of the data being held may need to change so it can still be read, and even long-lived formats such as PDF/A will eventually be obsolete as they are replaced with something better and applications no longer provide backwards compatibility.
In addition to technical change, there will be the need to manage staff transitions among those who run the system, for example support staff and administrators. And suppliers of products and services will come and go too. There are very few vendors that have been around for a long time in the IT industry, and mergers, acquisitions, changes in direction and companies simply going bust are all commonplace.
Basically, the lifetime of the data is longer than the lifetime of almost everything that’s used to keep that data safe and accessible. The key point is that long-term archiving is an active process and there’s always some form of change going on. And when change happens there’s always a risk that something goes wrong, and there’s always the need to validate that the change has been effected properly. This all requires time, expertise and money. Digital archiving is a case of continual interventions to keep content alive and accessible.
A file is copied onto the appliance; how it gets there may vary depending on the application and integration method. It's worth remembering that you should confirm the data got onto the appliance safely – some partner products perform checksum validation to ensure the act of copying in hasn't introduced data corruption.
The appliance watches for the file being closed (to ensure we don't try to process incomplete files). To ensure no further changes are going to be made, it waits for two complete 'ingest periods' to pass before processing begins, at which point the file is marked as 'Red'. The duration of the ingest period is set on a per-'data pool' basis and defaults to ten minutes.
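One simple way to implement that quiescence check is to poll until the file's size and modification time have been stable for two full ingest periods. This is a sketch of the idea only – the appliance's actual mechanism (watching for file close) is not shown in the source, so polling here is an assumption:

```python
import os
import time

def wait_until_quiescent(path, ingest_period_s=600, periods=2, poll_s=5):
    """Return once the file's size and mtime have been stable for
    `periods` complete ingest periods (default 2 x 10 minutes)."""
    stable_since = time.monotonic()
    last = None
    while True:
        st = os.stat(path)
        sig = (st.st_size, st.st_mtime)
        now = time.monotonic()
        if sig != last:
            # File changed (or first observation): restart the clock.
            last = sig
            stable_since = now
        elif now - stable_since >= periods * ingest_period_s:
            return
        time.sleep(poll_s)
```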
Multiple checksums are taken of the original file, and stored within the service.
The file is then encrypted. To ensure the efficiency of the service, larger files are split into 'chunks' of up to 1 GB in size before being encrypted. A key can be set at any point in the file tree and applies to any object below that point. It is important to note that a key must be applied to a folder before any data is added below it. Any keys that are used with the service must be kept safe by the client, as Arkivum never has access to these. In addition to keeping digital copies of the keys, it is also recommended that a hard copy is made and stored securely. Without the keys, it would be impossible to retrieve data from the service.
An encrypted version of the file is created and then immediately decrypted and compared with the original. If the encrypted archive is validated, the decrypted copy is removed, and multiple checksums of the validated archive are taken and passed for replication into the service.
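The chunking and encrypt-decrypt-compare validation described above can be sketched like this. The source does not state the cipher used, so a trivially reversible XOR stand-in marks where real encryption would go; the function names and the checksum step layout are illustrative assumptions:

```python
import hashlib

CHUNK = 1 << 30  # 1 GB chunk size, per the text

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Stand-in reversible 'cipher' for illustration only -- the real
    service would use proper encryption (algorithm not stated in source)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def encrypt_and_validate(data: bytes, key: bytes, chunk_size=CHUNK):
    """Split into chunks, encrypt each, then immediately decrypt and
    compare with the original before accepting the encrypted copy."""
    encrypted = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        enc = xor_cipher(chunk, key)
        if xor_cipher(enc, key) != chunk:  # round-trip validation
            raise ValueError("encryption round-trip failed")
        encrypted.append(enc)
    # Checksum of the validated encrypted archive, for later fixity checks
    # during replication.
    digest = hashlib.sha256(b"".join(encrypted)).hexdigest()
    return encrypted, digest
```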
The archive is replicated to our first datacentre; once the transfer has completed, its integrity is confirmed using the checksums created earlier.
The archive is then replicated on to our second datacentre, where again the integrity of the transfer is confirmed using the checksums.
Once we have two validated copies in the service, the status of the file is updated to 'Amber'. The file is pretty well protected at this point, but the 100% guarantee does not apply until we reach the 'Green' state.
A third copy is queued to be written for escrow; the tape is not written until a complete tape's worth has been queued. Currently this is 2.2 TB, so depending on the rate at which data is archived, files can remain in the 'Amber' state for some time. Where this risk is unacceptable, 'escrow events' can be purchased.
Once a tape is written and verified, it is scheduled to be couriered to the escrow site. Once a receipt confirming its safe arrival has been received, the status is updated to 'Green'. At this point the 100% guarantee comes into effect.
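The Red/Amber/Green progression described across these steps amounts to a small state machine; a sketch of it is below. The states and transition rules follow the text, but the class itself and its method names are illustrative:

```python
class ArchiveStatus:
    """Ingest status of one file: Red -> Amber -> Green."""

    def __init__(self):
        self.state = "Red"          # ingest has begun
        self.datacentre_copies = 0
        self.escrow_confirmed = False

    def copy_verified(self):
        """A replica in a datacentre passed its checksum comparison."""
        self.datacentre_copies += 1
        if self.datacentre_copies >= 2 and self.state == "Red":
            self.state = "Amber"    # two validated copies; guarantee not yet active

    def escrow_receipt(self):
        """A receipt confirms the escrow tape arrived safely."""
        self.escrow_confirmed = True
        if self.state == "Amber":
            self.state = "Green"    # 100% guarantee now in effect

    @property
    def safe_to_delete_original(self):
        """Originals should only be disposed of in the Green state."""
        return self.state == "Green"
```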
Only now is it safe for any copies of the file outside of the service to be disposed of, or for it to be excluded from any conventional backups.
The validated archive remains in the appliance cache but is now marked as available for deletion when the cache high-water mark is reached.
But more than just archiving is required of course to achieve these benefits.
This is a diagram from the University of Edinburgh RDM blog from just before Christmas. It shows the components required, including:
A Current Research Information System (CRIS) for tracking grants, projects, equipment, research results, etc.
A Data Asset Register, which might be an Institutional Repository, which provides a public gateway to research done at an institution, both publications and data.
Then there are the multitude of public data repositories where open data can be deposited.
And finally a Data Vault as a safe storage facility for research data at various stages in its lifecycle.
http://datablog.is.ed.ac.uk/2013/12/06/the-four-quadrants-of-research-data-curation-systems/
One data centric way to look at Research Data Management is to consider the processes and infrastructure when research data is created and used, which is the ‘research’ side of the diagram, and the processes and infrastructure that is also needed so that some or all of this research data can be kept and made accessible for future reuse, which is the ‘reuse’ side of the diagram. You’ve got live, active and changing data on the left and then curated, retained and highly managed data assets on the right.
Traditionally, Researchers occupy the left hand side and the Library, Research Office etc. occupy the right hand side.
Research Data Management spans the whole space as it covers all aspects of the data lifecycle and should be considered as part of Good Research Practice and hence part of what Researchers do as a matter of course. We might not be there yet, but this is where I think we’d like to be.
It’s also true that the boundaries are likely to get blurred as increasing amounts of research are data-driven based on existing and shared data sets.
One of the challenges comes when thinking about all the tools and systems involved.
So, for example, on the LHS, you might be using a CRIS when developing and bidding a project. When the project is live, Researchers might be using their own devices, collaboration and sharing platforms, lab systems and a host of other tools or platforms to do their research. There might be HPC systems to process data, or do simulations and modelling, and if data sets are large there could be big data analytics and other funky stuff. At some point, publications are made and the outcomes of the work are released.
Then comes the question of what to keep, why, who for, and everything needed to ensure that enough context is captured for any data that should be retained for future use. Data might be kept because it's needed for repeatability and verification of the research, or because it has value to the researcher or others in future research.
Tied in with publication, access and meeting funding body requirements are things like minting DOIs, adding records to the IR and storing data in vaults or other facilities that ensure the data is held safely and securely for future access. Then come activities around ensuring data remains usable, which is digital preservation; ensuring that access and retention continue to meet policies; and then finally, and last but certainly not least, tracking use and citation of the data so impact can be assessed and decisions made on whether to continue keeping it. This might feed back into the CRIS, e.g. for REF, and also into further selection/curation. And again this is an ongoing and cyclic activity.
What we’re seeing in working with a wide range of Universities is the challenge of how to make these circles meet and work smoothly together. You can’t expect the library or research data service part of an institution to get intimately involved with all the ways in which data is created and used. Likewise, you can’t expect Researchers to have to know, understand and use a whole host of systems and tools on the long-term research data management side, i.e. the right.
What we’re seeing is a desire and need for the simplest interface between the two, a kind of meeting in the middle, which provides a very simple solution for the Researchers. Almost like a one-stop shop – and crucially one that has value to the Researcher, so it helps motivate their engagement with research data management. For example, helping them get more citations, downloads and collaboration requests based on their data.
And it's this simple one-stop shop and clear process that I think is so interesting about the Loughborough approach.