The National Archives of Australia faces challenges in managing digital records at scale, including multiple formats, proprietary formats, metadata extraction, storage, and access. The project "Chrysalis" aims to transform the digital business of the Archives by designing systems for complexity and scale through automation, machine learning, and standardization. The project will also establish an "Archives Point of Presence" within agencies to facilitate record transfers and access in an iterative process involving industry and whole-of-government engagement.
Project Chrysalis – Transforming the Digital Business of the National Archives of Australia. Zoe D’Arcy
naa.gov.au
Zoë D’Arcy
Director Business Systems and Online Services
National Archives of Australia
Challenges around ‘Digital’
Prepare: Multiple Formats; Structured and Unstructured; Static and Dynamic; For humans and for computers; Size of each item and volume
Ingest: Proprietary Formats; Transfer; Reliability of Storage; Metadata Extraction and Generation; Security Threats (Viruses); Authentication and Access Control
Preserve: Normalising and Standardising Formats; Licensing for Commercial Formats; Storage of Structure Data; Items that link to external services; Capturing behaviour as well as structure
Manage: Storage of Resources that can be arbitrarily copied and moved electronically; Dynamic content could ‘change itself’; Dependency management; Copyright and Intellectual Property
Access: Classification; Copying; Distribution; Size; Decisions around access; Authentication and Access Control; Validating Generated Data and Metadata
The model envisioned by Project Chrysalis is that the Archives will develop, implement and maintain services for government agencies to enable the identification and extraction of RNA (Retain as National Archives) records within the Agencies themselves. This could be assisted by creating a tool/standard for EDRMS vendors that could be incorporated into recordkeeping software. This tool/standard would help classify records according to Records Authorities and our Shared Service Management Console. This is discussed further in the next section ‘Providing Services for Government Agencies’.
Commonwealth agencies will have their record-producing and record-management systems integrated into the Archives Distributed Digital Record System (ADDRS). Records will be exported and/or harvested, batched and transferred to the Archives via an automated process over the most appropriate channel.
The records will then be quality assured and stored within the Store Cloud for further processing.
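The batch-and-check steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Batch` class, the `make_batches` and `quality_assure` functions, the size limit and the use of SHA-256 checksums are all hypothetical and are not taken from the ADDRS design.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Batch:
    """A hypothetical transfer batch: record payloads plus a checksum manifest."""
    records: dict[str, bytes] = field(default_factory=dict)
    manifest: dict[str, str] = field(default_factory=dict)

    def add(self, record_id: str, payload: bytes) -> None:
        # Store the payload and record its checksum at the time of export.
        self.records[record_id] = payload
        self.manifest[record_id] = hashlib.sha256(payload).hexdigest()

    @property
    def size(self) -> int:
        return sum(len(payload) for payload in self.records.values())


def make_batches(records: dict[str, bytes], max_bytes: int) -> list[Batch]:
    """Group records into batches of at most max_bytes each.

    A single record larger than max_bytes still gets a batch of its own.
    """
    batches: list[Batch] = []
    current = Batch()
    for record_id, payload in records.items():
        if current.records and current.size + len(payload) > max_bytes:
            batches.append(current)
            current = Batch()
        current.add(record_id, payload)
    if current.records:
        batches.append(current)
    return batches


def quality_assure(batch: Batch) -> list[str]:
    """Return the IDs of records whose payload no longer matches the manifest."""
    return [
        record_id
        for record_id, payload in batch.records.items()
        if hashlib.sha256(payload).hexdigest() != batch.manifest.get(record_id)
    ]
```

An empty list from `quality_assure` would mean the batch arrived intact and could move on to storage; any IDs returned would need to be re-transferred or investigated.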
Clients will search for the records via a find and display function. The records will be retrieved by a workflow running within the workflow controller. The workflow will retrieve the records, converting them into an appropriate access format.
The retrieved records will be published to the delivery platform most appropriate for the client. For example, a small digital file for consumption over the Internet would be published to the Local Content Distribution Server and be accessible via a web download, whereas a large video may be pushed to the Content Distribution Network for retrieval by the client using a streaming video player.
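The choice of delivery platform described above amounts to a small routing decision. The sketch below is one possible reading of it; the `AccessCopy` type, the platform labels and the 500 MB threshold are assumptions made for illustration, since no cut-off is given here.

```python
from dataclasses import dataclass

# Assumed cut-off for illustration only; the source gives no figure.
STREAMING_THRESHOLD_BYTES = 500 * 1024 * 1024


@dataclass(frozen=True)
class AccessCopy:
    """A record already converted into an access format (hypothetical type)."""
    record_id: str
    media_type: str  # e.g. "application/pdf" or "video/mp4"
    size_bytes: int


def choose_delivery_platform(copy: AccessCopy) -> str:
    """Send large video to the CDN for streaming; everything else to web download."""
    is_video = copy.media_type.startswith("video/")
    if is_video and copy.size_bytes >= STREAMING_THRESHOLD_BYTES:
        return "content-distribution-network"
    return "local-content-distribution-server"
```

Keeping the decision in one function means the threshold, or the list of platforms, can change without touching the retrieval workflow itself.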
The National Archives actively promotes two metadata schemas for use by Australian government agencies: AGRKMS for government digital records, and AGLS for government websites. Both are based on Dublin Core. Technically, however, the Archives has to support a logical metadata model that allows the ingest of records that conform to the much wider range of metadata schemas in common use amongst government agencies. We have to store those records; manage and automate business processes that enhance a record or move it from one state to another; and also have a searchable index of the records.
Automation Metadata – this data is required to support the management component, and will be managed in a Relational Database Management System. This answers questions like: What state is the information package currently in? Why did the information package change state? Who currently owns it? What format is it in? What is the security level? This information must be accurate and unambiguous as it will be used by the computer system to orchestrate and perform transactions on the information package itself.
Description Metadata – this is the data that is required by the index in a full-text search engine to allow the information package to be found and retrieved from the storage. This may include discovered/derived information, descriptions, annotations, transcriptions, summaries, extra context and textual content.
Content – this is the actual information package. It is the information package that is stored within the object store. It must be able to be retrieved, based on a unique identifier, and be in a format that can be read by the end user. It may also include additional information that has been added before, during and after transfer of the Information Package to the Archives.
These layers are not expected to be distinct or static. As business processes change, it is possible that new or different automation metadata will be required. As information packages are described or new types ingested, new descriptive metadata will be required. And, of course, as new applications and technologies are used by our clients, new content will be coming into the Archives.
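The three layers can be pictured as three small data structures that share an identifier. This is a minimal sketch, assuming hypothetical class and field names; the in-memory `ObjectStore` merely stands in for whatever object store is actually used.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AutomationMetadata:
    """Exact, unambiguous fields of the kind a relational database would hold."""
    package_id: str
    state: str
    state_reason: str
    owner: str
    file_format: str
    security_level: str


@dataclass
class DescriptionMetadata:
    """Open-ended fields fed to a full-text index; new keys can appear over time."""
    package_id: str
    fields: dict[str, Any] = field(default_factory=dict)

    def searchable_text(self) -> str:
        # Flatten every value into one string for indexing.
        return " ".join(str(value) for value in self.fields.values())


class ObjectStore:
    """In-memory stand-in for the object store: content keyed by unique ID."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, package_id: str, content: bytes) -> None:
        self._objects[package_id] = content

    def get(self, package_id: str) -> bytes:
        return self._objects[package_id]
```

The fixed fields on `AutomationMetadata` reflect the need for accuracy, while the free-form dictionary on `DescriptionMetadata` reflects the expectation that descriptive metadata will grow as packages are described.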
One of the features of the Project Chrysalis architecture is the use of business rules to automate as many of the National Archives’ workflow processes as possible. While human decision-making will always be necessary for the Archives’ technical solution to work, automating certain processes will allow scalability and sustainability.
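As a rough illustration of rule-driven automation, the sketch below moves an information package from state to state and records why each change happened, handing off to a person when a rule says so. The states, rule conditions and field names (`virus_scan`, `format_known`) are hypothetical and are not taken from the Chrysalis design.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    from_state: str
    condition: Callable[[dict], bool]
    to_state: str
    reason: str


# Hypothetical rules; real ones would be defined by the business areas.
RULES = [
    Rule("received", lambda p: p["virus_scan"] == "clean", "scanned", "virus scan passed"),
    Rule("received", lambda p: p["virus_scan"] != "clean", "quarantined", "virus scan failed"),
    Rule("scanned", lambda p: p["format_known"], "preserved", "format is supported"),
    Rule("scanned", lambda p: not p["format_known"], "needs-review", "human decision required"),
]


def advance(package: dict) -> dict:
    """Apply the first matching rule and record why the state changed.

    If no rule matches, the same package object is returned unchanged.
    """
    for rule in RULES:
        if rule.from_state == package["state"] and rule.condition(package):
            return {**package, "state": rule.to_state, "state_reason": rule.reason}
    return package


def run_workflow(package: dict) -> dict:
    """Keep applying rules until none fires."""
    while True:
        updated = advance(package)
        if updated is package:
            return package
        package = updated
```

The `needs-review` state is where automation stops and a person takes over, which keeps human decisions in the loop while routine transitions run unattended.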
The diagram below shows an automated Records Extraction process - the preservation workflow.
At several points in the workflow processes, Project Chrysalis looks to use machine learning tools to help with the scale of digital records. For instance, we know that one of the key challenges for our government agency clients is that the National Archives does not want to take all of their records, only those that are classed as ‘Retain as National Archives’ (RNA). The current process of records selection is very manual. We have prototyped a tool that lets staff within those agencies search across record-holding systems for RNA records and begin training the tool on which records do and do not fall into that category, so that it can quickly assist them in this classification process.
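The technique behind the prototype is not described here, so the following is only a generic illustration of the train-by-feedback idea: a tiny naive Bayes text classifier, written with the standard library, that staff could correct one record at a time. The class name `RnaClassifier` and its methods are hypothetical.

```python
import math
from collections import Counter


class RnaClassifier:
    """Tiny multinomial naive Bayes over words, with add-one smoothing."""

    def __init__(self) -> None:
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text: str, is_rna: bool) -> None:
        """Record one staff decision: this text is (or is not) an RNA record."""
        self.word_counts[is_rna].update(text.lower().split())
        self.doc_counts[is_rna] += 1

    def _log_score(self, words: list[str], label: bool) -> float:
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total = sum(self.word_counts[label].values())
        # Smoothed prior, so a label with no examples never yields log(0).
        score = math.log((self.doc_counts[label] + 1) / (sum(self.doc_counts.values()) + 2))
        for word in words:
            # Add-one smoothing; the extra +1 leaves room for unseen words.
            score += math.log((self.word_counts[label][word] + 1) / (total + len(vocab) + 1))
        return score

    def suggest(self, text: str) -> bool:
        """Return True if the text looks more like RNA than non-RNA."""
        words = text.lower().split()
        return self._log_score(words, True) > self._log_score(words, False)
```

In use, each confirmation or correction from a records manager would become another `train` call, so the suggestions improve as staff work through an agency's holdings.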
Project Chrysalis provides the opportunity for the Archives to rethink how it ensures that RNA (Retain as National Archives) records are identified and transferred to the Archives.
Currently the Archives provides advice and training to assist government agencies in building their capabilities to manage digital records effectively. Part of the scope of Project Chrysalis is to build tools that will provide practical assistance for agencies to ensure that their information is managed appropriately as defined by their individual Records Authorities.
The Archives’ role in Government is to ensure that each government agency is creating and managing records of its business appropriately. Ultimately, the Archives wants to ensure that RNA records are transferred to the Archives for long-term preservation and access.
Figure 4 (on the following page) shows our systems as they currently are. Each agency’s functions and activities are analysed in conjunction with the Archives to create a Records Authority.
The work then done to apply and configure that Records Authority in recordkeeping software, and to extract only RNA records, is very manual and labour intensive. It is also a very subjective process, and the Archives currently has no real way of assessing whether the work undertaken by each agency’s recordkeeping staff is accurate.
Records Authorities are a key tool for the Archives to ensure that each government agency is creating and managing its records appropriately. However, applying a Records Authority to recordkeeping systems is currently a very manual, time-consuming and subjective task for records managers.
Project Chrysalis has two proposed implementation solutions to provide agencies with practical tools to identify and extract RNA records. The advantages of building these tools and providing them as a Whole of Government service on a Whole of Government network would be:
Agencies would not have to invest in expensive software solutions to implement the Archives’ tools, as these tools would work with existing recordkeeping software
The Archives would be able to ensure that all RNA records extracted meet the minimum required metadata standard, in the format required to automate the Archives’ business processes for long-term preservation and access
The Archives would be able to assist in managing Machinery of Government (MOG) changes and send notifications of records freezes
Option 1
This is one option we could work towards, and in fact it is probably the approach we will have to take with the first agencies that trial Project Chrysalis. We supply a tool (point of presence) that will be a ‘management console’ sitting within an agency’s system, alongside the agency’s recordkeeping software. This tool will enable the agency’s records managers to monitor, identify and extract RNA records from that agency, based on a machine-readable Records Authority that is held on the Archives’ system.
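To make the idea of a machine-readable Records Authority concrete, here is a minimal sketch of how a management console might apply one. The JSON layout, the `function`/`activity` keys and the `disposal` values are invented for illustration; a real Records Authority is considerably richer.

```python
import json

# Hypothetical, heavily simplified Records Authority.
RECORDS_AUTHORITY_JSON = """
{
  "agency": "Example Agency",
  "classes": [
    {"id": "C1", "function": "policy", "activity": "advice", "disposal": "RNA"},
    {"id": "C2", "function": "finance", "activity": "accounting", "disposal": "destroy"}
  ]
}
"""


def load_authority(text: str) -> dict:
    """Index the authority's classes by (function, activity)."""
    authority = json.loads(text)
    return {(c["function"], c["activity"]): c for c in authority["classes"]}


def extract_rna(records: list[dict], authority: dict) -> tuple[list[dict], list[dict]]:
    """Split records into RNA records and records no class covers.

    Unmatched records are returned separately so a records manager can decide.
    """
    rna, unmatched = [], []
    for record in records:
        cls = authority.get((record["function"], record["activity"]))
        if cls is None:
            unmatched.append(record)
        elif cls["disposal"] == "RNA":
            rna.append(record)
    return rna, unmatched
```

Because the authority is held centrally and only loaded by the console, an update to it would take effect in every agency without changes to the agency's own recordkeeping software.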
Option 2
This is the second option, and potentially the more sustainable option for the Archives going forward – given the multiple software tools that currently exist and will be developed into the future that will contain records. This option has the Archives controlling a ‘Chrysalis Shared Service’, which is able to talk to the recordkeeping software within agencies. This is premised on the idea that the Archives will create a software standard that vendors can comply with, that will enable their software to understand the records authority information held by the Archives, and apply this information automatically to all records as they’re created in order to identify, classify and ultimately extract RNA records.
Interestingly, this last option would build very nicely on current software development trends in the recordkeeping space. For instance, HP is integrating software (Autonomy and Control Point http://www.autonomy.com/products/control-point ) into its Records Manager software, which will enable that software to cluster and classify information from across business systems.
The end-to-end digital business system is to be delivered in four iterations. Each iteration will take two financial years to deliver, and includes design, software development and ICT infrastructure development. With each iteration, components of the system will be moved into production over the life of the project.
It is forecast that the software development of the business system will take eight years to fully deliver and become business as usual, while the changes required for ICT infrastructure will take four years. That time will be a period of change for the Archives’ business areas, as they start working with the digital records in the collection.