2. Outline
1. The context – drivers for preservation
2. The problem – challenges faced when trying to re-use data
3. Our solution – metadata for data management & preservation
4. Our recommendations – strategies for making the right metadata choices
9. Common challenges
to re-use/preservation of any type of digital object
I can’t find it
I can’t open it (wrong hardware/software)
I’m not sure it is the right thing
11. Unique challenges
to re-use/preservation of structured data
I’m not sure it is the authoritative data
I don’t understand the meaning of the data - data is
not self-descriptive
I can’t use the data because I can’t harmonize it
with other data
13–17. Our solutions
I can’t find the data (common)
– Have subject experts record locations
– Archivists put it in a safe place
I can’t open the data (common)
– Archivists monitor file formats
I’m not sure it’s the right thing / it’s the authoritative data (particularly hard with data)
– Have subject experts identify key datasets
– Subject experts & archivists capture what has happened to the data
I don’t understand the meaning of the data (particularly hard with data)
– Have subject experts capture important data
– Archivists capture or QA metadata
I can’t reuse the data because it’s not harmonised (unique to data)
– Tools to create more standardised data
– Archivists and subject experts capture detailed metadata
18. To support these processes…
Metadata is key
We could invent our own standard for recording
metadata but there is a better way …
20. Comparison of standards coverage
Dublin Core
• Discovery information about a resource (e.g. Title, Creator, Publication date)
DDI
• Surveys and outputs (Series and Studies)
• Methodology & quality information
• Classifications used
• Dataset descriptions
• Variables used
• Links to documentation
PREMIS
• Objects (significant characteristics, checksums, basic identifying information)
• Events (preservation actions)
• Agents
• Rights
21. Metadata to support re-use
Standards: DDI, PREMIS
I can’t find the data
– Have subject experts record locations
– Archivists put it in a safe place
I’m not sure it’s the authoritative data
– Have subject experts identify key datasets
– Subject experts & archivists capture what has happened to the data
I don’t understand the meaning of the data
– Have subject experts capture important metadata
– Archivists capture or QA metadata
I can’t open the data
– Archivists monitor file formats
I can’t reuse the data because it’s not harmonised
– Tools to create more standardised data
– Archivists and subject experts capture detailed metadata
23. Metadata Top Tips
1. Create structures that will allow you to re-use metadata
tools
2. Use standards that are fit for your content so users can
re-use
3. Consider overlap between standards so you’re using the
right standard for the right job
4. Provide standard based tools and capture at point of
creation to improve quality and efficiency
24. 1. Create structures that will allow you
to re-use metadata tools
Set yourself up to be able to use the same tools to
harvest and mine your metadata (e.g. handy reports,
searching across content types) by:
– developing a standard structure that can support all your content
types
– and recording generic information in generic metadata standards
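
As a rough illustration of what re-using the same harvesting tool across content types could look like, here is a minimal Python sketch. It assumes a hypothetical layout in which every package is a directory directly under one archive root and carries a DublinCore.xml file at its top level; the function name and the wrapper element in the XML are illustrative, not a description of any actual archive tooling.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set (dc:title, dc:creator, ...).
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def harvest_dublin_core(archive_root):
    """Collect title and creator from every package's DublinCore.xml.

    Assumption: each package is a directory directly under archive_root
    and holds a DublinCore.xml file at its top level (hypothetical layout).
    """
    records = []
    for dc_file in sorted(Path(archive_root).glob("*/DublinCore.xml")):
        root = ET.parse(dc_file).getroot()
        records.append({
            "package": dc_file.parent.name,
            # findtext returns the default when the element is absent.
            "title": root.findtext(".//dc:title", default="", namespaces=DC_NS),
            "creator": root.findtext(".//dc:creator", default="", namespaces=DC_NS),
        })
    return records
```

Because the generic file sits in the same place whatever the content type, one function like this can report on statistical datasets and databases alike.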
25.
Data_1500
  DublinCore.xml        (non-format-specific metadata)
  PREMIS.xml            (non-format-specific metadata)
  Original
    data.sas7bdat
    questionnaire.doc
  ArchiveMaster         (format-specific structure & metadata)
    Data
      data.csv
    Documentation
      questionnaire.pdf
    Metadata
      DDI.xml

Database_0120
  DublinCore.xml        (non-format-specific metadata)
  PREMIS.xml            (non-format-specific metadata)
  Original
    database.mdb
  ArchiveMaster         (format-specific structure & metadata)
    Header
      metadata.xsd
      metadata.xml
    Content
      Schema1
        Table1
          table.xsd
          table.xml
26. 2. Use standards that are fit for your
content so users can re-use
Enable future re-use and understanding by recording format
or content-specific metadata in fit-for-purpose standards e.g.
DDI for statistical data
SIARD for databases
MIX for images
27. 3. Consider overlap between standards so
you’re using the right standard for the
right job
Basic identifying information
– DDI: Title, Creator, PublicationDate, ID
– Dublin Core: Title, Creator, Date, Identifier
– Useful to duplicate? Yes
Access information
– DDI: Access Conditions
– PREMIS: Rights entity
– Dublin Core: Rights
– Useful to duplicate? No – PREMIS is the most expressive and generic location
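
One way to make such overlap decisions explicit is to write them down as a small lookup table that tools can consult. The Python sketch below encodes only the two rows on this slide; the dictionary layout and the function name are hypothetical, chosen for illustration.

```python
# Overlap decisions from the slide: which standards can hold each kind of
# information, and whether recording it in more than one is useful.
# The structure of this table is an illustrative assumption.
OVERLAP = {
    "basic identifying information": {
        "elements": {
            "DDI": ["Title", "Creator", "PublicationDate", "ID"],
            "Dublin Core": ["Title", "Creator", "Date", "Identifier"],
        },
        "duplicate": True,
        "preferred": None,
    },
    "access information": {
        "elements": {
            "DDI": ["Access Conditions"],
            "PREMIS": ["Rights entity"],
            "Dublin Core": ["Rights"],
        },
        "duplicate": False,
        "preferred": "PREMIS",
    },
}


def standards_to_record_in(information):
    """Return the standards in which this information should be recorded."""
    entry = OVERLAP[information]
    if entry["duplicate"]:
        # Duplication is useful: record it in every standard that covers it.
        return list(entry["elements"])
    # Duplication is not useful: record it once, in the preferred standard.
    return [entry["preferred"]]
```

For example, `standards_to_record_in("access information")` returns only PREMIS, matching the slide's judgement that duplicating rights information is not worthwhile.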
28. 4. Provide standard based tools and
capture at point of creation to
improve quality and efficiency
At first, you may need to capture or collate all
metadata about data yourself
Think ahead about tools you might be able to
provide to data experts to allow them to record the
information directly in the standard if possible
30. Takeaways
1. Organisations have many reasons to re-use data over time
2. There are unique challenges to preserving data
3. Where possible, save yourself some work and make your
metadata more harvestable and data more understandable by
using international standards like DDI and PREMIS
4. When you use metadata standards like DDI and PREMIS together:
• create generic structures
• use fit-for-purpose standards for specific content
• consider information overlap
• ‘delegate’ metadata capture where possible
Much of the important information about the world we live in today is recorded as structured data rather than unstructured documentation. Structured data is diverse in content and expression, ranging from commercial databases containing client information to geospatial and scientific research datasets. Because structured data, such as statistical data, contains important information that scientists, businesses, and researchers may want to reuse in the future, there is an increasingly urgent push for its preservation.
Preservation and re-use of data requires that data be described with appropriate metadata that will allow future users and machines to discover and interpret it. Organisations that want to preserve data must make a series of choices about how to describe it using the right combination of standards. In this presentation, we will use the Statistics New Zealand Data Archive as a case study for examining the point of connection between a statistical metadata standard that supports active data management (DDI) and a metadata standard that supports preservation (PREMIS). We will share our experience in using DDI and PREMIS to describe statistical data and will highlight how data-specific metadata can be used to support long-term preservation.
We live in a data-driven society today. We’ve got vast quantities of geospatial data driving systems like Google Maps. We’ve got data-intensive sciences like astronomy that work with petabytes of data (a petabyte is 1000 terabytes). National statistical organisations like Stats NZ regularly collect data from individuals and businesses across the country to enable a better understanding of our society. There are swathes of online data collected every day by companies like Amazon and Facebook to help drive marketing decisions. All this data is extremely useful and valuable.
Image credits:
http://rifm.org/default.htm
http://www.stats.govt.nz/browse_for_stats/snapshots-of-nz/nz-in-profile-2012/~/media/Statistics/browse-categories/snapshots-of-nz/nz-in-profile/2012/nzip-2012-food-prices.PNG
For statistical organisations like Stats NZ, the primary driver for preservation of data is re-use of expensive data collections to answer questions that demand longitudinal data.
Image credits: http://www.stats.govt.nz/Publications/MacroEconomic/productivity-stats-sources-methods.aspx
– I can’t find it
– I can’t identify the object
– I can’t open it because I don’t have the right software/hardware, or the object or media is damaged
– I’m not sure it is the right thing (i.e. is it the authoritative version? has it been changed along the way?)
Image credit: http://www.envelop.eu/shop/patterns/details/p/red-green-and-blue-apples
– Researchers have lots of iterations of datasets during processing
– The data uses codes I don’t have any documentation on: what are the variables measuring, what events during the collection phase could have affected the quality of the data, who was surveyed, cryptic variable names, unsure of weighting applied, sources used
Some of the solutions are more about statistical information, while others are about those common preservation or re-use problems. We could have a cumbersome organisation-specific standard, but it is better to have a combination of international standards.
Use a combination of international standards. There are a few great benefits of this:
– This helps us and could help you by saving you from re-inventing the wheel
– International experts and a great community at your disposal
– Interoperable data, which makes it easier to create shared access points and to search and mine across data repositories
To describe data, particularly statistical data, we use DDI, which is a fairly complex standard for managing and describing data. DDI includes Dublin Core information like titles and creators or authors that helps users find data. PREMIS contains information that will help the archive preserve the data via checksums, file formats, and provenance information.
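
To make the checksum idea concrete, here is a minimal Python sketch of recording a file's checksum and re-checking it later. The dictionary keys borrow the names of the PREMIS fixity semantic units (messageDigestAlgorithm, messageDigest), but the record is a plain dictionary for illustration, not valid PREMIS XML, and the function names are hypothetical.

```python
import hashlib


def fixity_record(path, algorithm="sha256", chunk_size=65536):
    """Compute a checksum for a file and return it as a small record."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "messageDigestAlgorithm": algorithm,
        "messageDigest": digest.hexdigest(),
    }


def verify_fixity(path, record):
    """Return True if the file still matches a previously stored record."""
    current = fixity_record(path, record["messageDigestAlgorithm"])
    return current["messageDigest"] == record["messageDigest"]
```

Storing the record when the data enters the archive and re-running `verify_fixity` later is one way to answer the question of whether the object has been changed along the way.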
Significant characteristics to preserve (e.g. fonts, colors, content only).
How do you bring these all together? And what happens if the same information is included in more than one standard? We’ve done some thinking about this and can share our experience and strategies to consider when deciding what to record where!
Looking back at our activities, some are more content-specific, i.e. just about data, and others are more general/common preservation activities.
If you haven’t started managing your data - you can go back to your desk tomorrow and think about what metadata you could start capturing to support long-term re-use – whether you’re the one with the preservation archive or you’re planning/hoping to hand off your data to someone else. If you have already started managing your data – you can check whether your current practices consider the following things
PREMIS – administrative metadata? DDI – descriptive? Other overlap includes DDI Archive module lifecycle events, which could contain the same info as PREMIS events, but this overlap is probably not useful.
At Statistics NZ, we’re implementing a tool that will allow our statisticians to capture the statistical information as DDI.
Don’t ignore data – it’s probably a key part of your core business