Paper on a JISC-funded project based at the UK Data Archive, as presented at the GISRUK 2012 conference, Lancaster University. The project set out to better enable the use of Archive datasets in GIS, primarily by addressing metadata and quality issues of geospatial identifiers.
2. Archived survey data presents a vast
wealth of material with potential for
secondary use in GIS
UNLOCKING THE GEOSPATIAL
POTENTIAL OF SURVEY DATA
3. UK DATA ARCHIVE
• Over 5,000 datasets
• Popular survey data series include:
Quarterly Labour Force Survey
British Household Panel Survey / Understanding
Society
British Crime Survey
4. We set out to explore the availability and
usability of geo-identifiers in the UK Data
Archive collection
These identifiers come in the form of
‘spatial units’ e.g. Ward and Constituency
5. • The availability of geo-referenced data is
ever increasing
• The usability of geo-referenced data ‘out-
of-the-box’ is still generally poor
Reflective of, and contributing to, a divide
between:
• GIS experts – idiosyncratic methodologies
• Untrained with interest – steep learning
curve
6. Three key features of ‘ready-to-link’
survey data for GIS
1. SELECTION
2. QUALITY
3. METADATA
7. 1. SELECTION
Include geographical identifiers which:
• Can be readily transformed
• Are of sufficient resolution to allow for
fine-grained analysis
• Are appropriate to the data subject
8. 2. QUALITY
Include geographical identifiers which:
• Use standard names
• Are coded with a standard coding scheme
e.g. ONS’ GSS Coding and Naming
9. 3. METADATA
Include geographical identifiers which are:
• Time-referenced
e.g. Government Office Region as defined in
2001 as opposed to 1998
• Well documented in their derivation
10. Those collecting data need to adjust
their workflows to enable this
Those curating data need to adjust
their workflows to enable this
11. What should data collectors be doing?
• Considering geographic identifiers BEFORE data
collection!
• Considering standards
• INSPIRE/GEMINI
• GSS Coding and Naming
• Documenting the provenance of geographic
identifiers
12. What will we be doing at the UK Data Archive?
• INSPIRE compliance
(we have published a metadata mapping for DDI-INSPIRE-GEMINI)
• Improving spatial unit definitions through
extensive data cleansing
Standardised
Time referenced
13. What will we be doing at the UK Data Archive?
• Improving resource discovery tools / interface
User friendly
Lessen time spent searching through text
Consider semantics
• Feeding back to data depositors
Guidance on best practice
14. U·Geo Browser
A new web tool for resource discovery
• Revised and augmented variable metadata
• Information clarifying the quality of the geo-identifier
• Integrated spatial unit definitions
• Links to boundary files
Live beta at: geo.data-archive.ac.uk
15. [Interface preview]
16. [Interface preview]
17. U·Geo Browser
• A demo tool using a simple, pragmatic approach
• This tech will be integrated into a central Archive resource
discovery tool, and catalogued data will be updated to
reflect these refinements
-
• A step in the right direction but we need formal semantics
built on persistent vocabularies
• A drive needed to establish this
18. Thanks to:
• all those at the UK Data Archive
• to EDINA for their contributions as consultants
Tom Ensom
tensom@essex.ac.uk
www.data-archive.ac.uk
@UKDataArchive
Editor's notes
Based on JISC Geospatial funded work at the UK Data Archive
Archived survey data has great potential for secondary analysis in GIS, a potential which is not yet fully realised. The UK Data Archive, as distributor of the UK's largest collection of social survey data, is well positioned to spearhead developments in this area.
We curate the largest collection of digital data in the social sciences in the UK: over 5,000 datasets from government departments, research institutes and other organisations, all of which are made available online to UK academia. Some of you might be familiar with our datasets; the better-known series include the QLFS, BHPS and BCS.
The UGeo project looked in depth at this survey data, much of which contains geographic variables of some kind. We wanted to assess the quality and condition of the identifiers and the metadata describing them.
The first part of the project was a systematic information-gathering exercise, working through datasets one by one and pulling out the geography variables for further examination. An observation we were quickly able to make was that the availability of geo-referenced data has been steadily increasing, particularly over the last 10 years. Not only are new studies being geo-referenced, but new varieties of identifier are being added with uses for different disciplines. Lower-level geographies such as postcode and grid reference have also been increasingly made available, thanks to the advent of new licensing options and secure data services. However, the actual state of the variables and their metadata is still relatively poor. For example: timestamps are often missing, making appropriate linking impossible; inappropriate units are used, prohibiting meaningful analysis.
The next stage of our project, then, was to work out exactly how to remedy some of the data problems so evident in our investigation. What exactly are we looking for in ready-to-use georeferences? We suggest three criteria:
Selection is the choice of geographic identifier.
- Ideally of a sufficiently low level that it can be transformed to any other unit, e.g. grid reference, postcode
- Appropriate for analysis, e.g. statistics-appropriate units such as output area
- Appropriate to the data subject, e.g. researchers are likely to want parliamentary constituencies for a political survey, police force areas with the BCS
Quality is how easy it is to unambiguously interpret the variable and its codes:
- Use standard names for units, e.g. the term Scottish Region could refer to administrative or electoral regions, so disambiguate them in the name
- Use standards such as the GSS Coding and Naming scheme produced by the ONS, which provides a standard set of codes for each division of many popular spatial units
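The coding point above can be illustrated with a rough check for non-standard codes in a geography variable. This is a minimal sketch, not the project's own method: it assumes the nine-character GSS layout (one letter, a two-digit entity type, a six-digit instance number), and the function names `looks_like_gss_code` and `non_standard_codes` are hypothetical.

```python
import re

# Assumed layout of a GSS code: one country letter, two digits for the
# entity type, six digits for the individual area (e.g. "E05000001").
GSS_CODE = re.compile(r"[A-Z]\d{2}\d{6}")


def looks_like_gss_code(value: str) -> bool:
    """Return True if the value has the assumed nine-character GSS shape."""
    return GSS_CODE.fullmatch(value.strip().upper()) is not None


def non_standard_codes(values):
    """Return the distinct values that do not look like GSS codes, sorted."""
    return sorted({v for v in values if not looks_like_gss_code(v)})


if __name__ == "__main__":
    # Hypothetical values from a ward variable in a survey dataset.
    sample = ["E05000001", "00AAFA", "ward 12"]
    print(non_standard_codes(sample))
```

A check of this kind only tests the shape of a code; confirming that a code actually exists for a given year would need a lookup against the relevant ONS code register.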
Ensure any spatial unit is well documented:
- A timestamp for each variable, for example Government Office Region as defined in 2001 as opposed to 1998
- Sufficient documentation of provenance. For example, if you’re including a grid reference, how was it derived? Postcode centroids?
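As a sketch of what such a record might hold, the two metadata requirements above (a time reference and a documented derivation) can be expressed as a small Python structure. The class and field names here are hypothetical and are not drawn from DDI, INSPIRE or GEMINI.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GeoIdentifierMetadata:
    """Hypothetical metadata record for one geographic variable."""

    variable_name: str
    spatial_unit: str
    boundary_year: Optional[int] = None  # year the unit definition refers to
    derivation: Optional[str] = None     # how the identifier was produced

    def missing_metadata(self) -> List[str]:
        """List the fields that must be filled before the variable can be linked."""
        gaps = []
        if self.boundary_year is None:
            gaps.append("boundary_year")
        if not (self.derivation and self.derivation.strip()):
            gaps.append("derivation")
        return gaps


if __name__ == "__main__":
    gor = GeoIdentifierMetadata(
        variable_name="gor",
        spatial_unit="Government Office Region",
        boundary_year=2001,
        derivation="Looked up from respondent postcode",
    )
    ward = GeoIdentifierMetadata(variable_name="ward", spatial_unit="Ward")
    print(gor.missing_metadata())
    print(ward.missing_metadata())
```

Keeping the time reference as a separate field, rather than embedding it in the variable label, makes it straightforward to match a variable to the boundary file of the same year.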
In order to meet these criteria, new approaches are needed at many stages of the pre-analysis data lifecycle, both from those gathering the data and going on to deposit it, and from those who preserve and disseminate the data, such as the UK Data Archive.
Briefly, what should those collecting data be doing? This has relevance to those working on research projects as well as big government surveys.
- Considering geo-identifiers prior to data collection instead of tacking them on afterwards, and asking which units and why
- Using data standards at the collection stage
- Documenting how the unit has been derived in precise terms
What are we doing to make the lives of researchers easier? The UK Data Archive will be leading the way in new developments for archives.
1. INSPIRE is an EU directive which helps to ensure a minimum level of information about the geospatial content of a dataset.
2. A number of improvements specific to survey data and spatial units will be required. Much data cleansing work on our catalogue will be taking place over the coming months to bring it up to scratch.
3. Using the enhanced metadata, we will try to make it easier for users to find the data they need. We will be considering interface design and making the relevant documentation easier to find. All this will consider the semantics of the unit-dataset relationship.
4. And finally, we will of course be encouraging data depositors to give us better geospatial data.
An immediate development of the project has been a web tool called the UGeo Browser. This is a demonstration tool that brings the geospatial to the forefront of searchable metadata. It meets many of the requirements I have just outlined, for a subset of our survey data collection:
- Revised and augmented variable-level metadata to ensure accuracy and completeness
- Extra quality information, e.g. this variable is Ward, but it’s missing value labels
- Clear and immediately accessible unit definitions
- Verified links to boundary files, with divergence (if any) between dataset and boundary clarified
Interface preview
Interface preview
In many ways this functions as a proof of concept for our ideas on how ‘studies’ and ‘units’ as entities should interact. The long-term goal is that this tech will be integrated into the Archive’s central catalogue. Data cleansing and application development work has already begun. We’re also now considering the best way of creating formal semantics between units and studies. Perhaps a first step will be persistent identifiers for units…