Presentation given by Kate Fernie at the Big Data in Archaeology conference in March 2019. The presentation covers the background to European initiatives to connect monument and building inventories with museum collection databases, and introduces CARARE and its work to aggregate a diverse range of archaeological datasets for Europeana. It also describes the development of the CARARE metadata schema, the process of metadata mapping, and the challenges and opportunities of normalising and enriching the provided metadata to increase its discoverability in the multilingual context of Europeana.
2. • CARARE
• A brief history
• Datasets and their diversity
• Metadata and schemas
• Challenges
• Possibilities
Introduction
3. CARARE
Connecting Archaeology and Architecture in Europe
• Began as an EU-funded best practice network in 2010
• Established as a membership association in 2016
• Objective: Advancing professional practice and fostering appreciation
of the digital archaeological and architectural heritage
• Areas:
• Good practices, advice and guidance
• Services to enable data sharing
• CARARE metadata schema
• Promoting re-use
http://www.carare.eu/
4. Steps on the way to CARARE
• A shared vision
• International collaborations on
heritage data (CIDOC, Arena,
Aquarelle, DARIAH, INSPIRE,
Europeana, etc.)
• Digitisation and use of digital
technologies
• GIS
• Technical infrastructures
A brief history
5. Who is collecting archaeological and architectural heritage data?
• State agencies
• inventories of protected sites, monuments and buildings
• conservation records, field investigations, surveys
• Museums – finds and excavation archives
• Research Institutions & researchers
• Libraries
Datasets
Image: Swedish National Heritage Board
6. CARARE and related projects have aggregated over 6 million digital
objects from 20+ countries for Europeana.eu
Many different types of object
• Inventory records, reports, photographs, drawings, books, videos, objects,
aerial photos, GIS datasets, 3D datasets, models, reconstructions, and more
Many different ways of recording objects
• Heritage agencies, museums, archives, libraries, researchers all have
different ways of describing objects
Many different languages, vocabularies, time periods and map systems
Rather diverse
7. Image: “Tournoi royal de motos à Londres changement d'une roue de side-car en marche”, 1932
(Royal motorcycle tournament in London: changing a sidecar wheel while in motion)
Agence de presse Mondial Photo-Presse.
We work with
the metadata
that’s provided
8. CARARE defined a metadata model for metadata aggregation
• Standards based: CIDOC core standards, MIDAS Heritage, LIDO and EDM
• Distinguishes between “heritage assets” (monument, building, painting, book,
image, film, 3D) and digital representations found online
• Allows for events (field activities, lab work) and collections
• Supports objects that are composed of other objects (complexes and
hierarchies)
• Is rich where the domain calls for it (e.g. time, space, monument character)
The schema mediates between native data (exports) and enables
their transformation into a common format
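The model described above can be pictured as a small set of linked classes. The following is a minimal sketch, not the CARARE schema itself: the class and field names (`HeritageAsset`, `DigitalResource`, `Event`, `parts`) are hypothetical simplifications chosen for illustration, and the URLs are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalResource:
    """A digital representation found online (photo, report, 3D model)."""
    url: str
    media_type: str


@dataclass
class Event:
    """A field activity or lab work connected to an asset."""
    name: str
    event_type: str


@dataclass
class HeritageAsset:
    """A monument, building or object; it may be composed of other assets."""
    name: str
    asset_type: str
    representations: list = field(default_factory=list)
    events: list = field(default_factory=list)
    parts: list = field(default_factory=list)

    def all_representations(self):
        """Collect representations of this asset and of every nested part."""
        found = list(self.representations)
        for part in self.parts:
            found.extend(part.all_representations())
        return found


# Invented example: a complex made of one component asset.
kiln = HeritageAsset("Kiln 1", "kiln")
kiln.representations.append(DigitalResource("http://example.org/kiln1.jpg", "image"))
site = HeritageAsset("Example site", "complex", parts=[kiln])
site.representations.append(DigitalResource("http://example.org/plan.pdf", "text"))
site.events.append(Event("Survey", "field activity"))
print(len(site.all_representations()))
```

Keeping the asset separate from its representations means one monument can carry many photographs or reports without repeating its description.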
Combining datasets
9. Let’s see an example
MINT
• Metadata mapping (from
native to target schema)
• Preview
• Statistics
• Transformation (to target
schema(s))
Rijksdienst voor het Cultureel Erfgoed:
Rijksmonumenten
10. Making connections
Relationships between the main CARARE classes:
• Heritage asset, digital resources and events
[Diagram: a heritage asset linked to its digital resources by “has representation”; other link labels: “is related”, “Has Met”]
Images: Instituto Universitario de Investigación en Arqueología Ibérica
“Hornos de Peal, Jaén”
11. Enriching metadata during mapping
Heritage asset
Images: Instituto Universitario de Investigación en Arqueología Ibérica
“Hornos de Peal, Jaén”
Adding constants: LOD AAT concepts
<car:heritageAssetType>http://vocab.getty.edu/aat/300054328</car:heritageAssetType>
<car:heritageAssetType>http://vocab.getty.edu/aat/300000810</car:heritageAssetType>
<car:heritageAssetType>http://vocab.getty.edu/aat/300305500</car:heritageAssetType>
Language identification
<car:heritageAssetType lang="es">Necrópolis</car:heritageAssetType>
Mapping the metadata gives an opportunity to
make some simple enrichments, by adding:
• Language of the metadata
• Name of the provider
• Country of provider
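As an illustration of this kind of enrichment, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes a namespace URI for the `car` prefix and a hypothetical wrapper element `heritageAsset`; neither is confirmed by the slides, and a real mapping tool such as MINT would do this through its own interface.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "car" prefix; check it against the schema in use.
CAR_NS = "http://www.carare.eu/carareSchema"
ET.register_namespace("car", CAR_NS)
TYPE_TAG = f"{{{CAR_NS}}}heritageAssetType"


def enrich_asset(asset, type_uris, language):
    """Label untagged text values with a language, then add constant URIs."""
    # Existing free-text values without a language attribute get one.
    for element in asset.findall(TYPE_TAG):
        if "lang" not in element.attrib:
            element.set("lang", language)
    # One new element per constant Linked Open Data concept URI.
    for uri in type_uris:
        ET.SubElement(asset, TYPE_TAG).text = uri
    return asset


# Hypothetical wrapper element holding one native value.
asset = ET.Element(f"{{{CAR_NS}}}heritageAsset")
ET.SubElement(asset, TYPE_TAG).text = "Necrópolis"

enrich_asset(asset, ["http://vocab.getty.edu/aat/300054328"], "es")
print(ET.tostring(asset, encoding="unicode"))
```

The URI elements are added after the language labelling step on purpose, so that concept URIs are not tagged with a language.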
12. There’s a difference between doing a schema mapping and a mapping to
transform real data.
Data issues can include:
• Data that doesn’t conform entirely to the scope of an element
• Multiple values within a single element (separators)
• Placeholder data inserted in mandatory elements (e.g. “n/a”)
• Lack of unique values
A good mapping can address some of these issues, e.g. by splitting
multiple subject concepts into separate elements.
(issues can be fixed at source, but this can be time consuming with datasets that
include hundreds of thousands of records).
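A mapping step that handles two of the issues above (multiple values in one element, and placeholder values in mandatory elements) might look like the sketch below. The separator characters and the list of placeholder strings are assumptions; real datasets need to be inspected first.

```python
import re

# Placeholder strings sometimes entered in mandatory elements (assumed list).
PLACEHOLDERS = {"", "n/a", "na", "unknown", "-"}
# Separators assumed for this sketch: semicolon or pipe.
# A slash is deliberately excluded so that "n/a" is not split in two.
SEPARATORS = re.compile(r"\s*[;|]\s*")


def split_values(raw):
    """Split one native field into clean, de-duplicated values."""
    values = []
    for part in SEPARATORS.split(raw.strip()):
        if part.lower() in PLACEHOLDERS:
            continue  # drop placeholders rather than publish them
        if part not in values:
            values.append(part)
    return values


print(split_values("necropolis; burial mound | necropolis"))
print(split_values("n/a"))
```

Each returned value can then be written to its own element in the target schema.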
Quality issues
13. Transformation: some semantic gains
Through transformation to a
common schema, we achieve
interoperability between
disparate datasets
Enabling cross searches
(what, when, where, who)
Open licensing of the
metadata and APIs enables
reuse in various applications
http://eculturemap.eculturelab.eu/eCulture14m/Map.html?
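To make the interoperability gain concrete, the sketch below maps two differently structured native records onto shared what/when/where/who fields and then searches across both. The provider names, native field names and record values are invented for illustration and do not come from any real dataset or from the CARARE schema.

```python
# Hypothetical mappings: common field -> native field, per provider.
MAPPINGS = {
    "agency_a": {"what": "monument_type", "when": "period",
                 "where": "municipality", "who": "owner"},
    "museum_b": {"what": "object_name", "when": "dating",
                 "where": "findspot", "who": "collector"},
}


def to_common(provider, native):
    """Transform one native record into the common what/when/where/who form."""
    record = {target: native.get(source)
              for target, source in MAPPINGS[provider].items()}
    record["provider"] = provider
    return record


def search(records, **criteria):
    """Return records where every criterion occurs in the field (case-insensitive)."""
    return [
        record for record in records
        if all(value.lower() in (record.get(name) or "").lower()
               for name, value in criteria.items())
    ]


# Invented sample records.
records = [
    to_common("agency_a", {"monument_type": "Necropolis", "period": "Iron Age",
                           "municipality": "Jaén"}),
    to_common("museum_b", {"object_name": "Urn", "dating": "Iron Age",
                           "findspot": "Near Jaén", "collector": "Unknown"}),
]
for hit in search(records, where="jaén", when="iron age"):
    print(hit["provider"], hit["what"])
```

Neither native dataset has a field called "where", yet one query now reaches both.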
14. • Metadata mapping is rarely easy
• Metadata models are complex with subtle differences in world view
• Statistical metrics can reveal diverging recording practices and other
quality issues
• Native metadata is designed to serve specific purposes
• Local context, audiences and questions
• Merging metadata from various organisations in different
countries/languages poses special challenges
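One simple statistical metric of the kind mentioned above is field completeness per provider. The sketch below is an assumed, generic calculation over records in a common format; the provider names and sample records are invented.

```python
from collections import defaultdict


def completeness(records, fields):
    """Per provider, return the share of records with a non-empty value per field."""
    totals = defaultdict(int)
    filled = defaultdict(lambda: defaultdict(int))
    for record in records:
        provider = record["provider"]
        totals[provider] += 1
        for name in fields:
            if record.get(name):
                filled[provider][name] += 1
    return {
        provider: {name: filled[provider][name] / totals[provider] for name in fields}
        for provider in totals
    }


# Invented sample: provider A records dates only half of the time.
sample = [
    {"provider": "A", "what": "kiln", "when": "Roman"},
    {"provider": "A", "what": "tomb", "when": ""},
    {"provider": "B", "what": "urn", "when": "Iron Age"},
]
print(completeness(sample, ["what", "when"]))
```

Large gaps between providers on the same field are a hint that recording practices differ and that a mapping may need attention.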
Some challenges
15. Aggregators like CARARE enable transformation of metadata into a
common model and have some services to enable further work
• Language labelling
• Adding Linked Open Data
• Automatic enrichment
• Crowdsourcing
Aggregating and enriching
MORe
16. One of the big challenges in searching across datasets in Europe is
dealing with data in different languages
Linguistic resources and translation tools are increasingly available, but to
work they first need to know which language the text is in
Language labels are often missing
Language identification and labelling microservices
Interfaces, displays and search services can adapt to users’ preferred
language and in this way return results which are relevant but which have
been catalogued in unfamiliar languages.
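The idea behind language identification can be shown with a toy example that counts common function words. This is only a sketch: it is not the method used by the CARARE microservices, the word lists are tiny and hand-picked, and a production service would rely on a trained language identification model.

```python
import re

# Tiny hand-picked function-word lists (an assumption for this sketch).
STOPWORDS = {
    "en": {"the", "of", "and", "in", "with", "a"},
    "es": {"el", "la", "de", "y", "en", "los", "con"},
    "fr": {"le", "la", "de", "et", "les", "des", "une", "en"},
}


def guess_language(text, min_hits=2):
    """Return a language code, or None when the evidence is weak or tied."""
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(word in stop for word in words)
              for lang, stop in STOPWORDS.items()}
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] < min_hits or ranked[0] == ranked[1]:
        return None  # leave unlabelled rather than guess
    return max(scores, key=scores.get)


print(guess_language("The necropolis of the town and the kilns"))
print(guess_language("Necrópolis"))
```

Short values such as a single term give too little evidence, which is why the function returns `None` instead of guessing; in practice such values often inherit the language declared for the whole dataset.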
Why add language information to data?
17. CARARE microservices include:
• Natural language processing techniques to enable subject concepts
and names to be extracted from text
• Geocoding services to add coordinates for named places
• Vocabulary matching services
• Geo conversion, inversion and normalization services
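A vocabulary matching service can be approximated by normalising labels and looking them up in a controlled list. The sketch below assumes a small in-memory vocabulary with placeholder `example.org` concept URIs; it does not reproduce the CARARE services or any real thesaurus.

```python
import unicodedata


def normalise(label):
    """Lower-case, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", label)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


# Hypothetical vocabulary: normalised label -> concept URI (placeholder URIs).
VOCABULARY = {
    "necropolis": "http://example.org/concept/necropolis",
    "kiln": "http://example.org/concept/kiln",
    "horno": "http://example.org/concept/kiln",
}


def match_concept(label):
    """Return the concept URI for a label, or None if there is no match."""
    return VOCABULARY.get(normalise(label))


print(match_concept("Necrópolis"))
print(match_concept("castle"))
```

Mapping labels in several languages to the same concept URI is what lets a search for one term find records catalogued under another.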
Automated enrichment
18. Location case study
• Location is important for archaeology but place information is often
missing, especially for content from library, archive and museum
collections
• Automated extraction techniques can identify place names in data, but
place names are not unique
• The process requires quality control
• Crowdsourcing is one way of harnessing the knowledge of individuals
to check the results of automated enrichment and place objects
correctly on the map
• One such service was developed by the LoCloud project
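The need for quality control follows directly from ambiguous place names. The sketch below shows one way to route ambiguous matches to human review; the gazetteer is a hypothetical in-memory table with approximate, illustrative coordinates, and it is not the LoCloud service.

```python
# Hypothetical gazetteer: lower-cased name -> candidate places with
# approximate (latitude, longitude) values for illustration only.
GAZETTEER = {
    "jaén": [("Jaén, Spain", 37.77, -3.79), ("Jaén, Peru", -5.71, -78.81)],
    "uppsala": [("Uppsala, Sweden", 59.86, 17.64)],
}


def geocode(place_name):
    """Look up a place name and flag whether a person should check the result."""
    candidates = GAZETTEER.get(place_name.strip().lower(), [])
    if len(candidates) == 1:
        status = "accepted"
    elif candidates:
        status = "needs_review"  # several places share this name
    else:
        status = "unmatched"
    return {"query": place_name, "status": status, "candidates": candidates}


for name in ["Uppsala", "Jaén", "Atlantis"]:
    result = geocode(name)
    print(result["query"], result["status"], len(result["candidates"]))
```

Records marked `needs_review` are the ones a crowdsourcing interface would present to volunteers for placement on the map.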
Crowdsourcing
21. Is it big data?
• Volume – 2-4 million assets aggregated by CARARE
• Includes the national heritage inventories for several
countries, which are individually quite large datasets
• Europeana includes another 1 million+ assets relevant for
archaeology aggregated by other projects
• Includes museum and library collections, film archives,
newspaper reports
• Quite big?
• New research would be great!