This presentation gives an overview of: 1) Fedora Commons, 2) its current use by CLARIN B centres, and 3) the new TLA/FLAT setup that meets the CLARIN B centre requirements using the Fedora Commons/Islandora stack.
Fedora Commons in the CLARIN Infrastructure
1. Fedora Commons in the CLARIN Infrastructure
Menzo Windhouwer
menzo.windhouwer@meertens.knaw.nl
Meertens Institute, TLA, CLARIN ERIC
2. Overview
1. An overview of Fedora Commons (3.8.1)
2. Current usage by CLARIN centres
3. TLA-FLAT: a CLARIN compatible repository solution based on Fedora
Commons and Islandora
4. Fedora Commons
• fedora-commons.org
• 300 registered installations
• 1997: started as a research project at Cornell University
• Implemented as a Java servlet
• 2009: joined the DSpace foundation (now DuraSpace)
• 2014: Fedora Commons 4 released
• More RDF-based
• Not backward compatible in functionality, e.g., APIs
• Data migration utilities available
• 2015: last Fedora Commons 3 release (3.8.1)
• wiki.duraspace.org/display/FEDORA38/
• github.com/fcrepo3
• focus
5. Fedora Commons main features
• Digital Objects
• Content Model Architecture (FOXML)
• Datastreams
• Relationships between Digital Objects (RDF)
• APIs (REST/SOAP)
• Access
• Management
• Security (XACML)
• Access control
• Policies
• Message queue
• OAI-PMH
• Replication & mirroring
• Versioning
• Checksums
7. Digital Objects - Model
“Fedora uses a "compound digital object" design
which aggregates one or more content items
into the same digital object. Content items can
be of any format and can either be stored locally
in the repository, or stored externally and just
referenced by the digital object. The Fedora
digital object model is simple and flexible so
that many different kinds of digital objects can
be created, yet the generic nature of the Fedora
digital object allows all objects to be managed in
a consistent manner in a Fedora repository.”
8. Digital Objects – Content Model Architecture
1. Data Object
• “Data objects are what we normally think
of when we imagine a repository storing
digital collections. Data objects can
represent such varied entities as images,
books, electronic texts, learning objects,
publications, datasets, and many other
entities.”
2. Content Model Object
• “[A]cts as a container for the Content
Model document which is a formal model
that characterizes a class of digital
objects.”
3. Service Definition Object
4. Service Deployment Object
9. Digital Objects - Datastreams
• “The content represented by a Datastream is treated as an opaque bit
stream; it is up to the user to determine how to interpret the content (i.e.
data or metadata).”
• Where does this bit stream live?
1. Internal XML Content
“the content is stored as XML in-line within the digital object XML file” (FOXML)
2. Managed Content
“the content is stored in the repository and the digital object XML maintains an internal
identifier that can be used to retrieve the content from storage”
3. Externally Referenced Content
“the content is stored outside the repository and the digital object XML maintains a URL that
can be dereferenced by the repository to retrieve the content from a remote location”
4. Redirect Referenced Content
“the content is stored outside the repository and the digital object XML maintains a URL that is
used to redirect the client when an access request is made”
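The four storage options above correspond to the one-letter CONTROL_GROUP attribute on a FOXML datastream (the FOXML example in this deck uses "X"). A minimal sketch of that mapping, where the enum and helper names are illustrative and not part of Fedora Commons:

```python
from enum import Enum


class ControlGroup(Enum):
    """Datastream control groups as written in FOXML's CONTROL_GROUP attribute."""
    INLINE_XML = "X"  # Internal XML Content: stored in-line in the FOXML file
    MANAGED = "M"     # Managed Content: stored by the repository itself
    EXTERNAL = "E"    # Externally Referenced Content: repository dereferences a URL
    REDIRECT = "R"    # Redirect Referenced Content: client is redirected to a URL


def bytes_live_in_repository(group: ControlGroup) -> bool:
    """Return True when the repository itself holds the bit stream."""
    return group in (ControlGroup.INLINE_XML, ControlGroup.MANAGED)


if __name__ == "__main__":
    for code in "XMER":
        group = ControlGroup(code)
        print(code, group.name, bytes_live_in_repository(group))
```

The distinction matters for backups and migration: only the first two groups are covered by copying the repository's own storage.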
10. Digital Objects - Relations
• Relationships between Digital Objects
• Collections, compounds, cross references, …
• Using the Fedora relationship ontology
• Domain specific relationships
• Encoded in RDF
• RELS-EXT: relations from the DO to other DOs or external resources
• RELS-INT: relations from datastreams in the DO to other resources
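As an illustration of a RELS-EXT relation, the sketch below builds the RDF/XML for a single isMemberOfCollection statement from the Fedora relationship ontology. The PIDs are made up for the example, and the function name is hypothetical; a real RELS-EXT datastream usually carries more statements (e.g., the content model).

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FEDORA_REL_NS = "info:fedora/fedora-system:def/relations-external#"


def rels_ext(pid: str, collection_pid: str) -> str:
    """Serialize one isMemberOfCollection relation as RDF/XML."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("fedora", FEDORA_REL_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    # The subject is the Digital Object itself, addressed as info:fedora/<PID>.
    description = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{pid}"})
    # The object is the collection the Digital Object belongs to.
    ET.SubElement(
        description, f"{{{FEDORA_REL_NS}}}isMemberOfCollection",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{collection_pid}"})
    return ET.tostring(rdf, encoding="unicode")


if __name__ == "__main__":
    # Both PIDs are hypothetical.
    print(rels_ext("lat:example_1", "lat:example_collection"))
```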
11. Digital Objects - FOXML
<foxml:digitalObject PID="lat:1839_00_0000_0000_0016_7E07_7" xmlns:foxml="info:fedora/fedora-system:def/foxml#" …>
<foxml:objectProperties>
<foxml:property NAME="info:fedora/fedora-system:def/model#state" VALUE="A"/>
<foxml:property NAME="info:fedora/fedora-system:def/model#label" VALUE="deerhunt"/>
</foxml:objectProperties>
<foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X">
<foxml:datastreamVersion ID="DC.0" FORMAT_URI="http://www.openarchives.org/OAI/2.0/oai_dc/"
MIMETYPE="text/xml" LABEL="Dublin Core Record for this object">
<foxml:xmlContent>
<oai_dc:dc …>
<dc:title>deerhunt story</dc:title>
<dc:description xml:lang="eng">The text was recorded at Madison University in the 1960s. The text was recorded indoors.</dc:description>
...
</oai_dc:dc>
</foxml:xmlContent>
</foxml:datastreamVersion>
</foxml:datastream>
<foxml:datastream ID="CMD" STATE="A" CONTROL_GROUP="X">
<foxml:datastreamVersion ID="CMD.0" LABEL="CMD Record for this object" MIMETYPE="application/x-cmdi+xml" …>
<foxml:xmlContent>
<cmd:CMD xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" CMDVersion="1.1" …>...</cmd:CMD>
</foxml:xmlContent>
</foxml:datastreamVersion>
</foxml:datastream>
…
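To make the FOXML structure concrete, here is a small sketch that lists the datastreams of a Digital Object with Python's standard library. The sample document is a cut-down, hypothetical variant of the record above (made-up PID, empty datastream versions), and it assumes datastream versions are listed oldest first.

```python
import xml.etree.ElementTree as ET

FOXML_NS = {"foxml": "info:fedora/fedora-system:def/foxml#"}

# Hypothetical, minimal FOXML: two inline XML datastreams without content.
SAMPLE_FOXML = """<foxml:digitalObject PID="lat:example_1"
    xmlns:foxml="info:fedora/fedora-system:def/foxml#">
  <foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X">
    <foxml:datastreamVersion ID="DC.0" MIMETYPE="text/xml"/>
  </foxml:datastream>
  <foxml:datastream ID="CMD" STATE="A" CONTROL_GROUP="X">
    <foxml:datastreamVersion ID="CMD.0" MIMETYPE="application/x-cmdi+xml"/>
  </foxml:datastream>
</foxml:digitalObject>"""


def list_datastreams(foxml_text: str):
    """Return (ID, control group, MIME type of the last listed version) per datastream."""
    root = ET.fromstring(foxml_text)
    result = []
    for datastream in root.findall("foxml:datastream", FOXML_NS):
        versions = datastream.findall("foxml:datastreamVersion", FOXML_NS)
        # Assumption: the last version element is the most recent one.
        mimetype = versions[-1].get("MIMETYPE") if versions else None
        result.append((datastream.get("ID"), datastream.get("CONTROL_GROUP"), mimetype))
    return result


if __name__ == "__main__":
    for row in list_datastreams(SAMPLE_FOXML):
        print(row)
```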
13. APIs (REST/SOAP)
• The ‘RESTful’ APIs provide easy HTTP URLs to access (API-A) objects
and their datastreams:
1. https://www.meertens.knaw.nl/flat/objects/lat:10744_1b9e0d44_ef4d_496c_8939_6129b5ee5b49/datastreams/CMD/content?asOfDateTime=2017-01-27T11:30:52.732Z
2. https://www.meertens.knaw.nl/flat/objects/lat:10744_792194f7_d1fd_400c_ab2b_9b51f4fe3907/datastreams/OBJ/content?asOfDateTime=2017-01-27T11:31:01.207Z
Used as redirect for a handle
Notice the use of a timestamp to refer to a specific version of the datastream
• API-M provides methods to update objects and their datastreams
• Access to API-M can be limited using repository wide XACML policies
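The URL pattern in the examples above can be assembled mechanically. The following sketch only builds the URL (it does not contact a server); the function name is illustrative, and the base URL, PID and timestamp are taken from the first example on this slide.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def datastream_content_url(base: str, pid: str, dsid: str,
                           as_of: Optional[str] = None) -> str:
    """Build an API-A style URL for the content of one datastream."""
    url = (f"{base.rstrip('/')}/objects/{quote(pid, safe=':')}"
           f"/datastreams/{quote(dsid)}/content")
    if as_of:
        # The timestamp pins the request to a specific datastream version.
        url += "?" + urlencode({"asOfDateTime": as_of}, safe=":")
    return url


if __name__ == "__main__":
    print(datastream_content_url(
        "https://www.meertens.knaw.nl/flat",
        "lat:10744_1b9e0d44_ef4d_496c_8939_6129b5ee5b49",
        "CMD",
        "2017-01-27T11:30:52.732Z"))
```

Because the timestamp is part of the URL, a handle that redirects to it keeps resolving to the same version even after the datastream is updated.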
14. Security (XACML)
• eXtensible Access Control Markup Language (XACML) is an OASIS standard to encode access
control policies
“Each XACML policy defines: (1) a "target" describes what the policy applies to (by referring to attributes of
users, operations, objects, datastreams, dates, and more), and (2) one or more "rules" to permit or deny access.”
A rather cryptic and bloated language
• Repository wide policies
• Access to API-M (methods) by certain users/roles from certain IP addresses
• …
• Object specific policies
• Which users can access which datastreams
• …
• User profiles
• Plug in any authfilter in the application server
• Hardcoded users
• …
15. Fedora Commons as a basis - extensions
• Faceted search: gsearch (Solr)
• Listens to the FC message queue
• Runs an XSLT to create a SOLR document
• OAI-PMH: Proai
• Occasionally queries FC's resource index
• Can deliver other metadata datastreams than the default Dublin Core
• …
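For the OAI-PMH side, a harvester asks for a specific metadata format through the metadataPrefix parameter. The sketch below builds such a ListRecords request URL. The base URL is a placeholder, and the "cmdi" prefix is an assumption about how a provider might be configured; "oai_dc" is the Dublin Core prefix every OAI-PMH provider has to support.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str,
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional selective harvesting
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    base = "https://repository.example.org/oaiprovider/"  # placeholder endpoint
    print(list_records_url(base, "oai_dc"))
    print(list_records_url(base, "cmdi"))  # assumed prefix for CMDI records
```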
16. Fedora Commons as a basis - frontends
• Islandora
• Drupal based
• Large set of modules, relatively easily extensible
• Still based on Fedora Commons 3
• Ongoing experiments/development, e.g., CLAW for Islandora
• Hydra
• Ruby on Rails based
• More hardcoded workflow and data models
• …
• Portland Common Data Model
• Common data model (content models) so migration between front-ends/frameworks
becomes easier
17. Overview
1. An overview of Fedora Commons (3.8.1)
2. Current usage by CLARIN centres
3. TLA-FLAT: a CLARIN compatible repository solution based on Fedora
Commons and Islandora
18. Repository solutions in use by CLARIN centres
[Bar chart: number of B centres (y-axis, 0 to 9) per repository solution: Fedora Commons, DSpace, custom, LAT, GIT, eSciDoc]
Repository info on 20 B centres in the Centre registry
Notes:
• Meertens: custom -> Fedora Commons
• MPI: LAT -> Fedora Commons
• eSciDoc: Fedora Commons under the hood
• Various C centres also run a Fedora Commons (based) repository
19. How happy are these centres with Fedora Commons?
• Sent out a questionnaire to 9 centres: 6 responses
• Do you (still) consider Fedora Commons a sustainable repository solution for your center? [Pie chart: yes / no]
• Would you advise new CLARIN centers to use Fedora Commons as (the basis for) their CLARIN-compatible repository solution? [Pie chart: yes / no / maybe]
Comments:
• If you are member of CLARIN-D then you probably might want to choose Fedora, but if you're in another country you might want to take a closer look at other solutions (DSpace or TLA software).👍🏻
• Depends partly on available technical expertise
20. Fedora Commons versions
Which version of Fedora Commons does your centre use in production?
[Bar chart: number of centres (y-axis, 0 to 2.5) per version: 3.6, 3.6.2, 3.7.1, 3.8.1, 4]
Do you plan a move to Fedora Commons 4? [Pie chart: yes / no / maybe]
Comments:
• benefit from Linked Open Data approach; within next 2 years
• We are migrating to version 4 right now. We also made major enhancements to our front-end. We are planning to go into production with it within the next months.
21. Size of the centre’s repositories
# Digital Objects:
ca. 150
2,500
3,038
10,000
33,000
# bytes:
ca. 125M (metadata only)
5G
16G
ca. 500G
Both MPI and Meertens currently have over 100,000 CMD records in the VLO,
which describe resources that take up several TB (and up to 1M DOs).
Experiments revealed problems in the FC area, but these can be repaired.
22. Community support
Each question was answered on the scale: not at all / somewhat / ok / very much
• How helpful was/is the documentation available within the Fedora Commons community?
• How helpful was/is the support by the Fedora Commons community?
• How helpful was/is the documentation on Fedora Commons by the CLARIN community?
• How helpful was/is the support for Fedora Commons within the CLARIN community?
Comments:
• Unfortunately there seem to be no more Fedora User Groups in Europe...
• Being one of the first centers to use Fedora Commons, we did use the documentation available within the FC community. At that time there was not much CLARIN documentation.
• This blog entry was very useful for us: http://asingh.com.np/blog/fedora-commons-installation-and-configuration-guide/
• an option for the case one has never made use of the support should have been included
23. Frontends
Do you use a front-end, e.g., Islandora, Hydra or your own, next to Fedora Commons?
[Bar chart: number of centres (y-axis, 0 to 3.5) per front-end: none, Islandora, custom]
Comments:
• own front-end, based on Django (EulFedora) and MySQL
• We developed our own, called Erdo
• The built-in user interface is not adequate. You will need to replace it with something better.
24. Additional advice
• “Let Apache httpd (or Apache Tomcat) take care for most of the
configuration (access control) and configure Fedora Commons to be
"open". Take care what to store in Fedora and what not (it can be very
unhandy to store too many data streams inside Fedora).”
• “I consider the two offered RDF query languages (SPARQL, ITQL) by
Fedora as insufficient, as both miss important features, e.g ITQL can't
use regexp search and can't sort strings numerically and SPARQL can't
use COUNT operator and also cannot sort strings numerically (at least
in version 3.6.2).”
• “For CMDI metadata, you also need the Proai OAI provider. Use the
version customised for Fedora Commons.”
25. Overview
1. An overview of Fedora Commons (3.8.1)
2. Current usage by CLARIN centres
3. TLA-FLAT: a CLARIN compatible repository solution based on Fedora
Commons and Islandora
26. FLAT’s predecessors
• The Language Archive (TLA) at the MPI for Psycholinguistics
• long history in digital archiving, especially resources on endangered languages
• home-built LAT (Language Archiving Technology)
• 2014 – now: preparing to switch to a stack that is largely based on off-the-shelf
software based on Fedora Commons + Islandora
• choice made after an INNET repository workshop and several pilots
• initial version based on scripts kindly provided by IDS
• started as EasyLAT, now known as (TLA-)FLAT (Fedora Language Archiving Technology)
• doing a lot of cleanup/curation along the way from LAT to FLAT
• The Meertens Institute
• collecting valuable (Dutch) (physical) humanities resources for over a century
• digitization projects
• digital born resources
• KNAW participates in TLA and the Meertens Institute teamed up with the MPI to
modernize its setup and develop FLAT
28. TLA-FLAT base line
• Meet the technical CLARIN B centre requirements
• Meet the technical Data Seal of Approval (DSA) requirements
• Meet organization specific requirements
• Meet at least the CLARIN B centre and DSA requirements, as much as
possible, with the Fedora Commons backend
• frontends (and their technology) come and go quickly
• How far can we get using available components, configuration and a
limited level of tailor-made software?
• Mainly to add support for CMDI
• Start with Fedora Commons 3.8.x and Islandora 7.x-1.x, move along with
the Islandora community to Fedora Commons 4
29. Islandora 7.x-1.x
• islandora.ca
• An open-source software framework designed to help institutions and organizations and their audiences collaboratively
manage, and discover digital assets using a best-practices framework.
• Islandora was originally developed by the University of Prince Edward Island's Robertson Library, but is now implemented and
contributed to by an ever-growing international community.
• Built on a base of Drupal (7.x), Fedora (3.x), and Solr, Islandora releases solution packs which empower users to work with
data types (such as image, video, and pdf) and knowledge domains (such as Chemistry and the Digital Humanities).
Solution packs also often provide integration with additional viewers, editors, and data processing applications.
• wiki.duraspace.org/display/ISLANDORA/Islandora
• github.com/Islandora
• github.com/Islandora-Labs/islandora_awesome
• github.com/discoverygarden
• Digital Objects are not Drupal nodes; the Islandora modules interact with Fedora Commons via an intermediate (PHP)
layer, Tuque
• In CLAW Digital Objects are Drupal nodes synchronized using Apache Camel
30. CLARIN B centre requirements
• [CLARIN-B-2] Centres need to adhere to the security guidelines, i.e. the
servers need to have accepted certificates.
• [CLARIN-B-3] Centres need to join the national identity federation where
available and join the CLARIN service provider federation to support single
identity and single sign-on operation based on SAML 2.0 and trust
declarations.
• [CLARIN-B-5] Centres need to offer component based metadata (CMDI)
that make use of elements from accepted registries such as ISOcat in
accordance with the CLARIN agreements, i.e. metadata needs to be
harvestable via OAI-PMH.
• [CLARIN-B-6] Centres need to associate PIDs records according to the
CLARIN agreements with their objects and add them to the metadata
record.
31. DSA requirements
• [DSA-10] The data repository enables the users to discover and use
the data and refer to them in a persistent way.
• [DSA-11] The data repository ensures the integrity of the digital
objects and the metadata.
• [DSA-12] The data repository ensures the authenticity of the digital
objects and the metadata.
• [DSA-13] The technical infrastructure explicitly supports the tasks and
functions described in internationally accepted archival standards like
OAIS.
32. Meertens Institute & TLA requirements
• [Home-1] The repository should support arbitrarily deep collection hierarchies.
• [Home-2] The repository should support handles as persistent identifiers.
• [Home-3] The repository should work with arbitrary CMDI profiles.
• [Home-4] The repository should provide resource level access control.
• [Home-5] The repository should allow collection management to review submissions before the
resources are actually ingested.
• [Home-6] The repository should allow system management to determine the location of
resources on persistent storage, e.g., from fast access times to secure tape drives.
• [Home-7] The repository should allow the storage of arbitrary relationships between data sets.
• [Home-8] The repository should provide entry points for interaction with Virtual Research
Environments.
• [Home-9] The repository should allow for collection management oriented metadata, which
might not be public.
33. FLAT’s place at the Meertens Institute
[Architecture diagram. Components: Drupal, Islandora, Fedora Commons, Deposition Service (DoorKeeper), SIP, AIP, Workspace (ownCloud), Virtual Research Environment, Persistent storage, SOLR (MTAS), Backups (EUDAT), Collection Management, Infrastructures (CLARIN). Interfaces: SWORD, CMDI SP, OAI-PMH.]
34. FLAT’s place at the MPI/TLA
[Architecture diagram. Components: Drupal, Islandora, Fedora Commons, Deposition Service (DoorKeeper), SIP, AIP, Workspace (ownCloud), Deposition UI, Persistent storage, Backups (DANS), Infrastructures (CLARIN). Interfaces: SWORD, CMDI SP, OAI-PMH.]
35. FLAT modules
• Core
• Fedora Commons and Islandora setup
• CMDI Solution Pack
• CMD to FOXML conversion
• Proai setup
• Indexing (SOLR)
• gsearch-based solution for CMDI
• Meertens’ CMDI indexer
• SWORD 2.0
• Reuses a deposit via SWORD approach and implementation by DANS
• DoorKeeper
• Deposition UI
• IMDI conversion
• Shibboleth
Shibboleth setup is very server specific, so there is a module that illustrates the Drupal setup and can be combined with a test IdP.
36. CMDI Solution Pack
• Registers a metadata renderer in Islandora
• Triggers when a Digital Object uses the CMDI content model and
renders the CMD datastream
• The default render XSLT can be overridden by profile specific XSLTs
• Not FLAT specific, i.e., could be reused outside of FLAT
37. Archival Information Package (AIP)
[Diagram: an example AIP as a graph of Digital Objects and their datastreams.
Objects: Collection (DC); Collection + CMDI (CMD, RELS-EXT, DC); Collection + Compound + CMDI (CMD, RELS-EXT, DC); Compound + CMDI (CMD, RELS-EXT, DC); Image (OBJ, RELS-EXT, DC); Video (OBJ, RELS-EXT, DC); one further resource object (OBJ, RELS-EXT, DC).
Relations between objects: isMemberOfCollection, isConstituentOf. Links to content: contentLocation.]
FLAT reuses a lot of Islandora’s content models so rendering is easy. And they can be easily taken along without Islandora.
38. FLAT’s DoorKeeper
• A configurable chain of actions that
• Validate the CMDI, also according to centre specific requirements
• Check the validity of resources against preferred formats (FITS)
• Assess metadata quality
• Offer the SIP for evaluation to collection management
• Move new resources from a temporary workspace into persistent locations
• Expand WebACL to XACML
• Version management
• Assign and create handles (EPIC)
• Interact with Fedora Commons’ API-M
• Trigger indexing
• Create backup bags (for DANS or EUDAT)
• Create user- and developer-oriented logs
• Interaction via a REST API or the command line
• Uses dynamic class loading, i.e., easily extensible with centre specific actions
• Not too FLAT specific, e.g., usable by other repository setups, or Fedora could be replaced by DSpace
Actions are, in general, lean and mean, so it's relatively easy to implement one in Java.
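The "configurable chain of actions" idea can be sketched as follows. This is not the DoorKeeper's actual code or API (the DoorKeeper is written in Java); the Sip class, the two actions and run_chain are hypothetical stand-ins that only show the pattern: each action inspects the SIP, logs what it did, and can stop the chain.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Sip:
    """Hypothetical stand-in for a Submission Information Package."""
    record: str                 # the CMD record as text
    resources: List[str]        # resource file names
    log: List[str] = field(default_factory=list)


Action = Callable[[Sip], bool]  # an action returns False to stop the chain


def validate_metadata(sip: Sip) -> bool:
    """Toy check: the record must not be empty (real validation is far stricter)."""
    ok = bool(sip.record.strip())
    sip.log.append("validate_metadata: " + ("ok" if ok else "empty record"))
    return ok


def check_formats(sip: Sip) -> bool:
    """Toy check against an assumed list of preferred file extensions."""
    allowed = (".pdf", ".txt", ".wav")
    bad = [name for name in sip.resources if not name.lower().endswith(allowed)]
    sip.log.append("check_formats: " + ("ok" if not bad else "rejected " + ", ".join(bad)))
    return not bad


def run_chain(sip: Sip, actions: List[Action]) -> bool:
    """Run the actions in order and stop at the first one that fails."""
    for action in actions:
        if not action(sip):
            return False
    return True


if __name__ == "__main__":
    sip = Sip(record="<CMD/>", resources=["my comic.pdf", "secret.txt"])
    print(run_chain(sip, [validate_metadata, check_formats]), sip.log)
```

Keeping each action small and independent is what makes a centre specific action cheap to add: it only has to honour the same call signature.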
39. Submission Information Package (SIP)
• A CMD record referring with
• relative paths to resources within the package
• absolute paths to resources already on the server
• For example, in the user’s ownCloud data directory
• (block access to system files!)
• Additional files
• Access control
• License
• …
• When using the SWORD 2.0 interface these are put in a bag and zipped for
upload
• The SWORD interface allows upload in parts
+-test-sip/
  +-bag-info.txt
  +-bagit.txt
  +-data/
  | +-metadata/
  | | +-policy.n3
  | | +-record.cmdi
  | +-resources/
  |   +-my comic.pdf
  |   +-secret.txt
  +-manifest-md5.txt
  +-tagmanifest-md5.txt
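A bag with the layout above can be produced with the standard library alone. The sketch below is an assumption about how one might package a SIP by hand; it is not FLAT's own tooling, make_bag and its parameters are hypothetical, and the BagIt version and bag-info field are illustrative choices.

```python
import hashlib
import tempfile
import zipfile
from datetime import date
from pathlib import Path


def md5_of(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()


def write_manifest(bag: Path, name: str, files) -> None:
    """Write '<md5>  <path relative to the bag>' lines for the given files."""
    lines = [f"{md5_of(f)}  {f.relative_to(bag).as_posix()}" for f in sorted(files)]
    (bag / name).write_text("\n".join(lines) + "\n", encoding="utf-8")


def make_bag(bag: Path, record: str, policy: str, resources: dict) -> Path:
    """Create the SIP bag directory and return the path of the zipped bag."""
    metadata = bag / "data" / "metadata"
    payload_dir = bag / "data" / "resources"
    metadata.mkdir(parents=True)
    payload_dir.mkdir(parents=True)
    (metadata / "record.cmdi").write_text(record, encoding="utf-8")
    (metadata / "policy.n3").write_text(policy, encoding="utf-8")
    for name, content in resources.items():
        (payload_dir / name).write_bytes(content)
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8")
    (bag / "bag-info.txt").write_text(
        f"Bagging-Date: {date.today().isoformat()}\n", encoding="utf-8")
    # Payload manifest first, then the tag manifest that also covers it.
    write_manifest(bag, "manifest-md5.txt",
                   [p for p in (bag / "data").rglob("*") if p.is_file()])
    write_manifest(bag, "tagmanifest-md5.txt",
                   [bag / "bagit.txt", bag / "bag-info.txt", bag / "manifest-md5.txt"])
    zip_path = bag.parent / (bag.name + ".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(bag.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(bag.parent).as_posix())
    return zip_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        zipped = make_bag(Path(tmp) / "test-sip", "<CMD/>", "# access rules",
                          {"my comic.pdf": b"%PDF-", "secret.txt": b"hush"})
        with zipfile.ZipFile(zipped) as archive:
            print("\n".join(archive.namelist()))
```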
40. Security
• To hide the intricacies of XACML and design choices for content
models we use WebACL to specify the access rules for a SIP
@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
# make a specific resource (identified by the ID of the ResourceProxy) in the SIP accessible to a specific user
[acl:accessTo <sip#h1>; acl:mode acl:Read; acl:agent <#other1>].
# a colleague
<#other1> a foaf:Person ;
foaf:account [foaf:accountServiceHomepage <#flat>; foaf:accountName "sarah@meertens.knaw.nl"].
# give the owner read and write access
[acl:accessTo <sip>; acl:mode acl:Read, acl:Write; acl:agent <#owner>].
# the owner
<#owner> a foaf:Person ;
foaf:account [foaf:accountServiceHomepage <#flat>; foaf:accountName "bob@meertens.knaw.nl"].
shortcuts
Shibboleth EPPNs
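A policy file in the shape shown above can be generated from a simple description of who may read what. The sketch below is only a string template; webacl_policy and its parameters are hypothetical, account names are not escaped, and the output follows the slide's example rather than any documented FLAT interface.

```python
PREFIXES = (
    "@prefix acl: <http://www.w3.org/ns/auth/acl#> .\n"
    "@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n"
)


def person(node: str, account_name: str) -> str:
    """Describe an agent by its account name (e.g., a Shibboleth EPPN)."""
    return (
        f"<#{node}> a foaf:Person ;\n"
        f"  foaf:account [foaf:accountServiceHomepage <#flat>; "
        f'foaf:accountName "{account_name}"].\n'
    )


def webacl_policy(owner: str, readers: dict) -> str:
    """Owner gets read and write access to the SIP; readers get read access per resource.

    `readers` maps a ResourceProxy ID to the account name allowed to read it.
    """
    parts = [PREFIXES]
    parts.append(
        "[acl:accessTo <sip>; acl:mode acl:Read, acl:Write; acl:agent <#owner>].\n")
    parts.append(person("owner", owner))
    for index, (resource_id, account_name) in enumerate(readers.items(), start=1):
        node = f"other{index}"
        parts.append(
            f"[acl:accessTo <sip#{resource_id}>; acl:mode acl:Read; "
            f"acl:agent <#{node}>].\n")
        parts.append(person(node, account_name))
    return "\n".join(parts)


if __name__ == "__main__":
    print(webacl_policy("bob@meertens.knaw.nl", {"h1": "sarah@meertens.knaw.nl"}))
```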
41. CMDI indexing for faceted search
1. gsearch for CMDI
• Based on an XSLT that processes the FOXML
• FLAT generates an XSLT for the (internal) CMD datastream
• Based on the profiles in your CMD records
• And a VLO-like mapping
• Facet = VLO facet
• Facet = concept
• Facet = hard coded XPath
• Only the configured facets will be available
• Can also be used for the required CMD to DC mapping
• Allows running FLAT for your CMD records out-of-the-box
2. Meertens CMD indexer
• Analyzes the profiles in your CMD records
• Creates facets for all semantic paths it finds
• Facet names based on concept links (plus context)
• At runtime switch between facets for querying and rendering
Includes indexing of collection and compound relationships. Islandora can use the SOLR for this instead of the resource index (by default Mulgara), which is needed in case of large collections/compounds.
Replacing Mulgara by another triple store, e.g., Blazegraph, is even better, but requires all components to use SPARQL instead of ITQL.
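The "Facet = hard coded XPath" case can be illustrated in a few lines. FLAT does this with a generated XSLT inside gsearch; the sketch below only mimics the idea in Python. The facet configuration, the sample record (a made-up profile using the CMDI 1.1 namespace) and the "PID" field name are all assumptions for the example.

```python
import xml.etree.ElementTree as ET

CMD_NS = {"cmd": "http://www.clarin.eu/cmd/"}

# Hypothetical facet configuration: facet name -> path into the CMD record.
FACETS = {
    "title": ".//cmd:Title",
    "language": ".//cmd:Language",
}

# Hypothetical CMD record with a made-up profile.
SAMPLE_CMD = """<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/" CMDVersion="1.1">
  <cmd:Components>
    <cmd:Story>
      <cmd:Title>deerhunt story</cmd:Title>
      <cmd:Language>eng</cmd:Language>
      <cmd:Language>nld</cmd:Language>
    </cmd:Story>
  </cmd:Components>
</cmd:CMD>"""


def solr_document(pid: str, cmd_xml: str, facets: dict = FACETS) -> dict:
    """Collect facet values from a CMD record into a flat, Solr-like document."""
    root = ET.fromstring(cmd_xml)
    document = {"PID": pid}
    for facet, path in facets.items():
        values = [element.text.strip()
                  for element in root.findall(path, CMD_NS)
                  if element.text and element.text.strip()]
        if values:  # only configured facets that have values end up in the index
            document[facet] = values
    return document


if __name__ == "__main__":
    print(solr_document("lat:example_1", SAMPLE_CMD))
```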
42. Deposition UI
• Drupal/Islandora module
• Create a project
• Upload a CMD record
• Or create a new one using a form
• Upload resources
• Via a project specific ownCloud data directory
• dropbox-like functionality
• possibility to link with other providers (dropbox, google drive, ….)
• no need to worry about uploading ‘big’ files
• Freeze a project
• Validate the SIP using the DoorKeeper (async)
• Deposit a valid project
• Validate and deposit the SIP using the DoorKeeper (async)
43. New vs legacy data
• New data goes via the DoorKeeper so it's checked against the centre's
policies!
• Legacy (meta)data can be bulk loaded into Fedora Commons:
• Convert IMDI to CMDI (optional)
• Create FOXML for CMD records and resources
• ResourceProxies should contain the local paths to resources, e.g., via @lat:localURI
• Bulk load into Fedora Commons
• Index for facetted search
• Update handles
• EPICify (github.com/meertensinstituut/EPICify)
Scripts available, but need to be generalized.
44. Branding
• Drupal has extensive facilities for styling and templating
• Drupal has many modules and blocks for additional functionality
• Islandora as well, and also offers solution packs
• During FOXML creation resource specific content models can be used
• Take care, after bulk import or via a DoorKeeper action, that needed
derivatives are created
• Enable solution pack specific viewers
• Some experiments have been done
• FLAT comes with a basic style, but the MPI/TLA and Meertens
instances look very different
46. Where are we?
• Set of Docker images that extend each other to build up a complete
solution for a:
• Read only interface for bulk loaded existing (meta)data (master)
• Upload of new data via the DoorKeeper (develop)
• Update metadata resource proxies in the CMDI collection hierarchy
• User audit trails and checksums for big files
• Updating existing data via the DoorKeeper
• Versioning
• Ongoing cleanup and enrichment of (legacy) metadata and resources,
e.g., controlled vocabularies, license information
In production at the Meertens Institute (www.meertens.knaw.nl/flat), and we are continuously moving cleaned (meta)data from the old setup to FLAT. CLARIN B certification based on FLAT has started.
Currently being connected to the Meertens Institute's questionnaire system.
Docker: a containerization platform that allows easy development, testing and deployment.
47. FLAT is moving
• github.com/TheLanguageArchive/FLAT
• Its birthplace, but FLAT is moving to
• github.com/TLA-FLAT
• Code can be more clearly split over multiple repositories
• DoorKeeper
• Bundles of actions
• Servlet wrapper
• CMDI Solution Pack
• …
• Docker setups
• finer granularity
• Place for cooperation on
• code
• configuration
• actions
• knowledge sharing
• Q&A, issues
A Dockerfile precisely describes what software to install and how to configure it to get a running system.
Fedora Commons, Islandora and Drupal documentation is sometimes hard to find/read and the full stack has many layers and corners. We can share our experience CLARIN-wide.
49. Conclusions
• Fedora Commons (3.8.1) provides much of the basic functionality
needed by a CLARIN B centre
• Fedora Commons has a proven record of being a stable and
satisfactory repository solution for many existing CLARIN centres
• Transition from version 3 to 4 is starting to happen
• TLA-FLAT is a modular CLARIN-compliant Fedora Commons-based
solution that is easy to get started with, and a platform to share knowledge on
running a Fedora Commons repository and its context
50. Thanks!
Questions?
now or later
menzo.windhouwer@meertens.knaw.nl
Please visit
github.com/TheLanguageArchive/FLAT
github.com/TLA-FLAT
TLA-FLAT team
MI: Marc Kemps-Snijders, Menzo Windhouwer, Rob Zeeman, Bas van der Veen
MPI: André Moreira, Daniel von Rhein, Paul Trilsbeek, Guilherme Silva