#3 INTEROPERABLE covers: -- an overview of the three INTEROPERABLE principles, which cover formal languages for knowledge representation, FAIR vocabularies, and qualified references to other (meta)data. -- resources to support institutional awareness and uptake of the Interoperable principles
Speakers:
1) Keith Russell, ANDS, provides an overview of the key components of Interoperability
2) Simon Cox and Jonathan Yu (CSIRO) present on how they have made the research data in the OzNome project Interoperable, not only for humans, but also for machines
Full YouTube recording: https://youtu.be/MeFl9WrtG20
Transcript: FAIR-3-I-for-Interoperable-13-9-17
[Unclear] words are denoted in square brackets.
FAIR Data webinar series #3:
I for Interoperable – ANDS Webinar
13 September 2017
Video & slides available from ANDS website
START OF TRANSCRIPT
Keith Russell: My name's Keith Russell, I work for the Australian National Data
Service, I am your host for today. My colleague, Susannah Sabine, is
behind the scenes co-hosting the webinar with me. Just the usual
little bit of background: the Australian National Data Service works
with research organisations around Australia to establish trusted
partnerships, reliable services, and enhanced capability in
the research sector. We work together with two other NCRIS funded
projects - Research - RDS, Research Data Services, and Nectar - to
create an aligned set of joint investments to deliver transformation in
the research sector.
So this webinar is part of a series of activities we are undertaking
which aim to support the Australian research community in increasing
our ability to manage our research data as a national asset. So as I
mentioned earlier, this is the third in a series of webinars around FAIR.
So we've already had the webinars on findable and accessible, today
is interoperable, and next week reusable.
So today I will give a brief introduction about what is interoperable as
described under the FAIR data principles from FORCE11. Then I'm very
grateful that Simon and Jonathan are available to talk about
what they did in practice in the OzNome project to make their data
interoperable. I think it's a great example to show how this quite
complex topic can actually be carried forward in practice.
So this is what FORCE11 says about interoperable, and first of all a
few things to keep in mind. So just reiterating a few things I
mentioned in the very first webinar. So when they talk about data,
and as you look at these headings you'll see that they talk about data
and metadata, so interoperable applies both to the metadata,
describing the data collection, and the actual data itself. Another point
to keep in mind is throughout the FAIR principles they think a lot
around not only data being usable for humans, but also for machines.
That provides huge benefits in bringing together disparate datasets, in
bringing together bits of knowledge that are distributed over different
datasets.
Interoperable is a key element there to make sure that data can be
brought together, and that we can actually get those
benefits out of bringing data together, which will enable new
knowledge discovery, new relationships to be discovered, new
patterns to be recognised. All those pieces of work.
So as we look at these three headings that they have listed under
interoperable, the first one there is that data and metadata use a
formal, accessible, shared and broadly applicable language for
knowledge representation. The thing to keep in mind there is that not only for
you as the researcher that has created the data, but also for
another researcher that wants to understand the data and use the
data, it's useful that they understand the language you have used.
That that is a standardised language, something that other users can
also pick up and use. So ideally that is the case for the metadata -
sorry, that is definitely the case for the metadata, and ideally that
would also be used in the actual data itself.
A very basic example, if a researcher has observed that they saw a
magpie they can write in, I saw a magpie. But it's much more useful
for a researcher somewhere else on the other side of the world that
you write in that it's an Australian magpie and that is a Cracticus
tibicen. That means that a researcher on the other side of the world,
through that standard language, will actually be able to better
understand what you meant and what that description is about.
Now it's not just the actual wording used, the vocabulary used;
it's also useful to have a framework around that which will
allow the data to be machine readable and picked up by
machines and used and interpreted. Now one obvious example which
gets mentioned quite a lot is using RDF and ontologies. That is quite
common in the life sciences, and a number of life science researchers
were quite active in the FORCE11 group. But one thing they
emphasise is that it doesn't just have to be through RDF and
ontologies. There might be other solutions for this, and they don't
want to make it exclusively through those technologies. So that's
something to keep in mind.
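As a minimal sketch of what that magpie example might look like in RDF (this is not from the webinar itself; the observation URI below is invented, while the scientificName and vernacularName terms come from the real Darwin Core community vocabulary):

```python
# A sketch of the magpie observation as RDF, using the rdflib library.
# The observation URI is a made-up placeholder; dwc:scientificName and
# dwc:vernacularName are real Darwin Core vocabulary terms.
from rdflib import Graph, Literal, Namespace, URIRef

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")  # Darwin Core
g = Graph()
g.bind("dwc", DWC)

obs = URIRef("http://example.org/observations/1")  # hypothetical identifier
g.add((obs, DWC.scientificName, Literal("Cracticus tibicen")))
g.add((obs, DWC.vernacularName, Literal("Australian magpie")))

print(g.serialize(format="turtle"))
```

Because both the properties and the species name are drawn from shared vocabularies, a machine on the other side of the world can interpret the record without knowing anything about the project that produced it.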
Regarding the making of data interoperable, that's what I've invited
Simon and Jonathan to come and talk about, and they'll be able to talk
about it in much more detail.
The second point here is around vocabularies and using vocabularies.
They emphasise that if you use a vocabulary, well, first of all try and
use one that already exists, and is agreed on by the community. If
you have terms in there that are not in that vocabulary, but otherwise
it fits, try and get them added to that vocabulary. Finally, if that is not
possible, then, and only then, start creating your own vocabulary. So
please don't go out and create vocabularies for everything. Rather
look if there is already a community agreed vocabulary. Also make
sure that that vocabulary itself is FAIR: findable, accessible,
interoperable, reusable.
So in your dataset you should have a reference to that vocabulary you
are referring to, and make sure that that vocabulary can be found for
as long as your dataset can be found.
Final point they make is that the data and the metadata should include
qualified references to other data and metadata. So what they mean
there is that it shouldn't just be a reference to another dataset, for
example, but should also include an indication of what that relationship is. So it's not
just it's related somehow to this other dataset, but perhaps it is a
subset of another dataset or it builds on another dataset using
standardised terminology.
A little more on qualified references, from the perspective of the
metadata especially, it's valuable to not only refer to other players or
other elements around your dataset, but to do that using identifiers.
So for example, if you are describing your dataset and somebody
was involved in creating that dataset, provide a qualified
relationship stating that that person was, for example, the
author of that dataset, and if possible also use an identifier to identify
that person. That allows other relationships to be made, and it allows
further connections to be made, and that information to be picked up
and used, especially when being analysed by machines.
So just a list here of possible identifiers; these are just examples, and there
are more identifiers out there. But for example, if you are referring to
an author include their ORCID, if you are referring to a publication use
the DOI that is related to that publication. If you are referring to
software nowadays you can assign a DOI to a software package and
refer to that DOI, et cetera.
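As a hedged illustration of such a qualified reference (not from the webinar; the names and identifiers below are placeholders, and the field names loosely follow the DataCite metadata schema):

```python
import json

# A sketch of qualified references in dataset metadata. Each reference
# carries both the identifier and the type of the relationship, loosely
# following DataCite-style field names. All values are placeholders.
metadata = {
    "creators": [{
        "name": "Example Researcher",
        "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",  # placeholder ORCID
        "nameIdentifierScheme": "ORCID",
    }],
    "relatedIdentifiers": [{
        "relatedIdentifier": "10.1234/parent-dataset",  # placeholder DOI
        "relatedIdentifierType": "DOI",
        "relationType": "IsDerivedFrom",  # the qualification on the link
    }],
}
print(json.dumps(metadata, indent=2))
```

The point is the relationType field: the link does not just say "related to", it says exactly how the two records are related, in terms a machine can act on.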
Well, I think I've rambled on enough for now. So I would like to hand
over to Simon and Jonathan, I'm very grateful that they have made
their time available. So just a brief introduction. Simon is a research
scientist at CSIRO Land and Water's Environment Information
Systems research program. He specialises in distributed
architectures and information standards for environmental data
focusing on geosciences and water.
Jonathan Yu is a research computer scientist specialising in
information architectures, data integration, linked data, semantic web,
data analytics and visualisation. He's part of the Environmental
Informatics Group in CSIRO Land and Water.
So together they have been very active in applying their thinking
around making data interoperable in the OzNome project. Now one
thing I want to point out is that in the OzNome project they did a whole
series of work around the FAIR data principles in all different aspects.
Today I have asked them specifically to focus on interoperable. But
please keep in mind that they have also done a whole bunch of other
work.
So without any further ado I'd like to hand over to Simon and
Jonathan. I'm very intrigued how they've picked up interoperability
and used that in the OzNome project.
Jonathan Yu: Okay, thanks Keith. So thanks for the introductions as well. So today
we'll be presenting on some of the work we did in the OzNome
initiative. Particularly looking at Land and Water and the data that we
have in CSIRO, and how to make that interoperable according to
some of the principles that FAIR espouses. We will also talk about
some of the implementations that we have explored to turn the FAIR
principles into actionable questions to assess how FAIR your data is.
So if you haven't come across OzNome, this is a CSIRO-led initiative
aiming to connect information ecosystems throughout Australia. The
OzNome name was coined echoing the genome project: Oz being
Australia, and Nome being genome-inspired. But
really what we're looking at here is tools, services, products, methods,
approaches and practices, and infrastructure to support having more
connected information infrastructures.
In the previous year, as Keith mentioned, we focused on
environmental information infrastructures. There's a couple of links
there you can follow. Today we'll be talking about an example in the
water space.
Simon Cox: Okay, so as part of establishing the OzNome architecture, OzNome
infrastructure, we felt that we needed to assist potential data providers
to understand what good data was - what, in the context of this seminar
series, FAIR data is; we called it OzNome data. Basically we
developed a rating - a set of rating criteria and a tool to allow
assessment by data providers of the data that they are providing.
On the right-hand side of the screen here you can see a
screen capture of the kick-off page of the tool.
You'll also notice that we've got a slightly adapted version of the FAIR
criteria - findable, accessible, interoperable and reusable - but we also
add in the last line there, trusted, which appears to go a little bit
beyond what has been conceived in FAIR until now, but which we suggest
would be a useful addition. We're kind of bundling the interoperable
and reusable together, we see those as being very closely related.
Obviously, it's teasing out some of the issues around what it is that
makes data interoperable. Keith's given a sort of high level overview
and indicated what some of the concerns might be.
We've done our own take on this, a bit - actually fairly strongly leaning
on our experience over a number of years, more than a decade now
actually of working in the data standards communities, in particular the
geospatial data standards communities. Some of the learning which
we've got from there we're applying directly in here. Our heritage is
environmental data, which is where we've largely been working. A lot
of that is geospatial, so it makes sense to be building on that.
Just a bit of a reminder, the FORCE11 FAIR principles, this is a
summary slide from Michel Dumontier, who's one of the original
authors of the papers and one of the developers of the FAIR principles.
The guiding principles have the four key words, which are
teased out into three or four sub-principles in each case under the F, A, I
and R letters.
We're looking at the interoperable set here, which Keith has already
shown. It's interesting that Michel has recently done a study
evaluating a number of repositories, particularly in Europe though some
of them are broader than that; here's the list of repositories that
were evaluated. He scored those on the FAIR principles; the data's
available in this form actually - this table shoots off to the right of the
screen and there's lots more going on there. But looking at the
summary of the results it's fairly notable that the tallest red bar here is
in the interoperable category. So what this is saying is, of the FAIR
data principles this is the one which is hardest to meet, the one that's
hardest to conform to.
So really that's the focus of the approach that we've taken, which is to
kind of lead people through how they can make their data more FAIR,
more OzNomic, more interoperable. The particular way in which
we've broken out the question of interoperability is on, if you look at
the numbered terms here, is it loadable, is it usable, is it
comprehensible, is it linked, as well as is it licenced.
I'm just going to go through some of the details of those, and you'll
sense it's fairly repetitive of some of the concerns that Keith
explained at the beginning. But we're putting some more concrete
examples onto these criteria just to indicate to our data providers that
when we say a standard data format we mean something like CSV or
JSON or XML or netCDF. These are all important file formats; towards
the left they're kind of general, but netCDF is one that's
used a lot in the remote sensing and environmental science
communities.
So we've got a bit of a ladder here of different levels of conformance
which you can reach about whether a dataset would be loadable. Is it
in a unique file format? Well, that means that you've got to have some
unique software to load it. Or is it in a standard data format, and
normally that would be denoted by one of the standard MIME types.
Best of all would be for data to be provided in multiple standard
formats, giving a choice to the user so that they can use whatever their
favourite platform for loading data is.
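As a small illustrative sketch of that last rung (not from the webinar; the table values and file names are invented), the same data can be written out in more than one standard, MIME-typed format:

```python
import pandas as pd

# Illustrative only: one small (invented) table offered in two standard
# formats, so users can load it with whichever tool they prefer.
df = pd.DataFrame({"date": ["2017-09-13"], "e0_avg": [4.2]})
df.to_csv("data.csv", index=False)         # text/csv
df.to_json("data.json", orient="records")  # application/json
```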
Next question: even when you've loaded it, can you use it? If the
structures within the dataset are unclear, then even once it's loaded
it's not going to be very usable. That comes down to
the matter of, is there a schema that's provided which makes explicit
the structures within the datasets. With a lot of sort of traditional data,
yeah, there's a structure in there but the schema's not available
independently of the data; if you like, the schema is implicit. It's not
formalised. The schema maybe is different every time.
A lot of spreadsheets are done that way, a spreadsheet has got a lot
of boxes. But if every time you use it you add different columns and
use the pages in a spreadsheet in a different way, then it takes a little
while for the users to get their heads around what's going on before
they can use it. So there's various explicit schema languages like
DDL, which is used for relational systems, or XML Schema.
There's something coming out in the open knowledge world these
days called data packaging, which allows you essentially to describe a
schema for a CSV file. Then you've got in the RDF, the semantic web
space, RDFS and OWL. JSON even has a schema language these
days, although it's not broadly used.
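As a hedged sketch of that data packaging idea, here is a Frictionless Data "Table Schema" describing a CSV file; the field names and descriptions below are invented for illustration:

```python
import json

# A sketch of a Frictionless Data "Table Schema" that makes a CSV file's
# structure explicit rather than implicit. Field names are invented.
schema = {
    "fields": [
        {"name": "date", "type": "date"},
        {"name": "e0_avg", "type": "number",
         "description": "Potential evapotranspiration, daily average (mm)"},
    ],
}
datapackage = {
    "name": "example-dataset",  # hypothetical package name
    "resources": [{"path": "data.csv", "schema": schema}],
}
with open("datapackage.json", "w") as f:
    json.dump(datapackage, f, indent=2)
```

Shipping the datapackage.json alongside the CSV means a consumer no longer has to reverse-engineer the column types and meanings from the data itself.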
So it's nice to provide data with a schema, but best of all would be to
say, for the data I am publishing I am using this community schema.
For example, the Open Geospatial Consortium
provides a number of community schemas for observations, for time
series, for hydrology, for geoscience. If you're publishing or
attempting to share data in any of these disciplines then best to go off
and find a community schema.
Then even when you've got it loaded and you understand what the
structures are you've still got the question about what the words and
numbers are inside the boxes. The column headings - are they
explicit enough to understand, or are they just shorthand for something
which the project leader, when he or she was developing the data, knew
they would understand the next week? Even he or she, if
they came back to it the next year, may not understand it. Best of
course is if the field labels are linked and do have explanations,
probably in plain text. Better still is to use standard labels, for
example the Unified Code for Units of Measure (UCUM) unit codes, or the
climate and forecast conventions coming out of the FluidEarth
community.
So the ladder that we've got here asks: are you using standard
labels? Are just some of the field names linked to standard
externally managed vocabularies, or are all the field names linked to
standard externally managed vocabularies? The ladder gets better
and better and better.
Then there's the question about how well linked your data is. Well, if it's just
a file sitting on a server somewhere and there's no links in or out,
yeah, you're lucky to find it. Most of the datasets that
this community would be expecting are indexed in a
catalogue or available from a landing page. That's the
situation where you've got inbound links to the dataset. Best of all is
when there are outbound links embedded or implicit in the data
structures in a dataset which say exactly how it's related. This links
in with some of the previous concerns that we had there about field
names and these kinds of things.
So I'm going to hand back to Jonathan to take you through a
case study that we've got here, based on the AWRA-L - the
Australian Water Resources Assessment datasets. So Jonathan.
Jonathan Yu: Yes, so as mentioned earlier, in the OzNome project we looked at a
practice example and a case study in the AWRA-L dataset. This is a
continental gridded dataset that has historical time series from 1911.
The Bureau of Meteorology publishes an operational version online, and
you can find that on their website. But often scientists basically have
to deal with this dataset by knowing where it is and knowing how to use
it implicitly, knowing how to reference the requisite geospatial features
and understanding the field name values.
So the next slide shows the
assessment of it using our tool. Just focusing on the interoperable
side of things, we have rated it as a web service - you can get it via the
web. However the reference definitions are text only, and they are
localised in the dataset itself. I'll give an example in the next
slide.
So this is coming out of the netCDF metadata for this dataset; you
can access this online through THREDDS or via netCDF
tools. This is a summary of the metadata that comes along with
the data. So we've got long name here, Potential evapotranspiration,
we've got the name which is a label for the field, e0_avg. Units, mm,
and a standard name, which is a convention in netCDF to refer
to a property - here e0_avg, which in this case isn't part of
the CF conventions that are often used with this format.
So if you are an expert in this area and you've used this dataset many
times you will know what this is. If you are a newcomer you have to
do at least a little bit of work to understand what
this data field actually means.
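As a sketch of inspecting that metadata programmatically (not from the webinar; the file name is a placeholder, while long_name, units and standard_name are standard netCDF attribute conventions):

```python
from netCDF4 import Dataset

# A sketch of reading the variable attributes described above with the
# netCDF4 library. The file name is a placeholder.
ds = Dataset("awral.nc")
var = ds.variables["e0_avg"]
print(var.long_name)       # e.g. "Potential evapotranspiration"
print(var.units)           # e.g. "mm"
print(var.standard_name)   # "e0_avg" here - not a CF standard name
ds.close()
```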
In the OzNome project what we did was enrich this with external
vocabularies. So if you go to the next slide, Simon - this is the same
field. We've added - these added lines at the bottom here, they tease
out what this particular data field means in the context of externally
defined vocabularies. So we've now enriched this with a scaled
quantity kind identifier, Potential evapotranspiration. It's an http URI
where you can resolve it and get a definition. Similarly for
substance or taxon, unit ID and feature of interest.
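A minimal sketch of what that enrichment step could look like (the attribute names follow the slide, but the URIs below are placeholders, not the project's actual identifiers):

```python
from netCDF4 import Dataset

# A sketch of attaching externally resolvable vocabulary URIs as extra
# variable attributes. All URIs are placeholders for illustration.
ds = Dataset("awral.nc", "a")  # hypothetical file, opened for appending
var = ds.variables["e0_avg"]
var.setncattr("scaled_quantity_kind",
              "http://example.org/def/property/potential-evapotranspiration")
var.setncattr("unit_id", "http://example.org/def/unit/mm")
var.setncattr("feature_of_interest",
              "http://example.org/def/feature/landscape")
ds.close()
```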
We'll just talk about what they are. So part of the
project was to explore whether we could define vocabularies for these,
from which we could reference outbound links from the data to the
definition. This is just a summary of what we did in the context of the
AWRA-L dataset. This is an example of potential evapotranspiration.
We've got a conceptual model here where we've got broader notions
of potential evapotranspiration. We've got linked relationships out to
things like feature of interest, object of interest, and unit of measure.
So this view provides a vocabulary entry for potential
evapotranspiration, not only the identifier for it, not only the description
for it, but a richer model than you would get if you just had
something inline. So you've got outbound relationships from this
concept to its related concepts essentially. So this is a demonstration
of defining the concepts externally, having them quite richly explained
through this medium, but having the ability to link that from the dataset
itself to this definition to make it more interoperable.
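As a hedged sketch of such an externally hosted vocabulary entry, expressed in SKOS via rdflib (all URIs and the definition text below are placeholders for illustration):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

# A sketch of a vocabulary entry: a concept with its own URI, a
# human-readable definition, and an outbound link to a broader concept.
PROP = Namespace("http://example.org/def/property/")  # placeholder namespace
g = Graph()
g.bind("skos", SKOS)

pet = PROP["potential-evapotranspiration"]
g.add((pet, SKOS.prefLabel, Literal("Potential evapotranspiration", lang="en")))
g.add((pet, SKOS.definition,
       Literal("Illustrative definition text for this concept.")))
g.add((pet, SKOS.broader, PROP["evapotranspiration"]))  # broader notion

print(g.serialize(format="turtle"))
```

Because the concept lives at its own resolvable URI, any dataset can point at it, and any consumer can follow the link to the definition and its related concepts.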
So if we have another dataset that talks about potential
evapotranspiration, it could potentially be linked and interoperable.
Doing a revised OzNome maturity estimation with the OzNome five-star tool,
and just focusing on the interoperable field - using
the same tool and assessing against the same criteria - we've gone up
from two stars to more than four stars in the interoperable space. The
reason for that is that we now have reference definitions as linked
data and externally hosted observed property vocabulary definitions,
rather than just inline labels of what it is.
It provides more interoperability and if the vocabulary was
standardised then we would have a higher estimation in that field. But
it's just a demonstration of how we went about making something
more interoperable through the OzNome project.
Simon Cox: Yeah, I'll just pick up at the end here and just comment that when we
were starting this data ratings exercise we actually didn't look at FAIR
at the beginning. We developed our own set of criteria, these key
words here, and then subsequently correlated them with the FAIR
principles. One of the interesting things was that there were three rows in
this table here, the ones in red, which didn't correlate with concerns
that had been identified within FAIR.
The first one might be seen as trivial, but we thought it was a question
that was worth asking, particularly when working with research
scientists and talking about making their data available. The first
question is: is your data intended to be used by
anybody else? There's lots of data generated which is never shared.
Now that's not necessarily a good thing, and to a certain extent having
the question there highlights the fact that there is a question to be
asked, and that some scientists and researchers need to be
encouraged to think about making their data available, about
publishing it.
So I think in terms of the FAIR principles this one was kind of the
implicit starting point. If it's published, yes, it's implicitly FAIR.
A couple of other rows: one concern which comes up is that
we've worked a lot with agencies that have sort of systematic data
collection processes, with systematic curation and maintenance
revisiting - a dataset is refreshed every day or every month or every
year, all that. That concern didn't seem to be particularly addressed in
the FAIR principles as they stand. So we'd say the concern about
whether the data is expected to be updated and maintained is
maybe a bit more than FAIR.
The bottom row there as well is the concern about trust. This is, if you
like, an elaboration of the assessment of data that you might do,
which is to get some information about how well trusted it is. Now a
lot of that is about who else is using it - that's often
the criteria you'll use. Who else is using it, how many times is it being
used, what other products have been generated from this dataset and
so can I trust it?
So just emphasising that row there, the interoperable one: it corresponds
with the interoperability which is what we've really been focusing on
today - the use of standards, I guess. Standards is a funny word; you
have to be a bit careful with it. Capital S standard - sometimes people
think that's just to do with ISO or Australian Standards or whatever.
Really the point of standards is that they are community
agreements. They are community agreements which are available for
additional members of the community to join in. But it's important to
think of them as agreements - agreements to do things in a common
way.
So finally just a slide with some links to some of the material that
we've been showing today. We'll say thank you for listening.
Keith Russell: Thank you Simon, thank you Jonathan. That was really interesting
and a really useful way to see what it actually means in practice.
Because I think interoperable can be quite a complex, difficult subject,
sometimes also one that requires much more knowledge of the actual
field of research that you're talking about. So I think this
is a great example of where you've been working in a specific field to
try and make that data more interoperable.
Thanks very much for your time, and this is a really interesting
discussion and really starting to tease out a number of the issues, and
a number of the things that probably will need developing further.
I've just put up a slide which links off to a number of resources, and
some of these Simon already mentioned. So ANDS has a service,
Research Vocabularies Australia, which anybody around the country -
or actually internationally also - can use if you don't have your own
tool to set up a vocabulary. That is a possible way of doing it. There
are also already existing vocabularies in there. So have a look at that
if that's of interest. We also have an interest group that works in this
space.
If you are looking at the metadata and having qualified relationships
within the metadata and using identifiers, there's a few links there to
places where you can find information about possible identifiers. We
are also trying to pull that metadata, describing datasets, together and
sharing that internationally through a number of hubs. That's taking
place through the Scholix project. Research Data Australia is sort
of an Australian hub contributing into that international effort.
So have a look there if you're interested.
We did the 23 (research data) Things program last year, and two of the things are
relevant for our discussion today. If you are interested in digging into
it a little further and discovering a little bit more about it, and
discovering what the vocabularies mean in practice, have a go at
Thing 12. Or if you are more interested in identifiers and linked data
have a look at Thing 14.
Finally I would like to first of all thank Simon and Jonathan again for their
time and for the excellent presentation and the insights that they
brought to the table. We would also like to acknowledge NCRIS, the
National Collaborative Research Infrastructure Strategy program, which
provides the funding for ANDS.
So thanks again and look forward to seeing you all next week.
END OF TRANSCRIPT