Transcript FAIR webinar #2: A for Accessable-06-06-2017
1. FAIR Data webinar series #2: A for Accessible –
ANDS Webinar
6 September 2017
Video & slides available from ANDS website
START OF TRANSCRIPT
Keith Russell: Welcome everybody to the second in this series of webinars about the
FAIR data principles. Today we are up to A for accessible. Last week
we talked about the first one, findable and now accessible and next
week we'll talk about interoperable and the week after that about
reusable. First of all, I'd like to introduce myself. My name is Keith
Russell. I'm from the Australian National Data Service. I'm your host
for today.
A big thank you to Susannah, Susannah Sabine in the background,
she's organising and co-hosting this webinar with me. Just as a bit of
background, the Australian National Data Service works with research
organisations around the country to establish trusted partnerships,
reliable services and to enhance capability around the sector to add
value to research data and to enhance the capability in the research
sector.
We are working together with two other NCRIS funded projects. So
that's RDS and Nectar, to create an aligned set of joint investments to
deliver transformation in the research sector. There you are. We
have three speakers for today. I'll do a quick kick off and just give a
very brief introduction to what the FAIR data principles say about
accessible.
[Unclear] words are denoted in square brackets
2. transcript-fair-webinar-2-accessable06-06-2017-170929031312.doc
Page 2 of 16
Then I'm really excited and very grateful for two of our speakers today.
David, David Fitzgerald. He is in this webinar and he doesn't have a
webcam, so that's why you don't see him at present. David is a data
manager at the Australian Longitudinal Study of Women's Health.
David is going to be talking about how in the study and how in the
data that's being provided, they make the data accessible.
I was especially interested in this perspective from the angle of
sensitive data and making sensitive data accessible. The other
speaker for today is Jingbo, Jingbo Wang, from NCI. I've asked
Jingbo to talk a bit about how NCI makes their data
accessible using services over the data, which can be interrogated and used
by humans and machines. First of all, I'd like to give a brief
introduction about the A in the FAIR data principles.
The A stands for accessible. The way FORCE11 described the principles
is that data and metadata, both of them, are retrievable by their
identifier, using a standardised communications protocol. When we say
retrievable by their identifier, that's the identifier we talked about last
week. That can be a DOI, a handle, a PURL, something that's
persistent. By using the DOI, handle or PURL, you should be able to get
access to the data or the metadata.
The protocol to get there should be open, free and universally
implementable. The thing to think about there is that it's something
that is a protocol which is standardised and can be used by
anybody. It's not something that is bespoke, not something that is
home built or badly documented. The classic example is just HTTP.
That's the very normal way of accessing materials and data over the
internet.
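As a rough illustration of retrieval by identifier over a standard protocol, the sketch below turns a DOI, handle or PURL into a plain HTTPS address that a browser or a script could follow. The doi.org and hdl.handle.net resolvers are the public ones; the function name and the example identifiers are made up for illustration.

```python
def resolver_url(identifier: str) -> str:
    """Return an HTTPS URL that a person or a machine can follow."""
    scheme, _, value = identifier.partition(":")
    scheme = scheme.lower()
    if scheme == "doi":
        return "https://doi.org/" + value
    if scheme in ("hdl", "handle"):
        return "https://hdl.handle.net/" + value
    if scheme in ("http", "https"):
        # A PURL is already a web address, so it is returned unchanged.
        return identifier
    raise ValueError(f"unrecognised identifier scheme: {identifier!r}")


# Hypothetical identifiers, used only to show the shape of the result.
print(resolver_url("doi:10.1234/example"))
print(resolver_url("hdl:1234.5/678"))
```

Nothing beyond ordinary HTTP is needed to follow the resulting address, which is the point the principle makes about open, free and universally implementable protocols.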
It should not require some specialised expensive software. Another
point they make in the data principles is that the protocol should allow
for an authentication and authorisation procedure where necessary. A
common misunderstanding is that when people read accessible
they think that means I have to make my data open. If you actually
read the FAIR data principles, that's not what they're saying.
What they're saying is accessible does not actually have to be open or
free. But you are expected to give exact conditions under which the
data are accessible. Even heavily protected and private data can be
made FAIR. If you implement the FAIR data
principles properly, then a human being can see that the data is
maybe not openly available, and also what steps they need to take to
get access to the data. The FAIR data principles
also talk about machine access to data.
If a machine goes hunting around looking for the data, the machine
should be able to recognise that the data is not open and what steps
need to be taken to get to the data. I'll talk about that a little further. If
the user, so that's either the human or the machine, has been granted
access to the data, then it should be accessible through some sort of
authentication and authorisation procedure, standard procedure.
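One way to picture access conditions that both a human and a machine can read is a small structured record attached to the metadata. The sketch below is only an assumption about how such a record might look: the field names, the three access levels and the contact address are invented for illustration and are not taken from the FAIR principles or from any particular repository.

```python
from dataclasses import dataclass, field


@dataclass
class AccessInfo:
    level: str                      # assumed vocabulary: "open", "mediated", "closed"
    contact: str = ""               # who to ask when access is mediated
    steps: list = field(default_factory=list)


def next_steps(info: AccessInfo) -> list:
    """Tell a person or a harvesting script how to get to the data."""
    if info.level == "open":
        return ["download the data directly"]
    if info.level == "mediated":
        # Prefer the documented procedure; fall back to the contact point.
        return list(info.steps) or ["contact " + info.contact]
    return ["only the metadata record is available"]


info = AccessInfo(
    level="mediated",
    contact="data-access@example.org",   # hypothetical address
    steps=["request access from the data custodian",
           "authenticate with the issued credentials"],
)
print(next_steps(info))
```

Because the conditions are data rather than free text, a machine that finds the record can tell that the dataset is not open and what would have to happen before it could be retrieved.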
The last point they make under the FAIR data principles about being
accessible is that, in the case in which data is no longer
available, at least the metadata should be accessible. This is of
course not ideal. But in some cases it is necessary to take the data
down. That could be if consent for use was only for a limited period of
time or maybe there has been a legal takedown notice or something
along those lines that really makes it impossible to keep making the
data available.
In that case, it is valuable to still keep up a metadata record describing
the data and explaining that the data is no longer available. Now just
to reinforce that accessible does not always have to be open, there
are clear cases in which data cannot be made openly available.
An obvious example is where data refers to human beings and specific
characteristics of those human beings, like information about their
health, their income or religion, attitudes, political persuasion, all that
sort of stuff.
That's not the sort of information you can make publicly available.
Other examples, and that's probably worth remembering, are other
sets of data, for example, data about threatened species. The
location of threatened species can be data which is not
something you want to make openly available, because that could
mean that the last few of those species are hunted down or collected.
A famous example is the Wollemi Pine: the location of those
specific trees needs to be protected. Finally, another example
where data cannot always be made openly available is where there are
commercial interests in the data. Maybe the metadata can be
shared, but there are commercial interests around the data itself.
In that case, it would not be appropriate for that to be made
openly available.
When considering making data accessible, we do argue to make it as
accessible as possible and as openly available as possible. One possible
angle there is just to provide the metadata as a starting point. If the
rest cannot be made available, at least provide the metadata. Slightly more
useful perhaps is making it available through mediated access and in
that case, it's valuable to be clear about how the user can actually get
access. That can be by providing an email address, name or
telephone number.
If, for example, the user has to go through an ethics procedure to get
access to the data, then clearly describe that ethics procedure and
what sort of information is required to apply for that ethics procedure.
I was talking about the mediated access and about providing
information about who to contact if you want to get access to the data.
One thing to keep in mind there is, if you list a person
within the organisation, have a think about whether that person
is ever going to leave.
If that's a researcher, they may be going to another organisation. Have
a fall back, some sort of mechanism, or maybe
a more general email address, so that when that data custodian leaves,
somebody else can at least answer the question and grant access to
the data. Another possible angle in making data accessible is
creating a de-identified version of the data and making that public, as
long as it's properly de-identified.
That can be useful for certain data users, who at least get a better view
of what's in the dataset. For some purposes, a de-identified version
can be enough. Finally, a good point to keep in mind is if you do want
to make the data accessible, plan for this in your consent forms,
because coming back afterwards and trying to get consent is not
easy. Another angle worth keeping in mind, and that's something I've
invited Jingbo to talk about more, is that making data accessible
can be through various routes and various protocols. In some cases
it doesn't make sense to have a large dataset available through
download. In some cases it can make much more sense to have
services over the data which allow the users to interrogate parts of the
data and pull in parts of that data that are much more specific
and answer their requests. That can be for a human being, but
especially for a machine, that can be extremely useful.
One thing to keep in mind there, you need some sort of community
agreed standards around that. But Jingbo is going to talk much more
about that. So that was all from a much more theoretical perspective.
I'm very grateful that I have two speakers today to talk about
accessible in practice and how they've actually tackled making data
accessible.
The first speaker for today is David, David Fitzgerald. He is the data
manager at the Australian Longitudinal Study of Women's Health. I'm
very grateful that David is available to talk about what ALSWH has
done to make quite sensitive data still accessible for others to reuse.
David is on the line and I would like to hand over to David and then
David can talk about how, in the Australian Longitudinal
Study of Women's Health, they have made data accessible.
David Fitzgerald: Thank you Keith. Okay. I am David Fitzgerald, the data manager for
ALSWH, that's how I pronounce it, the Australian Longitudinal Study
on Women's Health. I'll be talking about the accessibility issues for
this. I'm going to first of all explain and give background to our study
and then talk about the accessibility issues and try and relate them to
the FAIR data principles, which I've just listed here.
These are the exact ones, which Keith showed earlier. I won't go
through them in detail. But I'll try and relate these to our study. Okay,
so what is the ALSWH study? It's a collaborative effort, project from
the two universities of Newcastle and Queensland. In fact, the two
universities there, related to keeping the sensitive data, which I'll talk
about briefly. It's one of Australia's longest running longitudinal
epidemiological studies.
It's been going since 1996 and is ongoing. We hope to go further into
the future, funded by the Australian Government. We started off with
over 40,000 women and a few years ago we got a new cohort of
17,000 women. I'll show you the four cohorts we work with. Here
they are. The four cohorts are age based and we define them by their
years of birth. You can see there is the oldest one, born 1921 to
1926 and there are three other ones of various ages.
As you can imagine, each cohort has their own health issues and
that's what we're interested in and indeed, the Australian Government
is interested in. What are we collecting and our methodology, so
health issues, in particular mental, physical, reproductive, social
health. There is more and also life transitions, the different ages of
women obviously going through different life transitions, life events
and things which are related to health and employment, health service
use and more.
I'll just mention a bit of data linkage. I don't want to stress this
because it's a big area with lots of issues. But we have actually linked
our survey data with some administrative datasets. In fact, they're
listed there: the MBS, PBS, Cancer Registries and admitted
patient hospital data. The linked data is particularly sensitive and we treat
it quite differently in how we make the data accessible.
The data is used extensively and, in particular, more than 680 peer
reviewed papers have been published using our data and also we report
back to the Government frequently and national health policies have
been informed by reports and use of our data. I'll go on to the aspects
of accessibility and see how they relate to our data. So, that one there
is about being retrievable by an identifier using a standard
communications protocol.
All the datasets from our survey which are analysed and used
have an identifier, the same identifier, and I'll just stress here, the data is
de-identified but with a consistent new identifier. That's across all
surveys. Anyone using our survey data, and I'll just put the caveat, as
long as it's not part of the linked data, anyone using this survey
data has one and only one identifier for use.
We say this has been de-identified because there are no personal
names on the data. No addresses. No postcodes. No dates of birth,
although the year and month of birth are actually given, obviously to
do things like age analysis. They're the main ones. But any
other data which is deemed to be identifiable is stripped off. The
identifier is what we call the ID alias.
It's actually not the administrative ID, which a respondent would see or
somebody working in an office in Newcastle who is communicating
with our respondents. They would not know what the
analysable identifier is. They would have a different administrative ID.
Just on this point. Any small cell sizes which we think are identifiable
are grouped into larger groups.
For example, country of birth we group into broad continental,
geographical areas to avoid particular countries of birth coming up.
Anyone using the data, along with a number of other
conditions, must not identify a respondent. Although we
go to lengths to make that very difficult, it's conceivable that
something could come up.
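The de-identification steps described above (dropping names, addresses and postcodes, keeping only year and month of birth, swapping the administrative ID for an ID alias, and grouping country of birth into broad regions) can be sketched roughly as follows. This is not the study's actual procedure: the field names, the region lookup and the alias table are all assumptions made for the example.

```python
DIRECT_IDENTIFIERS = {"name", "address", "postcode"}

# Hypothetical lookup from country of birth to a broad geographical area.
REGION_OF = {"New Zealand": "Oceania", "Italy": "Europe", "Vietnam": "Asia"}


def deidentify(record: dict, alias_of: dict) -> dict:
    """Return a copy of a survey record that is safer to release."""
    replaced = {"admin_id", "date_of_birth", "country_of_birth"}
    out = {key: value for key, value in record.items()
           if key not in DIRECT_IDENTIFIERS | replaced}
    # The analysable ID is a separate alias, not the administrative ID.
    out["id_alias"] = alias_of[record["admin_id"]]
    # Keep year and month only (assumes an ISO date string, YYYY-MM-DD).
    out["birth_year_month"] = record["date_of_birth"][:7]
    # Small cells are avoided by reporting a broad region instead of a country.
    out["birth_region"] = REGION_OF.get(record["country_of_birth"], "Other")
    return out


record = {
    "admin_id": "A0001", "name": "Jane Citizen", "address": "1 Example St",
    "postcode": "2300", "date_of_birth": "1948-03-17",
    "country_of_birth": "Italy", "general_health": "good",
}
print(deidentify(record, alias_of={"A0001": 900123}))
```

Keeping the alias table separate from the released data is what lets the same alias be used consistently across surveys without exposing the administrative ID.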
But they promise and sign that they will not identify respondents if
they ever have that possibility. I was also asked to look at legal and
ethical issues. We do have a legal contract with the Australian
Government Department of Health. This is ongoing and
we didn't get a 20 year one. They are regularly updated, short
term contracts. Also, the ethics committees from the two universities
there have approved usage.
In fact, every time we do a new survey, because it's longitudinal, every
year we're actually going back to at least one of the cohorts to survey
them. Each new survey which is not identical to previous surveys is
subject to ethics committee oversight and approval. We do have
extensive legal and ethical issues there. I want to talk about how an
investigator or a re-user would get access to our survey data.
As we explained, this is all on the website. But they must
first complete an expression of interest form and in particular they would
say who they are, why they are a serious researcher, what they want
to find out from the data. That would be reviewed by our Publications,
Substudies and Analyses committee, that's the PSA committee.
Then if their EOI, expression of interest, is approved, they will sign the
confidentiality data use documents, statements, before receiving the
de-identified data. They also must report back to us about their
progress and we expect some immediate work on the
data and for them to continue with that access. But if their expression
of interest is successful, the data are sent to them and this is an area
which I'm directly involved in.
Before sending it out, we encrypt it. We use 7z software and
the data is compressed as well. We use the AARNet CloudStor system to
send data to the approved researchers or reusers, and an email is
sent to them as well with passwords, but also to establish contact with
the management here, for future correspondence. I just put a note
there about we have linked data, but we never send this out. Anyone
using this has to come to our offices.
Or there is the Sax Institute SURE facility which also can have it. But
we don't own the linked data. We have agreed not to send it
anywhere. Public metadata, this refers back to protocol being open.
We have a website which lists the above procedure, in fact, that I went
through. But also has a lot of metadata on it, including a data
dictionary which lists all the variables and the many datasets we have,
a data dictionary supplement, which is a description of the frequently
used variables with some detail, a data map that shows how the
variables are used across the different surveys and cohorts.
When I say different surveys, the study being longitudinal, we have up to eight
surveys for some of our cohorts. Each one is deemed a different
survey and has slight differences from other surveys. We have a list
of all the variables used and spreadsheets for easy access. We also
have data books which list the frequency summaries of the variables.
There are also the questionnaires that the respondents filled in, technical reports
which we produce that go into detail on many of our reports, and a
frequently asked questions page on exactly that. So, that is making metadata
accessible. In fact, although our data is not
completely open, we do want to make it accessible. We do archive
both the metadata and the data and we do that annually with the
Australian Data Archives.
Although they are not releasing it yet, the plan is in the future for them
to take over release of our data, perhaps when we're not doing it
ourselves. That will be a role to keep our data useful and used in the
long term. That's what I've got to say. I'd just like to acknowledge the
women in our study who fill in the surveys and of course the
Government Department of Health for funding us and the Universities
of Queensland and Newcastle for doing the job. Thank you.
That's what I have to say.
Keith Russell: Thank you David. Thanks, just a really interesting presentation.
Interesting to hear how you've made data accessible in practice and
what it means to make sensitive data accessible to researchers.
Thanks for that perspective and thanks for that view on how quite
sensitive data can still be made accessible through various routes.
I think it's really interesting to hear that you have both the route of
de-identified data through appropriate routes, but also linked data. So a
much richer version, but then through either the Sax Institute SURE facility
or through coming to the ALSWH, the Australian Longitudinal Study
of Women's Health. I've got to work on that one. Thanks.
Okay, I would like to now move onto Jingbo. I've
asked Jingbo to talk also about making data accessible from a very
different perspective. Jingbo works at NCI, and NCI does all
sorts of elements around making data findable, accessible,
interoperable, reusable. Today I've asked Jingbo to focus on the
accessible side of things. But I do want to note that NCI also does a
whole bunch of other things in this space.
Jingbo Wang: Thank you Keith. I think I will just turn off my camera, because I can't
see my presentation. My name is Jingbo Wang. I work at the National
Computational Infrastructure, which is a supercomputer centre
located on the Australian National University campus. Today I'm going to
address a different flavour of data accessibility practice at NCI. Before I
go further, I just want to make a comment that the FAIR principles are quite
useful to govern our data management practice.
We use it a lot in every single aspect in our data management. This is
a quick overview of the datasets we have. As you can see, I've listed
here the main data types that we store at NCI: national collections
about climate models, satellite images, bathymetry, elevation,
hydrology, geophysics. Those data are quite geospatial focussed.
But we also have other social science data and genomics sequencing
data and astronomy data. We aim to provide a user with data as a
service, as many digital repositories will do. In our data management,
we catalogue data so that people can query the metadata database to
find what we have here. We also publish data through various data
services. That's a focus I'm going to talk about in the next few slides.
We offer data quality assurance, data quality control and
benchmarking use cases. We provide data through virtual
laboratories. We also provide help on data visualisation. If I wanted
to name something that makes us different from other
repositories, it is that we are co-located with an HPC facility, high
performance computing. Given the large scale of the
data, we host more than 10 petabytes of research data.
We really want to make good use of the high-performance computing
here to advance science research. These are the six dot points that I
wanted to address today about data access. I put the red coloured
words to show the difference for each point. Initially, I will talk about
how we control the data access and then I'm going to present
one example of how we use persistent identifiers to manage data
access.
Then I will talk about two main data services that we offer at NCI for
our users. One is THREDDS, and the other one is GSKY, which
is a more fancy and scalable distributed data server. Finally, I'm going
to cover very quickly the data versioning and the quality of the
data. The first point is about how we control the data access.
Most of our data comes from our stakeholders, such as
Geoscience Australia, the Bureau of Meteorology, CSIRO and universities.
Much of the data has been funded by the Australian Government, so it naturally
falls under the CC BY 4.0 licence. Some owners also require that the data
be under a non-commercial, non-derivative or share-alike type of CC
BY.
We also have international partners, such as in Europe and the US,
and they impose even stricter terms and conditions if people want to
access the data. This is the legal perspective on how we
control data access through licences. On the file system
we hardcoded the data access control using ACLs.
This is how we separate different groups of people
accessing the same data. Basically, for each collection, we
have two access groups. The first group has a read and write
permission, which means those are data managers who are able to
generate data, write data and modify data. The second group is a read
only group.
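The two access groups per collection can be modelled in a few lines. This is only a toy model of the idea, not how the file system permissions are actually configured at NCI; the collection name and user names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    name: str
    writers: set = field(default_factory=set)   # data managers: read and write
    readers: set = field(default_factory=set)   # other authorised users: read only

    def can_read(self, user: str) -> bool:
        return user in self.writers or user in self.readers

    def can_write(self, user: str) -> bool:
        # Only the read and write group may generate or modify data.
        return user in self.writers


# Hypothetical collection and users.
climate = Collection("climate-models", writers={"manager1"}, readers={"researcher1"})
print(climate.can_write("manager1"), climate.can_write("researcher1"))
```

Anyone outside both groups gets no access at all, which is the default a permission model like this should fall back to.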
For those people who are in the read only group, they can access the
data on the file system. But they can't really modify anything. This
way we actually protect the integrity of the data. We only give access
to an authorised person, who really can manage the data. There is
also a social aspect of data access. For a research project, we often
see an embargo period, so that maybe after two years of the project the
data can be made available.
Also, some researchers say I want to share my data after my journal
article about this dataset is published. Another example is from the
Bureau of Meteorology. We have data where there is a six
month time delay between the data being developed and verified and
it being operational and available on our THREDDS server.
The second point I wanted to raise is our practice around implementing
a persistent identifier. Often we experience some frustration that
when we give people the URL to access the data it is only valid for a
certain period of time or only valid during the time that somebody can
maintain it. Afterwards, we can't really guarantee it. Look at
the original URLs on the left-hand side of the slide.
Those are the metadata catalogue URL and the service endpoint URL.
Let's look at the second one, which is service endpoint. From this
URL [unclear], you can tell the later part includes the project
code, file path and file name. If anything in this path changes, for
example the project code changes, you rename the file or we shuffle
the file around, this link will be broken.
The original URL that we provided here is not a very stable one. We
adopted a product that CSIRO developed some time ago, a
persistent identifier service that acts as a broker. Now, most of the time we give
the external user the right-hand side, the name combination. As you
can see, we have four main categories after pid.nci.org.au. We
have dataset, we have services. We have documentation and we
have vocabularies.
The only thing that needs to be unique is the file identifier or UUID.
Basically, as long as the identifier stays the same, the URL on the
right-hand side stays consistent. If anything changes in the original
URL on the left-hand side, what we need to do is update the mapping
inside of the PID service broker, without interrupting the URL that we
give to the external user.
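The broker idea can be sketched as a lookup table that sits between the stable address and the current location. The class below is a toy stand-in, not the CSIRO product or NCI's deployment: the category names used in the path, the https scheme, the short identifier and the target URLs are all assumptions for illustration.

```python
BASE = "https://pid.nci.org.au"                      # scheme assumed
CATEGORIES = {"dataset", "service", "doc", "vocab"}  # assumed path segments


class PidBroker:
    """Maps a stable, published URL to wherever the resource lives today."""

    def __init__(self):
        self._targets = {}

    def register(self, category: str, identifier: str, target_url: str) -> str:
        """Create or update a mapping and return the persistent URL."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        self._targets[(category, identifier)] = target_url
        return f"{BASE}/{category}/{identifier}"

    def resolve(self, persistent_url: str) -> str:
        category, identifier = persistent_url[len(BASE) + 1:].split("/", 1)
        return self._targets[(category, identifier)]


broker = PidBroker()
pid = broker.register("dataset", "f1234", "https://data.example.org/projA/file.nc")
# The file moves to another project; only the mapping is updated.
broker.register("dataset", "f1234", "https://data.example.org/projB/renamed.nc")
print(pid, "->", broker.resolve(pid))
```

The address handed to external users never changes; only the entry inside the broker does.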
We have the technical implementation published in the teachers'
science journals, so you are welcome to have a look. Now I'm going
to talk about the main data services that Keith really wanted me to
address from NCI's perspective. I divided our types of data service
into two main groups. One is the OGC services. I'm going to talk
more about what is OGC in a second.
The other type of data service is more project specific. For example, we
are one of the largest nodes in the southern hemisphere of the
Earth System Grid Federation, which is the aggregation of climate
models from global research institutes. The way we provide services
is we copy the main part of the model data to serve Australian users.
Another fancy data service I am going to show you a bit more of is
GSKY. It's a scalable data server that directly interacts with our file
system. What is OGC? OGC is the Open Geospatial Consortium. It is
an international non-profit organisation that makes quality open
standards for the global geospatial community. We find OGC standards are
quite useful for us because we have a lot of geospatial featured data.
OGC has all sorts of standards for different types of mapping, feature,
coverage and processing services for us to use. Because they are so common and
free for people to use, if we make data available through OGC
standards, a lot of people can naturally access our data. That's the
motivation. What are OGC services? An OGC service is actually an API in the middle
between the data store and the user.
The user can request whatever is available on OGC services. Let's say
I want a map of an anomaly across the whole Australian continent.
NCI hosts this data. But we host the data, we don't host images.
What the OGC web services do is actually extract the image
and return it to the user. The user can take the URL, which
contains the image of the data, and put it on their own web portal. For
example, you can get the URL and copy and paste it into the
NationalMap to show the grids.
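To make the idea of a URL that returns an image more concrete, here is a minimal sketch that builds an OGC Web Map Service (WMS 1.3.0) GetMap request. The parameter names are the standard WMS ones; the endpoint, the layer name and the rough bounding box for Australia are placeholders chosen for the example, not real NCI services.

```python
from urllib.parse import urlencode


def wms_getmap_url(endpoint, layer, bbox, width=800, height=600):
    """Build a WMS 1.3.0 GetMap URL that returns a PNG image."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",                      # empty means the default style
        "CRS": "EPSG:4326",
        # In WMS 1.3.0 with EPSG:4326 the order is min_lat, min_lon, max_lat, max_lon.
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint and layer; the box roughly covers Australia.
url = wms_getmap_url("https://example.org/wms", "demo_anomaly", (-44, 112, -10, 154))
print(url)
```

A URL of this shape can be pasted into any client that speaks WMS, which is why publishing through the standard makes the data reachable by tools the data host never built.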
NCI has two main production data services. One is
THREDDS. You can often find the THREDDS link available in our data
catalogue. This is the interface of the GeoNetwork. The red circled
link is the NCI web server. You can open and click it. A second
interface is a data catalogue. They more or less contain the same
information, but serve different purposes. GeoNetwork is mainly
for data harvesters, for machine access.
The data catalogue is for human readers. THREDDS, in very
simple terms, is a data service which allows you to browse and
access the data. I've listed here six main types of services that
THREDDS offers. The very first two, OPeNDAP and NetCDF Subset, are for
sub-setting the data. We have a lot of very large data. But in practice
when scientists access the data they don't necessarily have to access
all the data.
They might just need a very small piece of data from this big pool.
What THREDDS can offer is that you can define your query and only
get the part of the data that you want. It really saves a lot of traffic on
the internet. The other two are standard OGC services: the Web Map
Service and the Web Coverage Service are very popular for people to
access maps and coverages directly out of our data.
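Sub-setting through OPeNDAP comes down to adding a constraint to the dataset URL, so that only the requested slice travels over the network. The sketch below builds such a URL using the DAP2 hyperslab form `[start:stride:stop]`; the server address, file path and variable name are hypothetical.

```python
def opendap_subset_url(dataset_url, variable, slices, response="ascii"):
    """Build an OPeNDAP URL asking for part of one variable.

    slices holds one (start, stop) index pair per dimension; in the DAP2
    constraint syntax the stop index is inclusive.
    """
    hyperslab = "".join(f"[{start}:1:{stop}]" for start, stop in slices)
    return f"{dataset_url}.{response}?{variable}{hyperslab}"


# Hypothetical THREDDS path and variable: first time step, a 100 x 100 window.
url = opendap_subset_url(
    "https://example.org/thredds/dodsC/demo/temperature.nc",
    "tas",
    [(0, 0), (100, 199), (200, 299)],
)
print(url)
```

In practice a client library would usually build this constraint behind the scenes, but the request on the wire has this general shape.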
Of course, THREDDS offers a very quick data viewer. If you don't
know what this data is, you can have a quick look at what it is on the
web, without downloading it. Of course, THREDDS also offers
direct download, if you really want to download the data. Another
fancy scalable distributed data server that I was talking about is called
GSKY. GSKY is a product developed in-house at NCI.
What it does is this. We have a lot of data on a file system, millions and
millions of files on the system. If we wanted people to query this data,
how? It's going to be very hard to create millions of metadata
records for every single file. What we've done is use a crawler to
crawl the file system, get the header of each file and compile the headers into a
metadata database.
Then the database will be a clear window for people to hand in a
request, such as: give me some images in this polygon at some
time. The metadata database includes the essential
geospatial information, so it returns to the user what they requested.
We have recently published technical details of the GSKY
implementation. You're more than welcome to have a look.
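The crawl, index and query pattern can be illustrated with an in-memory SQLite database. This is not GSKY's implementation: the header records are hand-written stand-ins for what a crawler would read from real file headers, the table layout is invented, and a simple bounding box is used in place of an arbitrary polygon.

```python
import sqlite3


def build_index(headers):
    """Load per-file header summaries into a small metadata database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files (path TEXT, min_lon REAL, min_lat REAL, "
        "max_lon REAL, max_lat REAL, t_start TEXT, t_end TEXT)"
    )
    db.executemany(
        "INSERT INTO files VALUES (:path, :min_lon, :min_lat, "
        ":max_lon, :max_lat, :t_start, :t_end)",
        headers,
    )
    return db


def find_files(db, bbox, when):
    """Return paths whose extent overlaps bbox and whose time range covers when."""
    min_lon, min_lat, max_lon, max_lat = bbox
    rows = db.execute(
        "SELECT path FROM files "
        "WHERE max_lon >= ? AND min_lon <= ? AND max_lat >= ? AND min_lat <= ? "
        "AND t_start <= ? AND t_end >= ? ORDER BY path",
        (min_lon, max_lon, min_lat, max_lat, when, when),
    )
    return [row[0] for row in rows]


# Hypothetical header summaries (ISO dates compare correctly as text).
headers = [
    {"path": "/data/tile_a.nc", "min_lon": 110, "min_lat": -45, "max_lon": 130,
     "max_lat": -25, "t_start": "2017-01-01", "t_end": "2017-01-31"},
    {"path": "/data/tile_b.nc", "min_lon": 130, "min_lat": -45, "max_lon": 155,
     "max_lat": -25, "t_start": "2017-01-01", "t_end": "2017-01-31"},
    {"path": "/data/tile_c.nc", "min_lon": 110, "min_lat": -45, "max_lon": 130,
     "max_lat": -25, "t_start": "2017-02-01", "t_end": "2017-02-28"},
]
db = build_index(headers)
print(find_files(db, (115, -35, 120, -30), "2017-01-15"))
```

The query touches only the small index, so the large files themselves are opened only once the server knows which ones are relevant.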
Keith Russell: Jingbo? Sorry, Jingbo. I think you're getting close to the end. I just
wanted to ask you - there is only about a minute or two left, so if you
could work towards the end that would be…
Jingbo Wang: Sure, I'll quickly go through. The last two points are data
versioning and data quality.
Again, because of the scale of the data, we can't really store every
single step of the data. What we can do is store the raw data and
the final version and keep the URI of the metadata for the middle
steps. In that way, the provenance information is kept and storage is
also saved.
The last point is data quality. Some users say we
can't really assume that when we can access data, the data is flawless. By
publishing data alongside a quality report, we want to provide
data access with a certain type of assurance. We also have a
publication that is going to be in place very soon. Thank you for your
attention. That's our experience so far with data access.
Keith Russell: Thanks. Thanks Jingbo. That was really a very quick overview of all
the work you've been doing there around services and all the work
you've been doing there about making data accessible, not only for
humans, but also for machines. First of all, I would like to thank David
and Jingbo again for providing an insight into what it means in practice
to make data accessible from different perspectives.
Those were very interesting presentations. In case you are interested in
learning more about making your data accessible and things you can
think about there, this slide provides you some resources. The
med.data data project has got a number of materials around sensitive
data. There is a link here to the Australian Data Archive and the
access conditions there. The ANDS website has some materials
on sensitive data.
Another piece of work we're doing together with the community is
looking at data services. This is the work that Jingbo also talked
about, making sure that the services over the data are discoverable.
There is an interest group working in this space. If you are interested
in learning more about it and also engaging more around that, please
follow the link and there is more information in there about that data
services interest group.
Last year we also did 23 things, research data things and two of the
research data things are relevant to the topics discussed today. Have
a look at thing 10 and thing 19 if you want to learn more and also want to
get your hands dirty and try out a little bit what it means to make data
accessible. The link at the bottom is just a general link about the
FAIR data principles on the ANDS website.
This week we talked about accessible. Next week we are going to be
talking about interoperable. Thank you all for your attention. Finally, I
would like to acknowledge and thank first of all our speakers for today
but I would also like to thank NCRIS, the National Collaborative Research
Infrastructure Strategy Program, for funding ANDS and making this all
possible. Thank you all for your time and I look forward to seeing you
next week.
END OF TRANSCRIPT