Transcript - Tracking Research Data Footprints via Integration with Research Graph
[Unclear] words are denoted in brackets
Webinar: Tracking Research Data Footprints via
Integration with Research Graph
1 March 2018
Video & slides available from ANDS website
START OF TRANSCRIPT
Facilitator: Good afternoon everyone, thanks for coming to the webinar today. We
have a talk today on the topic of tracking the footprint of research data
across infrastructures, using the Research Graph API. The speakers
today are Doctor Ben Evans from NCI, Associate Director of NCI, and
Doctor Jingbo Wang who's a Collection Manager in NCI. So with that
introduction, I'll actually hand over the talk to Ben for starting the talk.
Ben Evans: So we're going to be talking about work that's going on to help track
research data and how it's used in a broader setting. I should mention,
NCI's got a lot of partners as a part of this that have been backing and
worked with us in this, including from NCRIS and Bureau of
Meteorology, Geoscience Australia, CSIRO, the ANU and a host of
other partners and collaborators, including ANDS in particular, for this
work.
So some of the open questions, motivating questions, beyond just
getting data management in place is - so say you publish data and
datasets, is how is the research community actually connecting with
that data? After you've put it into a public arena they could be
connecting with it in various ways and making [use], so how do you
track that? Also, how do you track the impact of that investment of that
research data for other derived products downstream? So that's a
challenging question that we can't answer fully [with inside] of a single
centre; you're really into an international world. That motivated us a lot
to be working on this particular project which has part of the solution.
So I should say that the standing of this work and this piece of
infrastructure that we'll be going through on Research Graph started
with a fairly small partnership. But now it's grown quite a bit and RDA,
Research Data Alliance, have picked it up as this Registry
Interoperability Working Group. It's got a number of players. You can
see some of the players who've been strongly supporting this work
over a period of time listed there and you can follow that link on RD
Alliance website to track this. But, furthermore, now really through
Amir's good work and others, the European Commission have picked
this up and said, yes, this needs to now be pushed into an ICT
specification. So all that is to say that this work is now on a pretty
strong pathway and well worth paying attention to now as it goes
forward.
So there's four types of what we call nodes in this graph network when
you're publishing data and using data. So one is the researcher, one's
the dataset, one's the publication, one's grants. There could be other
nodes as well, but the status of these whole graphs at the moment is
basically built up of those fundamental areas. When we get down to it
inside of the tool, you can see the attributes through that graphic on
the right-hand side. Researchers are always in green and datasets are in
orange, publication is blue and grants in yellow. You can see some of
the attributes that are listed there and we'll talk about that.
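As a rough illustration of the four node types just described, the sketch below models a tiny graph in Python. The class names, field names and sample identifiers are assumptions made for illustration, not the Research Graph schema; only the four node types and their display colours come from the talk.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    """The four node types named in the talk, with their display colours."""
    RESEARCHER = "green"
    DATASET = "orange"
    PUBLICATION = "blue"
    GRANT = "yellow"


@dataclass(frozen=True)
class Node:
    key: str          # hypothetical local identifier
    type: NodeType
    title: str


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)   # key -> Node
    edges: set = field(default_factory=set)     # unordered pairs of keys

    def add_node(self, node: Node) -> None:
        self.nodes[node.key] = node

    def relate(self, a: str, b: str) -> None:
        # Store each relationship once, regardless of direction.
        self.edges.add(frozenset((a, b)))

    def neighbours(self, key: str) -> set:
        # Every key that shares an edge with `key`.
        return {k for e in self.edges if key in e for k in e if k != key}


# Made-up sample data.
graph = Graph()
graph.add_node(Node("researcher:1", NodeType.RESEARCHER, "Example Researcher"))
graph.add_node(Node("dataset:1", NodeType.DATASET, "Example Dataset"))
graph.relate("researcher:1", "dataset:1")
```

Calling `graph.neighbours("researcher:1")` then returns the keys of everything linked to that researcher, which is the basic lookup the visual tool performs when a node is selected.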
The other thing is that this graph network that's been built up
understands very well-known metadata standards like ISO 19115-4;
that's geospatial data, a lot of geospatial data fits into that. But also
things like RIF-CS that's used in the librarian world, and inside of
Research Data Australia - if you know that catalogue - uses RIF-CS,
and MARC 21 and there are others as well. So just to say that this
graph system is already supporting that framework.
For NCI, we make a number of major national reference datasets
available on NCI. We've curated them and put them into a certain
form. They come, in principle, from a lot of the science agencies,
being Bureau of Meteorology and Geoscience Australia and so forth,
also sometimes from our research community itself. But they're being
classified as really the major national reference collections that are
associated with NCI. You can see some of the things listed there,
climate, weather and satellite imagery, bathymetry, elevation, all of
these earth systems, geospatial data in particular.
As an example of a dataset now is - so we've got this thing called
Bluelink ReANalysis dataset. On the left-hand side it gives you a
summary of what it is. On the right-hand side many people are familiar
and work with catalogue systems, so we're using GeoNetwork as part
of our core catalogue system. So you get the title, so that's the blue -
you can see on the right-hand side it's circled there and an abstract
about it. You can see points of contact. So this is all part of this ISO
19115 standard, that's how all of this is recorded, how to get hold of
that data.
So the question that you've got off something like this is what
researchers are working on that, or related datasets, how they're
publishing, is there anything else connected to it. So you end up with
this little graph of stuff. Just down on the bottom right-hand side here,
just off this basic diagram here, you can see [Peter Oak], who's the
main contact for that dataset, is somehow associated with this BRAN -
Bluelink ReANalysis - dataset. So they're somehow associated with
that even off our local information. So you can find out a little bit more
about Peter. We have other information systems that have got Peter's
details, so what project he's working on, publications somehow linked
to him, his contact detail and a pretty picture there of Peter looking
very spritely.
So we have that information in NCI. So on the left-hand side, in this
dotted line that you can see with the NCI logo around it, we know a
fair bit about Peter, that's the number one with the green, there he is.
There he is with his - as a researcher and an identity and attributes
inside of our local information. We know various things about datasets
that Peter is associated with. But there's other things that live outside
of NCI. In particular, on the right-hand side there, you can say out in
the real world, or out in the external world, Peter Oak has what's
called an ORCID ID, and many of you know this. Inside of - associated
with his ORCID ID we know things about his publication record.
So the trick for all of this stuff is to try and associate our internal
information to the external information. There's a number of steps that
we go through here. Number one, let's have the information recorded
inside of a little graph that we'll go through in a second. Then we can
augment the graph with how it gets connected up with the ORCID ID.
Then we can find out further information, in particular about other
external records like his publication record.
So almost redescribing this same [step] is, in a fundamental way what
we do is we've got a GeoNetwork catalogue with a lot of this
information; then, via the utilities in the Research Graph system, we
harvest that and put it into Neo4j, which is a type of graph
database, just the one that we happen to be using for this. That Neo4j
is just hosted inside of the cloud. That has our information, it's just a
recasting of the local information and put inside of this system. Then
what we do is go out into a broader Research Graph on the outside
world, and we augment then the local graph database with that extra
information.
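The local-then-augment flow described here can be sketched in plain Python. This is a minimal sketch under stated assumptions: graphs are dictionaries of adjacency sets, all record identifiers are made up, and matching a local researcher to an external record through a shared ORCID iD is an assumed linking rule. The actual pipeline uses the Research Graph utilities and Neo4j rather than anything shown below.

```python
def augment(local, external, same_as):
    """Return a copy of `local` extended with first-level external neighbours.

    local, external: dict mapping node id -> set of neighbouring node ids.
    same_as: dict mapping a local node id -> the external node id that
             identifies the same entity (for example via an ORCID iD).
    """
    # Copy the adjacency sets so the local graph itself is left unchanged.
    merged = {node: set(links) for node, links in local.items()}
    for local_id, external_id in same_as.items():
        if local_id not in merged:
            continue
        # Link the local node to its external identity ...
        merged[local_id].add(external_id)
        merged.setdefault(external_id, set()).add(local_id)
        # ... and pull in whatever that external node is connected to.
        for neighbour in external.get(external_id, set()):
            merged[external_id].add(neighbour)
            merged.setdefault(neighbour, set()).add(external_id)
    return merged


# Hypothetical identifiers, for illustration only.
local_graph = {
    "nci:researcher/1": {"nci:dataset/1"},
    "nci:dataset/1": {"nci:researcher/1"},
}
external_graph = {
    "orcid:0000-0000-0000-0000": {"doi:10.0000/example-paper"},
}
links = {"nci:researcher/1": "orcid:0000-0000-0000-0000"}

augmented = augment(local_graph, external_graph, links)
```

After the call, the local dataset is connected through the researcher and the external identity to the external publication, while `local_graph` keeps only what was harvested locally.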
Then we can visualise it in various ways. So that's what this image -
and there is a graphical tool that comes along with this, to start seeing
a whole bunch of connected things to do with this data that can start
to be exploited. So if we just had the local information of various
datasets, then all we would have is the left-hand side of this. Through
that extra augmentation, going and querying in the international
Research Graph and then augmenting for the local data, we end up
with a much richer set of information about what each of the individual
datasets and researchers and what they're doing and their
associations. So that's pretty simply what's going on.
The Research Graph system that's been put in place really by the
partners, and particularly Amir driving this, interoperates with a whole
bunch of different services; ORCID, DataCite, Scholix has come on
board, and other major datacentres like [ASIS] and so on and so forth.
So there's a list there, and a growing list, of information being put into
an interoperable graph system. So now there's richer and deeper
details that we can start harvesting. There's actually - we did the
simplest augmentation, is the description on this previous page. But,
actually, you can run several levels of augmentation and we're still I
guess trying to explore what's the best way of augmenting the data of
the questions that we're trying to face.
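One way to read "several levels of augmentation" is as a depth-limited walk outward from a local node through the external graph. The sketch below assumes that reading; the function name, the `levels` parameter and the sample identifiers are all hypothetical.

```python
from collections import deque


def expand(graph, start, levels):
    """Collect nodes reachable from `start` in at most `levels` hops.

    graph: dict mapping node id -> iterable of neighbouring node ids.
    Returns a dict of node id -> hop count at which it was first reached.
    """
    reached = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if reached[node] == levels:
            continue  # do not expand past the requested level
        for neighbour in graph.get(node, ()):
            if neighbour not in reached:
                reached[neighbour] = reached[node] + 1
                queue.append(neighbour)
    return reached


# Hypothetical chain: researcher -> publication -> co-author -> dataset.
sample = {
    "researcher:a": ["publication:1"],
    "publication:1": ["researcher:a", "researcher:b"],
    "researcher:b": ["publication:1", "dataset:9"],
}

one_level = expand(sample, "researcher:a", 1)
two_levels = expand(sample, "researcher:a", 2)
```

With one level only the researcher's own publication is pulled in; with two levels the co-author appears as well. Each extra level makes the graph richer but also larger, which is the trade-off being explored.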
So, look, I'm going to hand over now to Jingbo who's going to take us
a little bit more through some of the details of Research Graph and
where it's going.
Jingbo Wang: Thank you, Ben. Hi, from this point of time I wanted to go through a
couple of slides, in the next 10 minutes or so, to demonstrate how we
implement the Research Graph [pack line]. Also, report what are we
currently working on, plus some future plans going forward. So in this
slide, it shows you what is the input and what is the output. The input
is NCI's metadata database. As you see in the previous slides by Ben,
our dataset available in GeoNetwork in various formats - it could be
CSV or XML or JSON - they are the input so that Jenkins server take
that input from the [data hub] and build the NCI graph. So the output
will be NCI graph.
On the right-hand side, the bottom screenshot just shows you how
easy to maintain and update the database with only one click of the
button. The five different modules, in green colour, shows you the
step-by-step inside of the Jenkins server to build the NCI graph and
also augmentation with other database such as a geo - [ORCID]. So
what we get eventually is an NCI graph [ML]. There are different ways
to visualise the graph. One way, which was not presented here, is we
can use the [GAVI] software to visualise. But a more popular way
would be we present our graph in a web-based format.
So if you click that link or type this link in your browser, you can
actually see this is online. I'm going to show you three screenshots on
this webpage, followed by a little live demo afterwards. Basically, this
is the interesting part, once we get the graph and we're going to
analyse the graph and try to tell the story from the graph. The first
screenshot just really gives you an overview of how many publications
in our augmented graph and how many datasets and how many
researchers here. I'm going to run a little live demo to repeat the story
that Ben told you about Peter Oak. If you type this,
researchgraph.org/NCI.
Jingbo Wang: Alright, in the web browser you can see a webpage about NCI's
graph. Click that orange button, it'll open a new tab to show the graph.
This is the actual graph look like. If I find Peter Oak as a researcher
and click that one, it only shows the connection with this researcher.
The colour code of the dot is that this is the dataset which is the
Bluelink ReANalysis data associated with Peter Oak. If you notice,
there is another green dot over here and this is the augmented part
from ORCID. The blue dot represents the publication associated with
this researcher. So this really demonstrates that, through the
augmentation, our own database with the dataset and researcher are
connected to the rest of the world.
Let me go back to my presentation again. I should say that we did
play around with the different analytics and this is the most interesting
part. We demonstrate a few cases that we think people are interested.
For example, what is the most publication related to a researcher, and
this researcher is always identified with the ORCID ID. Also, which
researcher has the most dataset associated with him, with his
affiliation. On the right-hand side, if you are still with the web browser,
you can actually put your mouse onto some of the name. It will only
show the connections between this researcher and other researchers.
So it's more like an interactive mode.
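The kind of ranking mentioned here, such as which researcher has the most publications or datasets, comes down to counting typed neighbours in the graph. A minimal sketch, assuming edges are stored as (researcher, linked node, linked node type) tuples; that layout and all identifiers are illustrative and are not the format behind the web page.

```python
from collections import Counter


def rank_researchers(edges, node_type):
    """Count, per researcher, the linked nodes of the given type."""
    counts = Counter(
        researcher for researcher, _target, kind in edges if kind == node_type
    )
    # most_common() sorts by count, highest first.
    return counts.most_common()


# Hypothetical edges: (researcher id, linked node id, linked node type).
edges = [
    ("orcid:A", "pub:1", "publication"),
    ("orcid:A", "pub:2", "publication"),
    ("orcid:B", "pub:3", "publication"),
    ("orcid:A", "dataset:1", "dataset"),
    ("orcid:B", "dataset:2", "dataset"),
    ("orcid:B", "dataset:3", "dataset"),
]

top_by_publications = rank_researchers(edges, "publication")
top_by_datasets = rank_researchers(edges, "dataset")
```

In this made-up sample, researcher A leads on publications and researcher B leads on datasets, which shows why the two rankings are worth reporting separately.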
I should also say that this augmentation is still work in progress. It
means that we can augment with other databases, such as DataCite
or other European data repository, and we can actually make our
graph bigger and bigger. The last screenshot is just showing the
number of publications along the year. As I said, this is not a static
graph because we can always augment with other database and we
can introduce more publication if it is not in the ORCID database. So
behind the scene we use the Jupyter Notebook to generate this web
interactive format. We plan to play around more by providing maybe
predefined query, so that people can put the person's name on
ORCID, find out what is the connection between this researcher and
the publication and the dataset and, in the future, even the grants if it's
available in our database.
So next is we think that Research Graph can be useful for a number
of different groups of people. We think also providing Research Graph
in the linked-data format would be beneficial for people who want to
work with more machine-searchable and actionable approach. So
what we've done is we did a bit of proof-of-concept work by extending
our current format of the Research Graph in JSON to JSON-LD, using
schema.org to enhance the schematic feature of the Research Graph.
We have a publication last year talking about the approach and the
ideas, so the reference is at the bottom of the slide.
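To make the JSON to JSON-LD step concrete, here is a small sketch using only Python's standard library. The input record layout and the type mapping are assumptions for illustration, and the approach in the referenced publication may differ. `Person`, `Dataset` and `ScholarlyArticle` are schema.org types, and `name`, `identifier` and `creator` are schema.org properties.

```python
import json

# Assumed mapping from graph node types to schema.org types.
SCHEMA_TYPES = {
    "researcher": "Person",
    "dataset": "Dataset",
    "publication": "ScholarlyArticle",
}


def to_json_ld(record, creators=()):
    """Convert one plain-JSON node record into a JSON-LD document."""
    doc = {
        "@context": "https://schema.org",
        "@type": SCHEMA_TYPES[record["type"]],
        "name": record["title"],
    }
    if "identifier" in record:
        doc["identifier"] = record["identifier"]
    if creators:
        # Nested nodes inherit the outer @context, so they only need @type.
        doc["creator"] = [
            {"@type": "Person", "name": c["title"]} for c in creators
        ]
    return doc


# Hypothetical records, for illustration only.
dataset = {"type": "dataset", "title": "Example Dataset",
           "identifier": "doi:10.0000/example"}
researcher = {"type": "researcher", "title": "Example Researcher"}

print(json.dumps(to_json_ld(dataset, [researcher]), indent=2))
```

The resulting document is still ordinary JSON, so existing consumers keep working, while the added `@context` and `@type` keys let linked-data tools interpret the same record.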
The other thing is, once we build the Research Graph there are a lot
of interesting analysis that we can do. So we are currently exploring
the new ways of analysing the information in the Research Graph and
trying to pick up the good stories about what Research Graph can tell
us. The other thing is, because we are the national data repository we
actually encourage people to do the cross-disciplinary research based
on our high-performance platform. If we can demonstrate the value of
[cross-system] and disciplinary research, by showing that when
different type of dataset available on the same platform, more
research, more publication and more funding was granted, it will be
quite good to demonstrate the impact of our data management
practice.
So in summary, I think Research Graph really means a couple of
things for a different group of user. For example, for a user itself of the
data repository, they can understand the dynamic research integration
through these analytics. I remember when some researcher submit an
ARC grant, they sometimes show their publication citation along the
year being increasingly better and better. But with the Research
Graph they can actually show more information, not just publication
but also their contribution of the dataset and their award on other
additional funding using the Research Graph.
For the higher-level executive and board, as a data repository we can
demonstrate the value of our good data management practice and
provide the interoperability of the data services through these more
advanced services. We also advance the science research by having
more publication and more impact in the metrics. Finally, for the funding
body, since they invested a good amount of money for the data
repository, we can demonstrate the impact of the investment on the
data repository by showing the quantitative analysis of the impact
metrics within the research community.
So if you want to learn more about the graph, we have the GitHub
source code and we also have the interactive demo of the graph, and
there is Twitter also if you wanted to socialise it. I think that's it.
Facilitator: Okay, thanks Jingbo. I'd like to thank Ben and Jingbo for giving this
talk and thank you, everyone, for attending the webinar. Thank you.
END OF TRANSCRIPT