Optimising the use of existing knowledge

Defragmentation:
Maximising the Use of
Existing Knowledge
Jan Velterop — APE 2015 — Berlin 21 January 2015

Open Access is a means
to reach the goal

Maximal usefulness of existing
scientific research results in order to
achieve:
efficient, fast, and effective new
knowledge creation and discovery
i.e. highest
possible return on
public investment

optimal dissemination…
…of knowledge

The ultimate goal, to which
Open Access is merely a
means, may not be widely
understood – by publishers

That may be why there are
a lot of different
interpretations of what
Open Access actually is
(in spite of the clear definition given in the
Budapest Open Access Initiative)

The fact that not all published
research is accessible to all
researchers leads to ‘lamp
post research’

Lamp post research:
Looking merely at the literature that
one can access – which is not
necessarily the literature that is
potentially important to one’s research

[Chart: number of abstracts in PubMed, per year and cumulative; labelled value 11,135,542]
…averaging more than 2 abstracts added every minute in 2014…
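As a rough check of scale: two abstracts per minute, sustained over a 365-day year, already implies more than a million new abstracts. The short calculation below uses only the "more than 2 per minute" figure from the slide. The exact 2014 total is not given here, so the result is a lower bound implied by that rate, not a reported count.

```python
# Lower bound implied by "more than 2 abstracts added every minute".
MINUTES_PER_YEAR = 365 * 24 * 60   # 2014 was not a leap year
ABSTRACTS_PER_MINUTE = 2           # the slide says "more than 2"

lower_bound_2014 = ABSTRACTS_PER_MINUTE * MINUTES_PER_YEAR
print(f"{lower_bound_2014:,}")     # 1,051,200
```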

On the impossibility of being expert
341 doi: http://dx.doi.org/10.1136/bmj.c6815 (Published 14 Dece
More scientific and medical papers are being
published now than ever before.
Authors Alan G Fraser and Frank D Dunstan
think that new strategies are needed to deal with
this avalanche of information

How does a researcher
decide what’s ‘relevant’
anyway?

How are we filtering or
choosing?

Possible solutions?
• Publish fewer articles: Don’t be ridiculous!
• Find better ways to decide what’s truly relevant: Now you’re talking!

We need the equivalent of aerial surveys
— ‘knowledge drones’? —
Some of my professors were already known as
‘knowledge drones’ :-)

How might we create overviews?

Getting the picture from a large number of data points
‘Whole-o-gram’

Getting a better picture from even more data points

It’s not just about finding
information
It’s also – and possibly more –
about the value & power of
‘recombinant knowledge’

Saving significant time-to-knowledge
• After analysis in BRAIN: 4 minutes
• Arriving at this conclusion (review in Frontiers Immunology) after reading 221 papers: weeks
5
“Chronic immune activation is the primary driver in HIV pathogenesis”

What stands in the way?
First of all: fragmentation
different…
• publishers
• journals
• platforms
• licences
• formats
• silos
• languages
And also, of course: (lack of) access
Not to the whole article… but to the data
and assertions buried in it

Plenty of initiatives to find stuff:
• PubChase – Open Access Biomedical Journal
Reference Library
• Paperity
• SciLit – Database of Scientific and Scholarly
Literature
• Google Scholar
• Et cetera
Some go further:
• Europe PubMed Central – offering semantic
tools

Not many ‘true’ open access:
[Chart: full-text articles in PubMed Central; Europe PMC, 19 December 2014]
• all full-text articles in PubMed Central: 3,087,430 (100%)
• of which with CC licences: 366,973 (11.9%)
• of which with CC-BY licences: 270,114 (8.7%)
“The majority of articles in PMC are subject to traditional copyright restrictions”
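The percentages quoted for PubMed Central can be re-derived from the three counts on the slide. The sketch below does only that arithmetic. The helper name `share` is introduced here for illustration.

```python
# Counts as shown on the slide (Europe PMC, 19 December 2014).
full_text = 3_087_430   # all full-text articles in PubMed Central
cc_any = 366_973        # of which with any CC licence
cc_by = 270_114         # of which with a CC-BY licence

def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to one decimal."""
    return round(100 * part / whole, 1)

print(share(cc_any, full_text))              # 11.9
print(share(cc_by, full_text))               # 8.7
print(share(full_text - cc_any, full_text))  # 88.1 (no CC licence)
```

On these numbers, close to nine in ten full-text articles carried no CC licence at all, which is what the quoted sentence about "traditional copyright restrictions" is pointing at.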

What we need is information
extracted from as many articles as
possible
The more we have, the ‘sharper’
the knowledge picture

Fragmentation and lack of access are
encumbrances to seamless knowledge-pattern
analyses and themed collection
building (e.g. of graphs)…
…which are fast becoming an absolute
necessity due to the vast amounts of
published material, growing every year,
and, of course, in the aggregate

“As the rate of publishing accelerates,
the need for computational support to
work out which articles to read, and how
to interpret, reproduce and validate the
claims they contain is growing.”
Quote from ‘Lazarus’:
http://www.bbsrc.ac.uk/pa/grants/AwardDetails.aspx?FundingReference=BB/L005298/1

Traditional publications are aimed at
consumption by humans;
“stories that persuade with data”*
Not easily amenable to
machine-processing
* Anita de Waard, Elsevier

In the life-science literature, we typically find:
• drug-like molecules represented as illustrations;
• biochemical properties as tables or graphs;
• protein/DNA sequences buried amongst text;
• references and citations with arcane formats;
• other objects of biological interest being given
ambiguous names.
And, horrors like this (from PLOS, h/t Peter Murray-Rust):
+ (plus underscored) isn’t the same as ± (plus-minus)!
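Why this matters to a machine can be shown in a few lines. The exact markup used in the PLOS article is not reproduced here, so the fragment below is an assumption: it supposes the underscored plus was written as `<u>+</u>`, with made-up values. A simple text extractor that drops tags then turns an intended "plus-minus" into a plain sum.

```python
import re
import unicodedata

# Hypothetical fragment: the values and the <u> markup are assumptions
# made for illustration, not taken from the PLOS article.
html = "5.2 <u>+</u> 0.3"

# A naive text extractor that simply drops tags loses the underline.
plain = re.sub(r"<[^>]+>", "", html)
print(plain)  # 5.2 + 0.3

# The two characters are distinct code points with distinct meanings.
for ch in ("+", "±"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+002B PLUS SIGN
# U+00B1 PLUS-MINUS SIGN
```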

This creates the need to:
• re-type figures from tables;
• chase citations through digital
libraries;
• redraw molecules by hand;
• et cetera.
tedious, error-prone, wasteful
scientists should be able to use
their precious time better

Utopia Documents (http://utopiadocs.com)
Via UD, LAZARUS ‘resurrects’ knowledge from being
buried in articles:
• entities (‘concepts’, incl. synonyms, e.g. proteins)
• phrases, statements, assertions (e.g. triples)
• molecules (incl. Markush structure groups)
• graphs
• tables

These are captured – with their provenance, e.g.
DOI – in a ‘Knowledge Graph’ of their relationships
When assertions are captured, they are compared to
the Knowledge Graph and labelled as ‘new’ (to the
Graph) or ‘already found earlier’
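The capture-and-label step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Lazarus is implemented: the class name `KnowledgeGraph`, the method `capture`, and the DOIs are all hypothetical, and the triple is hand-written from the VHL example used in this talk.

```python
from dataclasses import dataclass, field

Triple = tuple[str, str, str]  # (subject, predicate, object)

@dataclass
class KnowledgeGraph:
    # Maps each assertion to the DOIs in which it was found (provenance).
    provenance: dict[Triple, list[str]] = field(default_factory=dict)

    def capture(self, triple: Triple, doi: str) -> str:
        """Store the assertion with its source and return its label."""
        label = "already found earlier" if triple in self.provenance else "new"
        sources = self.provenance.setdefault(triple, [])
        if doi not in sources:
            sources.append(doi)
        return label

graph = KnowledgeGraph()
claim = ("VHL", "binds to", "HIF-α")        # hand-written example triple
print(graph.capture(claim, "10.0000/example.1"))  # new
print(graph.capture(claim, "10.0000/example.2"))  # already found earlier
print(graph.provenance[claim])  # both placeholder DOIs, in capture order
```

Keeping the list of DOIs per assertion means a repeated finding is not discarded: it adds provenance to an assertion the graph already holds.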

“Lazarus to harness the crowd reading life-
science articles to resurrect the swathes of
legacy data buried in charts, tables, diagrams
and free-text, to liberate processable data into a
shared resource that benefits the community.”
“…activities currently carried out anyway by
individuals for their own purposes (annotating,
cross-referencing articles with databases,
organising collections of articles).”

VHL protein binds to HIF-α which is ubiquitinated and tagged for degradation in the proteasome.

These ‘assertions’ form the ‘knowledge
profile’ of an article, and are added to a
growing ‘knowledge graph’ which can
be analysed for trends, clusters, areas
of intensive activity, et cetera.
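A toy version of such an analysis is sketched below. Everything in it is an assumption made for illustration: the DOIs are placeholders, the triples are hand-written (two paraphrase the VHL sentence, one is a dummy), and "activity" is simply the number of articles mentioning an entity. Clusters are taken to be connected components of the entity graph, which is one simple choice among many.

```python
from collections import Counter, defaultdict

# Hypothetical knowledge profiles: placeholder DOIs, hand-written triples.
profiles = {
    "10.0000/example.1": [("VHL", "binds to", "HIF-α"),
                          ("HIF-α", "is tagged for degradation in", "proteasome")],
    "10.0000/example.2": [("VHL", "binds to", "HIF-α")],
    "10.0000/example.3": [("entity A", "relates to", "entity B")],
}

activity = Counter()            # entity -> number of articles mentioning it
neighbours = defaultdict(set)   # entity -> entities it is linked to
for doi, triples in profiles.items():
    entities = set()
    for subj, _pred, obj in triples:
        entities.update((subj, obj))
        neighbours[subj].add(obj)
        neighbours[obj].add(subj)
    activity.update(entities)

def clusters(adjacency):
    """Connected components of the entity graph, via depth-first search."""
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        result.append(component)
    return result

print(activity["VHL"], activity["proteasome"])  # 2 1
print(len(clusters(neighbours)))                # 2
```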

Some other initiatives to bring
the open literature together so
that it can be used for large
scale semantic analyses:

libraccess.org
The goal of Libraccess is to
aggregate, de-duplicate, clean and
index scientific resources in open
access repositories, from
all countries, from all disciplines,
and make them available to all,
through a website and with APIs.

Research Pad
Open Access Journal Reference Library
(www.researchpad.co)

Converting all that’s open (CC-BY) into ePub format
for tablets and smartphones.
What I find most interesting, however, is their plan*
to make the whole body of all literature that’s openly
accessible available in XML for semantic analysis†
* being worked on as we speak, they confirmed to me
† I hope they will add the ‘knowledge profiles’ of paywalled
articles created by Lazarus

• Build collection of favourites
• Read full text
• Share with others
• Inspect metrics

sales@newgen.co technical inquiries: patrick@newgen.co

Thank you
velterop@me.com
