10. Bibliographic info allows you to
● Identify and find an item you know you want,
● Discover related items or items you believe you
want
● Serendipitously discover items you would like
without knowing they might exist
● And so on.
Requires Increasing Investment!
14. One thing I am not saying
must be necessary...
16. But, by not making
bibliographic data open, you
limit the audience.
(You also limit the data quality, but more on that
later.)
17. “Can't I just scrape sites and
reuse the data anyway? It's just facts,
after all...”
18. “Directive 96/9/EC of the European Parliament and
of the Council of 11 March 1996 on the legal
protection of databases”
http://is.gd/gqkqb
19. Databases have in the past been defended using
copyright law.
This directive codifies a new protection based on
“sui generis”* rights: rights earned by the “sweat
of the brow”.
* http://en.wikipedia.org/wiki/Sui_generis
20. So far, no one seems to have any
evidence that this encouraged
database-based economies.
There is evidence that it 'awarded'
unending monopolies on existing
datasets.
21. Due to its fluffy wording, it is a
timebomb:
it is a right, like copyright, that
doesn't need to be defended
and can be assumed for almost
any aggregation.
22. We asked UK PubMedCentral if we could
reproduce the bibliographic data they share through
their OAI-PMH service.
They said “Generally, no”*
(*my paraphrase: they have non-transferable
licenses and contracts, yada yada. Their 'OA subset' of
1876 journals is available, however, mainly BMC.)
23. From OAI-PMH specification:
* Data Providers administer systems that support
the OAI-PMH as a means of exposing metadata; and
* Service Providers use metadata harvested via the
OAI-PMH as a basis for building value-added
services.
http://www.openarchives.org/OAI/openarchivesprotocol.html
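To make the Data Provider / Service Provider split concrete, here is a minimal sketch of the Service Provider side, using only the Python standard library. The endpoint URL is hypothetical, and the sample response is hand-written in the shape the spec defines; real harvesters would fetch pages over HTTP and follow `resumptionToken`s.

```python
# Minimal OAI-PMH harvesting sketch (Service Provider side).
# Endpoint and sample data are illustrative, not a real service.
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL as defined by the protocol."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # Subsequent pages carry only the token, per the spec.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)

def parse_titles(xml_text):
    """Pull dc:title values out of a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.iter(DC_NS + "title")]

# A tiny hand-written response, simplified from the oai_dc shape:
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><metadata>
      <dc xmlns="http://purl.org/dc/elements/1.1/">
        <title>An Example Article</title>
      </dc>
    </metadata></record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://example.org/oai"))
print(parse_titles(SAMPLE))
```

The value-added service then lives entirely in what you do with the harvested metadata, which is exactly why licensing of that metadata matters.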
24. “… Service Providers use metadata
harvested via the OAI-PMH as a basis
for building value-added
services.”
And the survey said...
28. 2 -Use a recognized waiver or
license that is appropriate for
metadata.
29. 3 - If you want your data to be
effectively used and added to
by others it should be open …
– in particular non-commercial
and other restrictive clauses
should not be used.
30. 4 - We strongly recommend
explicitly placing bibliographic
data in the Public Domain via
PDDL or CC0.
31. 5 – We strongly urge creators
of bibliographic metadata to
explicitly either dedicate it to
the public domain or use an
open licence.
38. #2 Data has been entered
and curated without large-
scale sharing as a focus.
Lots of implicit, contextual
info left out.
39. #3 Data quality is typically
poor with formally closed
datasets.
41. For #1 - Collisions caused by
interpretation can really only be
solved by sharing data and seeing
how bad things are.
42. Standards and interoperability:
“The first follower transforms a
lone nut into a leader” -
Derek Sivers' TED Talk
http://www.ted.com/talks/lang/eng/derek_sivers_how_to_start_a_movement.html
43. Video:
http://www.youtube.com/watch?v=GA8z7f7a2Pk
The man dancing is joined by one or two, but he is
still doing his own thing.
Eventually a group decides to join him, and the
group grows.
The quality of the dance isn't important, but the
community dancing along with it is.
And so it is with standards.
44. For #2 (implicit info), provenance
and the source of data gives us
crucial clues.
Due to #1, I remain unconvinced
that this information can ever be
totally machine-readable.
45. And for #3, misleading or incorrect
data...
… um.
No easy answers – we just don't have
the info.
46. The data clean-up process is going to be
probabilistic.
(We cannot be sure – by definition - that we are
'accurate' when we de-duplicate or disambiguate.)
47. Typical methods then:
Natural Language Processing,
Machine learning techniques
and
String Metrics and old skool record deduplication
48. I <3 String Metrics and old
skool record deduplication
(out of the 3)
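As a taste of the string-metric approach, here is a sketch using only `difflib.SequenceMatcher` from the standard library; real pipelines usually prefer metrics like Jaro-Winkler or Levenshtein, and the records below are made up for illustration.

```python
# String-metric deduplication sketch using the standard library.
# Records and threshold are illustrative, not tuned values.
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] after crude normalisation."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

records = [
    "Smith, J. (2009) Open Bibliographic Data",
    "SMITH, J (2009). Open bibliographic data",
    "Jones, A. (2010) Database Rights in the EU",
]

# Compare every pair; flag likely duplicates above a threshold.
THRESHOLD = 0.85
pairs = [(i, j)
         for i in range(len(records))
         for j in range(i + 1, len(records))
         if similarity(records[i], records[j]) >= THRESHOLD]
print(pairs)
```

The pairwise loop is O(n²), which is why real systems add a blocking step before comparison, as the next slides suggest.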
51. Old skool record linkage:
“Fellegi-Sunter” - probabilistic record
linkage (PRL).
It's not a great model, but it's achievable.
Machine learning requires a reasonably large
gold-standard set.
(http://en.wikipedia.org/wiki/Record_linkage)
52. PRL is not great in itself, BUT
It does lend itself to Map-Reduce style operations
And
It's a great way to filter down to those records that
really do need to be compared by eye.
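A toy sketch of the Fellegi-Sunter idea: each field comparison contributes a log-odds weight, log2(m/u) on agreement and log2((1-m)/(1-u)) on disagreement, and the summed score ranks pairs for human review. The m- and u-probabilities and records below are made-up illustrative values, not estimates from data.

```python
# Toy Fellegi-Sunter scorer. m = P(fields agree | same entity),
# u = P(fields agree | different entities). Values are illustrative.
import math

FIELDS = [
    ("title", 0.95, 0.01),
    ("year", 0.90, 0.10),
    ("first_author", 0.90, 0.05),
]

def score(rec_a, rec_b):
    """Sum log-odds weights over field agreements/disagreements."""
    total = 0.0
    for field, m, u in FIELDS:
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)       # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total

a = {"title": "open bibliographic data", "year": "2009", "first_author": "smith"}
b = {"title": "open bibliographic data", "year": "2009", "first_author": "smyth"}
c = {"title": "database rights", "year": "2010", "first_author": "jones"}

print(round(score(a, b), 2))  # mostly agrees: positive score
print(round(score(a, c), 2))  # nothing agrees: strongly negative
```

Because each pair is scored independently, the scoring step parallelises naturally, which is what makes PRL friendly to Map-Reduce style operations.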
53. http://datamining.anu.edu.au/projects/linkage.html
“Record or data linkage techniques are used to link
together records which relate to the same entity (e.g.
patient, customer, household) in one or more data
sets where a unique identifier for each entity is not
available in all or any of the data sets to be linked.”
ANU's Febrl Python code
54. So far, much effort has been directed at the Works;
We need to put much more effort into their
Networks.
Bibliographic directions
57. Networks?
● A cites B
● Works by a given (identified) Author
● Works cited by a given Author
● Works citing articles that have since been disproved,
retracted or withdrawn.
● Co-authors
● And many more connections we've not even
considered yet ('betweenness', 'centrality', etc)
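Several of the network questions above reduce to simple graph inversions over the records. A sketch with plain dicts, on made-up records; real work would reach for a graph library to get measures like betweenness and centrality.

```python
# Bibliographic network sketch over hypothetical records.
from collections import defaultdict
from itertools import combinations

papers = {
    "A": {"authors": ["smith", "jones"], "cites": ["B", "C"]},
    "B": {"authors": ["jones"], "cites": ["C"]},
    "C": {"authors": ["patel"], "cites": []},
}

# "A cites B": invert the citation lists to answer "what cites X?"
# (in-degree here is a crude stand-in for citation centrality).
cited_by = defaultdict(list)
for pid, p in papers.items():
    for target in p["cites"]:
        cited_by[target].append(pid)

# Co-author pairs across the corpus.
coauthors = set()
for p in papers.values():
    coauthors.update(combinations(sorted(p["authors"]), 2))

print(dict(cited_by))
print(coauthors)
```

"Works by a given author" and "works cited by a given author" fall out of the same two indexes, which is the point: once the data is open and linked, each new question is a cheap traversal rather than a new dataset.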
58. In Summary,
● Accessible Bibliography as Advertising.
● Bibliography authors choose how they wish to invest to gain usage
and real impact.
● Closed data has a much slimmer chance of increasing in quality
● Open data makes it easier to find problems and to improve the data
● Benefits will come from developing networks of information
● Don't get hung up on standards! A lone nut with followers doing
something copyable is enough!