10. Bibliographic info allows you to
● Identify and find an item you know you want,
● Discover related items or items you believe you
want
● Serendipitously discover items you would like
without knowing they might exist
● And so on.
Requires Increasing Investment!
14. One thing I am not saying
must be necessary...
16. But, by not making
bibliographic data open, you
limit the audience.
(You also limit the data quality, but more on that
later.)
17. “Can't I just scrape sites and
reuse the data anyway? It's just facts,
after all...”
18. “Directive 96/9/EC of the European Parliament and
of the Council of 11 March 1996 on the legal
protection of databases”
http://is.gd/gqkqb
19. Databases have in the past been defended using
copyright law.
This directive codifies a new protection based on
“sui generis”* rights: rights earned by the “sweat
of the brow”.
* http://en.wikipedia.org/wiki/Sui_generis
20. So far, no one seems to have any
evidence that this encouraged
database-based economies.
There is evidence that it 'awarded'
unending monopolies on existing
datasets.
21. Due to its fluffy wording, it is a
timebomb:
it is a right, like copyright, that
doesn't need to be defended
and can be assumed for almost
any aggregation.
22. We asked UK PubMedCentral if we could
reproduce the bibliographic data they share through
their OAI-PMH service.
They said “Generally, no”*
(*my paraphrase: they have non-transferable
licenses and contracts, yada yada. Their 'OA subset' of
1876 journals is available, however, mainly BMC.)
23. From OAI-PMH specification:
* Data Providers administer systems that support
the OAI-PMH as a means of exposing metadata; and
* Service Providers use metadata harvested via the
OAI-PMH as a basis for building value-added
services.
http://www.openarchives.org/OAI/openarchivesprotocol.html
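To make the Data Provider / Service Provider split concrete, here is a minimal sketch of the Service Provider side, using only the Python standard library. The endpoint URL is hypothetical, and the sample response is hand-written in the shape the spec defines; real harvesters would fetch pages over HTTP and follow `resumptionToken`s.

```python
# Minimal OAI-PMH harvesting sketch (Service Provider side).
# Endpoint and sample data are illustrative, not a real service.
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL as defined by the protocol."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # Subsequent pages carry only the token, per the spec.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)

def parse_titles(xml_text):
    """Pull dc:title values out of a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.iter(DC_NS + "title")]

# A tiny hand-written response, simplified from the oai_dc shape:
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><metadata>
      <dc xmlns="http://purl.org/dc/elements/1.1/">
        <title>An Example Article</title>
      </dc>
    </metadata></record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://example.org/oai"))
print(parse_titles(SAMPLE))
```

The value-added service then lives entirely in what you do with the harvested metadata, which is exactly why licensing of that metadata matters.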
24. “… Service Providers use metadata
harvested via the OAI-PMH as a basis
for building value-added
services.”
And the survey said...
28. 2 -Use a recognized waiver or
license that is appropriate for
metadata.
29. 3 - If you want your data to be
effectively used and added to
by others it should be open …
– in particular non-commercial
and other restrictive clauses
should not be used.
30. 4 - We strongly recommend
explicitly placing bibliographic
data in the Public Domain via
PDDL or CC0.
31. 5 – We strongly urge creators
of bibliographic metadata to
explicitly either dedicate it to
the public domain or use an
open licence.
38. #2 Data has been entered
and curated without large-
scale sharing as a focus.
Lots of implicit, contextual
info left out.
39. #3 Data quality is typically
poor with formally closed
datasets.
41. For #1 - Collisions caused by
interpretation can really only be
solved by sharing data and seeing
how bad things are.
42. Standards and interoperability:
“The first follower transforms a
lone nut into a leader” -
Derek Sivers' TED Talk
http://www.ted.com/talks/lang/eng/derek_sivers_how_to_start_a_movement.html
43. Video:
http://www.youtube.com/watch?v=GA8z7f7a2Pk
The man dancing is joined by one or two, but he is
still doing his own thing.
Eventually a group decides to join him, and the
group grows.
The quality of the dance isn't important, but the
community dancing along with it is.
And so it is with standards.
44. For #2 (implicit info), provenance
and the source of data gives us
crucial clues.
Due to #1, I remain unconvinced
that this information can ever be
totally machine-readable.
45. And for #3, misleading or incorrect
data...
… um.
No easy answers – we just don't have
the info.
46. The data clean-up process is going to be
probabilistic.
(We cannot be sure – by definition - that we are
'accurate' when we de-duplicate or disambiguate.)
47. Typical methods then:
Natural Language Processing,
Machine learning techniques
and
String Metrics and old skool record deduplication
48. I <3 String Metrics and old
skool record deduplication
(out of the 3)
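As a taste of the string-metric approach, here is a sketch using only `difflib.SequenceMatcher` from the standard library; real pipelines usually prefer metrics like Jaro-Winkler or Levenshtein, and the records below are made up for illustration.

```python
# String-metric deduplication sketch using the standard library.
# Records and threshold are illustrative, not tuned values.
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] after crude normalisation."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

records = [
    "Smith, J. (2009) Open Bibliographic Data",
    "SMITH, J (2009). Open bibliographic data",
    "Jones, A. (2010) Database Rights in the EU",
]

# Compare every pair; flag likely duplicates above a threshold.
THRESHOLD = 0.85
pairs = [(i, j)
         for i in range(len(records))
         for j in range(i + 1, len(records))
         if similarity(records[i], records[j]) >= THRESHOLD]
print(pairs)
```

The pairwise loop is O(n²), which is why real systems add a blocking step before comparison, as the next slides suggest.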
51. Old skool record linkage:
“Fellegi-Sunter” - probabilistic record
linkage (PRL).
It's not a great model, but it's achievable.
Machine learning requires a reasonably large
gold-standard set.
(http://en.wikipedia.org/wiki/Record_linkage)
52. PRL is not great in itself, BUT
It does lend itself to Map-Reduce style operations
And
It's a great way to filter down to those records that
really do need to be compared by eye.
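A toy sketch of the Fellegi-Sunter idea: each field comparison contributes a log-odds weight, log2(m/u) on agreement and log2((1-m)/(1-u)) on disagreement, and the summed score ranks pairs for human review. The m- and u-probabilities and records below are made-up illustrative values, not estimates from data.

```python
# Toy Fellegi-Sunter scorer. m = P(fields agree | same entity),
# u = P(fields agree | different entities). Values are illustrative.
import math

FIELDS = [
    ("title", 0.95, 0.01),
    ("year", 0.90, 0.10),
    ("first_author", 0.90, 0.05),
]

def score(rec_a, rec_b):
    """Sum log-odds weights over field agreements/disagreements."""
    total = 0.0
    for field, m, u in FIELDS:
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)       # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total

a = {"title": "open bibliographic data", "year": "2009", "first_author": "smith"}
b = {"title": "open bibliographic data", "year": "2009", "first_author": "smyth"}
c = {"title": "database rights", "year": "2010", "first_author": "jones"}

print(round(score(a, b), 2))  # mostly agrees: positive score
print(round(score(a, c), 2))  # nothing agrees: strongly negative
```

Because each pair is scored independently, the scoring step parallelises naturally, which is what makes PRL friendly to Map-Reduce style operations.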
53. http://datamining.anu.edu.au/projects/linkage.html
“Record or data linkage techniques are used to link
together records which relate to the same entity (e.g.
patient, customer, household) in one or more data
sets where a unique identifier for each entity is not
available in all or any of the data sets to be linked.”
ANU's Febrl Python code
54. So far, much effort has been directed at the Works;
We need to put much more effort into their
Networks.
Bibliographic directions
57. Networks?
● A cites B
● Works by a given (identified) Author
● Works cited by a given Author
● Works citing articles that have since been disproved,
retracted or withdrawn.
● Co-authors
● And many more connections we've not even
considered yet ('betweenness', 'centrality', etc)
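Several of the network questions above reduce to simple graph inversions over the records. A sketch with plain dicts, on made-up records; real work would reach for a graph library to get measures like betweenness and centrality.

```python
# Bibliographic network sketch over hypothetical records.
from collections import defaultdict
from itertools import combinations

papers = {
    "A": {"authors": ["smith", "jones"], "cites": ["B", "C"]},
    "B": {"authors": ["jones"], "cites": ["C"]},
    "C": {"authors": ["patel"], "cites": []},
}

# "A cites B": invert the citation lists to answer "what cites X?"
# (in-degree here is a crude stand-in for citation centrality).
cited_by = defaultdict(list)
for pid, p in papers.items():
    for target in p["cites"]:
        cited_by[target].append(pid)

# Co-author pairs across the corpus.
coauthors = set()
for p in papers.values():
    coauthors.update(combinations(sorted(p["authors"]), 2))

print(dict(cited_by))
print(coauthors)
```

"Works by a given author" and "works cited by a given author" fall out of the same two indexes, which is the point: once the data is open and linked, each new question is a cheap traversal rather than a new dataset.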
58. In Summary,
● Accessible Bibliography as Advertising.
● Bibliography authors choose how they wish to invest to gain usage
and real impact.
● Closed data has a much slimmer chance of increasing in quality
● Open data makes it easier to find problems and to improve the data
● Benefits will come from developing networks of information
● Don't get hung up on standards! A lone nut with followers doing
something copyable is enough!