Comparing Published Scientific
Journal Articles
to Their Pre-print Versions
Martin Klein Peter Broadwell
@mart1nkle1n @peterbroadwell
with Sharon E. Farb and Todd Grappone
@farbthink, @liber8er
{martinklein,broadwell,farb,grappone}@library.ucla.edu
University of California Los Angeles
Comparing Published Scientific Journal Articles
to Their Pre-print Versions
@mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016
Scientific Output in Numbers
Global STM publishing market > $25 billion
• 55% of this from USA
• 28% from Europe, Middle East
• Journals core part of scholarly communication process
• English language journal revenue: ~ $10 billion
• ~ 70% of that out of libraries’ budget
• > 28k scholarly peer-reviewed journals (+3.5% p.a.)
• ~ 2.5 million articles per year (+3% p.a.)
• 21% of research papers from USA
“STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015
University of California Publication Impact
“Research Performance of the UC System,” Elsevier, March 2015
Open Access by Disciplines
“Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. 2010
http://dx.doi.org/10.1371/journal.pone.0011273
Open Access Rate Overall
2010
“Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al.
(http://dx.doi.org/10.1371/journal.pone.0011273)
→ 20.4% OA rate
2015
“Open Access and Sources of Full-Text Articles in Google Scholar in Different
Subject Fields”, Jamali and Nabavi
(http://dx.doi.org/10.1007/s11192-015-1642-2)
→ 61.1% OA rate
Pre-print v. Final Published
arXiv.org
• Average annual operating cost for 2013-2017: $826,000
Final Published
• English language STM journals: $10 billion in 2013
http://arxiv.org/help/support/faq#3D
“STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015
Role of Publisher
• Entrepreneur
• Copyediting
• Tagging
• Marketer
• Distributor
• E-Host
Value of Publisher
“Once you’ve gone through the peer review process, if you look
at the article that is actually published in a journal, it looks
radically different [to the one submitted due to] that process of
transformation, the copy-editing, the database linking, the data
visualisation tools, making sure that the metadata for the article
is all right, so when people come to [Elsevier database]
ScienceDirect or type a search into Google, they can actually
find what they are looking for on their platforms.”
Gemma Hersh
http://www.thebookseller.com/news/elsevier-defends-its-value-after-open-access-disputes-328037
Working Assumptions
1. If the publishers’ argument is valid, the text of a pre-print paper should vary significantly from its corresponding post-print version.
2. By applying standard similarity measures, we should be able to detect and quantify such differences.
Assembling a pre-print corpus
Source: arXiv.org
• 1.1 million publication records
• Metadata (typical DC, including DOI) obtained
via OAI-PMH interface
• PDF versions of articles available via Amazon’s
S3 service (using “requester pays” option)
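A minimal sketch of what the metadata harvesting step could look like. The base URL, the `oai_dc` metadata prefix, the `doi:` prefix inside `dc:identifier`, and the sample record (including its arXiv identifier) are assumptions made for illustration; the slides only state that Dublin Core metadata was obtained via OAI-PMH.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint; check arXiv's current documentation before use.
BASE_URL = "http://export.arxiv.org/oai2"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return BASE_URL + "?" + urlencode(params)


def extract_dois(xml_text):
    """Return DOIs found in dc:identifier elements (assumes a 'doi:' prefix)."""
    root = ET.fromstring(xml_text)
    dois = []
    for ident in root.iter(DC_NS + "identifier"):
        text = (ident.text or "").strip()
        if text.lower().startswith("doi:"):
            dois.append(text[4:])
    return dois


# Hypothetical record, shaped like a Dublin Core payload.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A hypothetical record</dc:title>
  <dc:identifier>http://arxiv.org/abs/0000.00000</dc:identifier>
  <dc:identifier>doi:10.1016/j.physletb.2006.10.068</dc:identifier>
</record>"""

print(list_records_url())
print(extract_dois(SAMPLE))
```

A real harvest would fetch each URL, follow resumption tokens until none is returned, and throttle its requests.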
Finding a matching post-print corpus
1. Extract DOIs from arXiv metadata
• 44.5% of articles have a DOI
2. CrossRef’s Metadata Search API
• Match by DOI
• Download article & metadata in XML/PDF
→ Results in:
• 11,017 full text articles
• Majority published by Elsevier between 2003 and
2015
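The DOI matching step can be sketched as follows. The record structure (dicts with an optional `doi` key) is hypothetical, and the URL pattern is the public CrossRef REST API, used here as a stand-in that may differ from the Metadata Search API named on the slide.

```python
from urllib.parse import quote

# Common DOI prefixes to strip before comparing identifiers.
PREFIXES = ("https://doi.org/", "http://dx.doi.org/", "http://doi.org/", "doi:")


def normalize_doi(raw):
    """Strip resolver prefixes and lowercase (DOIs are case-insensitive)."""
    doi = raw.strip()
    for prefix in PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.lower()


def crossref_url(doi):
    """Build a lookup URL for one DOI (assumed REST API pattern)."""
    return "https://api.crossref.org/works/" + quote(normalize_doi(doi), safe="/")


def doi_coverage(records):
    """Fraction of records that carry a non-empty DOI."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("doi")) / len(records)


# Hypothetical records for illustration only.
records = [{"doi": "10.1/a"}, {"doi": None}, {}, {"doi": "10.1/b"}]
print(doi_coverage(records))
print(crossref_url("doi:10.1016/J.PhysLetB.2006.10.068"))
```

Normalizing before lookup avoids missing a match only because one source stores the DOI as a URL and the other as a bare identifier.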
Text Comparison Methods
1. Length ratio
2. Levenshtein ratio
3. Cosine similarity
4. Jaccard coefficient
5. Sorensen similarity
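For orientation, the five measures can be sketched in plain Python. The normalizations chosen here (length ratio as shorter over longer, Levenshtein ratio as one minus distance over the longer length, whitespace tokenization for the set- and vector-based measures) are assumptions; the slides do not give the exact definitions used in the study.

```python
import math
from collections import Counter


def length_ratio(a, b):
    """Shorter length divided by longer length (1.0 = same length)."""
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))


def levenshtein_distance(a, b):
    """Character-level edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_ratio(a, b):
    """1.0 for identical strings, 0.0 when every character differs."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def tokens(text):
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine of the angle between term-frequency vectors."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Shared distinct tokens over all distinct tokens."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


def sorensen(a, b):
    """Twice the shared distinct tokens over the sum of both set sizes."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if (sa or sb) else 1.0


print(levenshtein_distance("kitten", "sitting"))
print(jaccard("a b c", "b c d"), sorensen("a b c", "b c d"))
```

All five return values in [0, 1] with 1 meaning most similar, which matches the similarity axis used in the comparison plots.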
Comparison of Sections
“Analyzing News Events in Non-Traditional Digital Library Collections”, M. Klein and P. Broadwell, 2015
http://dx.doi.org/10.1145/2756406.2756948
Title Comparison
Explore our findings at http://sologlo.library.ucla.edu/prepost
[Figure: histogram of title similarity (1 = most similar) in ten bins from 1 ... 0.9 down to 0.1 ... 0; axes: number of papers and % of all papers; series: Length, Levenshtein, Cosine, Sorensen, Jaccard]
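A small sketch of how similarity scores could be grouped into the ten bins shown on the plot axis. The treatment of boundary values (a score of exactly 0.9 goes to the "0.9 ... 0.8" bin) and the sample scores are assumptions for illustration.

```python
def bin_index(score, n_bins=10):
    """Map a similarity in [0, 1] to a bin; bin 0 is the most similar one."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # Round before truncating to absorb floating-point noise at bin edges,
    # then clamp so that a score of 0.0 lands in the last bin.
    return min(int(round((1.0 - score) * n_bins, 9)), n_bins - 1)


def histogram(scores, n_bins=10):
    """Return per-bin paper counts and per-bin percentages of all papers."""
    counts = [0] * n_bins
    for score in scores:
        counts[bin_index(score, n_bins)] += 1
    total = len(scores)
    percentages = [100.0 * c / total if total else 0.0 for c in counts]
    return counts, percentages


# Hypothetical scores, not taken from the study.
counts, percentages = histogram([1.0, 0.95, 0.85, 0.05])
print(counts)
print(percentages)
```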
Abstract Comparison
[Figure: histogram of abstract similarity (1 = most similar) in ten bins from 1 ... 0.9 down to 0.1 ... 0; axes: number of papers and % of all papers; series: Length, Levenshtein, Cosine, Sorensen, Jaccard]
Explore our findings at http://sologlo.library.ucla.edu/prepost
10.1016/j.physletb.2006.10.068
Physics Letters B
Body Comparison
[Figure: histogram of body similarity (1 = most similar) in ten bins from 1 ... 0.9 down to 0.1 ... 0; axes: number of papers and % of all papers; series: Length, Levenshtein, Cosine, Sorensen, Jaccard]
Explore our findings at http://sologlo.library.ucla.edu/prepost
Publication Dates
[Figure: number of papers by number of days between pre-print and final published version, in 90-day bins from 1−90 to >720; series: pre-print first, final published first]
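The grouping behind such a plot can be sketched as below, assuming each matched pair carries one pre-print date and one publication date. The function name, the same-day category, and the example dates are illustrative and not taken from the slides.

```python
from datetime import date

BIN_WIDTH = 90   # days per bin, as on the plot axis
LAST_EDGE = 720  # anything beyond this goes into the ">720" bin


def date_bin(preprint, published):
    """Return (which version came first, day-range label) for one paper."""
    delta = (published - preprint).days
    if delta == 0:
        # Not shown on the slide; kept separate here as an assumption.
        return ("same day", "0")
    first = "pre-print first" if delta > 0 else "final published first"
    days = abs(delta)
    if days > LAST_EDGE:
        return (first, ">720")
    lower = ((days - 1) // BIN_WIDTH) * BIN_WIDTH + 1
    return (first, f"{lower}-{lower + BIN_WIDTH - 1}")


# Hypothetical dates for illustration only.
print(date_bin(date(2006, 1, 1), date(2006, 4, 2)))
print(date_bin(date(2010, 1, 1), date(2009, 1, 1)))
```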
Assembling a pre-print corpus
Source: arXiv.org
• 1.1 million publication records
• metadata (typical DC, including DOI) obtained
via OAI-PMH interface
• PDF versions of articles available via Amazon’s
S3 service (using “requester pays” option)
• *Latest version used if multiple available*
• 35% of all arXiv papers have > 1 version
• 58% of our matched papers have > 1 version
• Repeat experiment with *earliest version*
Publication Dates of Earliest Versions
[Figure: number of papers by number of days between earliest pre-print version and final published version, in 90-day bins from 1−90 to >720; series: pre-print first, final published first]
Title Deltas
[Figure: change in the number of title papers per similarity bin (1 ... 0.9 down to 0.1 ... 0); axes: number of papers and % of all papers; series: Length, Levenshtein, Cosine, Sorensen, Jaccard]
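The slides do not define how the deltas are computed. One plausible reading, sketched here as an assumption, is the per-bin difference in paper counts between the run on earliest pre-print versions and the run on latest versions. The counts below are made up for illustration.

```python
def bin_deltas(latest_counts, earliest_counts):
    """Per-bin change in paper counts: earliest-version run minus latest-version run."""
    if len(latest_counts) != len(earliest_counts):
        raise ValueError("both histograms must use the same bins")
    return [e - l for l, e in zip(latest_counts, earliest_counts)]


# Hypothetical counts for the three most similar bins.
latest = [9000, 500, 100]
earliest = [8200, 1000, 400]
print(bin_deltas(latest, earliest))
```

Under this reading, a negative value in the most similar bin means fewer papers score near 1 when the earliest version is compared instead of the latest.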
Abstract Deltas
[Figure: change in the number of abstract papers per similarity bin (1 ... 0.9 down to 0.1 ... 0); axes: number of papers and % of all papers; series: Length, Levenshtein, Cosine, Sorensen, Jaccard]
Body Deltas
[Figure: change in the number of body papers per similarity bin (1 ... 0.9 down to 0.1 ... 0); axes: number of papers and % of all papers; series: Length, Levenshtein, Cosine, Sorensen, Jaccard]
Discussion & Future Work
• Single corpus experiment
• Pre-print/final published matches based on:
• DOIs
• CrossRef API results
• UCLA serial subscriptions (majority Elsevier publications)
• Expand to other disciplines/publishers
• Overlay with ISI Impact factor and usage statistics
• Refine extraction/comparison of authors and references
• Operate at scale

Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 

Comparing Published Scientific Journal Articles to Their Pre-print Versions

  • 1. Comparing Published Scientific Journal Articles to Their Pre-print Versions Martin Klein Peter Broadwell @mart1nkle1n @peterbroadwell with Sharon E. Farb and Todd Grappone @farbthink, @liber8er {martinklein,broadwell,farb,grappone}@library.ucla.edu University of California Los Angeles
  • 2. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 2 Scientific Output in Numbers Global STM publishing market > $25 billion • 55% of this from USA • 28% from Europe, Middle East • Journals core part of scholarly communication process • English language journal revenue: ~ $10 billion • ~ 70% of that out of libraries’ budget • > 28k scholarly peer-reviewed journals (+3.5% p.a.) • ~ 2.5 million articles per year (+3% p.a.) • 21% of research papers from USA “STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015
  • 3. University of California Publication Impact “Research Performance of the UC System,” Elsevier, March 2015
  • 4. Open Access by Disciplines “Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. 2010 http://dx.doi.org/10.1371/journal.pone.0011273
  • 5. Open Access Rate Overall 2010: “Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. (http://dx.doi.org/10.1371/journal.pone.0011273)
  • 6. Open Access Rate Overall 2010: “Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. (http://dx.doi.org/10.1371/journal.pone.0011273) → 20.4% OA rate
  • 7. Open Access Rate Overall 2010: “Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. (http://dx.doi.org/10.1371/journal.pone.0011273) → 20.4% OA rate 2015: “Open Access and Sources of Full-Text Articles in Google Scholar in Different Subject Fields”, Hammid et al. (http://dx.doi.org/10.1007/s11192-015-1642-2)
  • 8. Open Access Rate Overall 2010: “Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. (http://dx.doi.org/10.1371/journal.pone.0011273) → 20.4% OA rate 2015: “Open Access and Sources of Full-Text Articles in Google Scholar in Different Subject Fields”, Hammid et al. (http://dx.doi.org/10.1007/s11192-015-1642-2) → 61.1% OA rate
  • 9. Pre-print v. Final Published arXiv.org • Average annual operating cost for 2013-2017: $826,000 Final Published • English language STM journals: $10 billion in 2013 http://arxiv.org/help/support/faq#3D “STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015
  • 10. Role of Publisher • Entrepreneur • Copyediting • Tagging • Marketer • Distributor • E-Host
  • 11. Value of Publisher “Once you’ve gone through the peer review process, if you look at the article that is actually published in a journal, it looks radically different [to the one submitted due to] that process of transformation, the copy-editing, the database linking, the data visualisation tools, making sure that the metadata for the article is all right, so when people come to [Elsevier database] ScienceDirect or type a search into Google, they can actually find what they are looking for on their platforms.” Gemma Hersh http://www.thebookseller.com/news/elsevier-defends-its-value-after-open-access-disputes-328037
  • 12. Working Assumptions 1. If the publishers’ argument is valid, the text of a pre-print paper should vary significantly from its corresponding post-print version. 2. By applying standard similarity measures, we should be able to detect and quantify such differences.
  • 13. Assembling a pre-print corpus Source: arXiv.org • 1.1 million publication records • Metadata (typical DC, including DOI) obtained via OAI-PMH interface • PDF versions of articles available via Amazon’s S3 service (using “requester pays” option)
  • 14. Finding a matching post-print corpus 1. Extract DOIs from arXiv metadata • 44.5% of articles have DOI 2. CrossRef’s Metadata Search API • Match by DOI • Download article & metadata in XML/PDF → Results in: • 11,017 full text articles • Majority published by Elsevier between 2003 and 2015
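The DOI-matching step on slide 14 can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes each arXiv record arrives as a Dublin Core XML fragment, scans every `dc:` element for something that looks like a DOI, and builds a lookup URL for the public Crossref REST API (`api.crossref.org/works/`), which may differ from the "Metadata Search API" named on the slide. The sample record, the function names, and the DOI pattern are all hypothetical; the DOI itself is the one shown on slide 21.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import quote

# Dublin Core element namespace as used in oai_dc records.
DC_NS = "{http://purl.org/dc/elements/1.1/}"

# Loose DOI pattern (an assumption): "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")

# Hypothetical record; real arXiv records may carry the DOI in another field.
SAMPLE_RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Hypothetical example record</dc:title>
  <dc:identifier>http://arxiv.org/abs/0000.00000</dc:identifier>
  <dc:identifier>doi:10.1016/j.physletb.2006.10.068</dc:identifier>
</oai_dc:dc>"""


def extract_doi(record_xml: str) -> Optional[str]:
    """Return the first DOI-like string found in any Dublin Core element."""
    root = ET.fromstring(record_xml)
    for element in root.iter():
        if element.tag.startswith(DC_NS) and element.text:
            match = DOI_PATTERN.search(element.text)
            if match:
                # Drop punctuation that sometimes trails a DOI in free text.
                return match.group(0).rstrip(".,;")
    return None


def crossref_lookup_url(doi: str) -> str:
    """Build a Crossref REST API URL for a DOI (no request is sent here)."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


if __name__ == "__main__":
    doi = extract_doi(SAMPLE_RECORD)
    if doi is not None:
        print(doi, "->", crossref_lookup_url(doi))
```

Records without a DOI return `None`, which corresponds to the share of arXiv articles that cannot be matched this way.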
  • 15. Text Comparison Methods 1. Length ratio 2. Levenshtein ratio 3. Cosine similarity 4. Jaccard coefficient 5. Sorensen similarity
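The five measures listed on slide 15 can be illustrated with a small sketch. The slides do not say how the texts were tokenised or how each score was normalised, so everything below is an assumption: lengths and edit distance are computed on characters, the other three measures on lower-cased whitespace-separated tokens, and every score is scaled so that 1 means most similar.

```python
from collections import Counter
from math import sqrt


def length_ratio(a: str, b: str) -> float:
    """Shorter length divided by longer length (1.0 means equal length)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else min(len(a), len(b)) / longest


def levenshtein_distance(a: str, b: str) -> int:
    """Character edit distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """1 - distance / longer length (one common normalisation, assumed)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def _tokens(text: str) -> list:
    return text.lower().split()


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    va, vb = Counter(_tokens(a)), Counter(_tokens(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def jaccard_coefficient(a: str, b: str) -> float:
    """Shared distinct tokens divided by all distinct tokens."""
    sa, sb = set(_tokens(a)), set(_tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def sorensen_similarity(a: str, b: str) -> float:
    """Twice the shared distinct tokens divided by the sum of set sizes."""
    sa, sb = set(_tokens(a)), set(_tokens(b))
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 1.0
```

Applied to a pre-print title and its published counterpart, each function returns a value in [0, 1]. The character-based measures react to any edit, while the token-set measures ignore word order and repetition.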
  • 16. Comparison of Sections “Analyzing News Events in Non-Traditional Digital Library Collections” M.Klein, P.Broadwell, 2015 http://dx.doi.org/10.1145/2756406.2756948
  • 17. Comparison of Sections
  • 18. Title Comparison [Figure: number of papers and % of all papers per similarity bin, from 1...0.9 down to 0.1...0 (1 = most similar), for the Length, Levenshtein, Cosine, Sorensen and Jaccard measures] Explore our findings at http://sologlo.library.ucla.edu/prepost
  • 19. Comparison of Sections
  • 20. Abstract Comparison [Figure: number of papers and % of all papers per similarity bin, from 1...0.9 down to 0.1...0 (1 = most similar), for the Length, Levenshtein, Cosine, Sorensen and Jaccard measures] Explore our findings at http://sologlo.library.ucla.edu/prepost
  • 21. 10.1016/j.physletb.2006.10.068 Physics Letters B
  • 22. Comparison of Sections
  • 23. Body Comparison [Figure: number of papers and % of all papers per similarity bin, from 1...0.9 down to 0.1...0 (1 = most similar), for the Length, Levenshtein, Cosine, Sorensen and Jaccard measures] Explore our findings at http://sologlo.library.ucla.edu/prepost
  • 24. Publication Dates [Figure: number of papers by number of days between pre-print and final published version, in 90-day bins from 1-90 to 631-720 plus >720; series: Pre-print first, Final published first]
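The 90-day bins shown on the Publication Dates slide could be reproduced along these lines. This is only a sketch under assumptions: the slides do not say which dates were compared or how same-day pairs were handled, so the function names, the "same day" category, and the example dates in the usage note are hypothetical.

```python
from collections import Counter
from datetime import date
from typing import Iterable, Tuple

BIN_WIDTH = 90   # days per bin, as on the slide's axis
LAST_EDGE = 720  # anything above this falls into the ">720" bin


def bin_label(days: int) -> str:
    """Map a positive day count to a label such as '1-90' or '>720'."""
    if days > LAST_EDGE:
        return ">720"
    start = ((days - 1) // BIN_WIDTH) * BIN_WIDTH + 1
    return f"{start}-{start + BIN_WIDTH - 1}"


def bin_date_gaps(pairs: Iterable[Tuple[date, date]]) -> Counter:
    """Count (pre-print date, published date) pairs per (order, bin) key."""
    counts = Counter()
    for preprint_date, published_date in pairs:
        gap = (published_date - preprint_date).days
        if gap > 0:
            counts[("pre-print first", bin_label(gap))] += 1
        elif gap < 0:
            counts[("final published first", bin_label(-gap))] += 1
        else:
            # Same-day pairs get their own key (an assumption).
            counts[("same day", "0")] += 1
    return counts
```

For example, a pre-print dated 2010-01-01 with a published version dated 2010-03-01 is 59 days apart and lands in the "pre-print first" series, bin 1-90.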
  • 25. Assembling a pre-print corpus Source: arXiv.org • 1.1 million publication records • Metadata (typical DC, including DOI) obtained via OAI-PMH interface • PDF versions of articles available via Amazon’s S3 service (using “requester pays” option) • Latest version used if multiple available • 35% of all arXiv papers have > 1 version • 58% of our matched papers have > 1 version • Repeat experiment with earliest version
  • 26. Publication Dates of Earliest Versions [Figure: number of papers by number of days between earliest pre-print version and final published version, in 90-day bins from 1-90 to 631-720 plus >720; series: Pre-print first, Final published first]
  • 27. Title Deltas [Figure: deltas in number of papers and % of all papers per similarity bin, from 1...0.9 down to 0.1...0, for the Length, Levenshtein, Cosine, Sorensen and Jaccard measures]
  • 28. Title Deltas [Figure: deltas in number of papers and % of all papers per similarity bin, from 1...0.9 down to 0.1...0, for the Length, Levenshtein, Cosine, Sorensen and Jaccard measures]
  • 29. Title Deltas [Figure: deltas in number of papers and % of all papers per similarity bin, from 1...0.9 down to 0.1...0, for the Length, Levenshtein, Cosine, Sorensen and Jaccard measures]
  • 30. Abstract Deltas [Figure: deltas in number of papers and % of all papers per similarity bin, from 1...0.9 down to 0.1...0, for the Length, Levenshtein, Cosine, Sorensen and Jaccard measures]
  • 31. Body Deltas [Figure: deltas in number of papers and % of all papers per similarity bin, from 1...0.9 down to 0.1...0, for the Length, Levenshtein, Cosine, Sorensen and Jaccard measures]
  • 32. Discussion & Future Work • Single corpus experiment • Pre-print/final published matches based on: • DOIs • CrossRef API results • UCLA serial subscriptions (majority Elsevier publications) • Expand to other disciplines/publishers • Overlay with ISI Impact factor and usage statistics • Refine extraction/comparison of authors and references • Operate at scale
  • 33. Comparing Published Scientific Journal Articles to Their Pre-print Versions Martin Klein Peter Broadwell @mart1nkle1n @peterbroadwell with Sharon E. Farb and Todd Grappone @farbthink, @liber8er {martinklein,broadwell,farb,grappone}@library.ucla.edu University of California Los Angeles