This document summarizes the first year of the PLOS Thesaurus project. It describes how PLOS worked with Access Innovations to build a new, improved thesaurus to enhance discovery of PLOS articles. The thesaurus launched in 2013 and now contains over 10,000 terms used to index PLOS articles. PLOS continues to refine the thesaurus and its applications, such as relative metrics and improvements to the peer review process.
5. Overview – today’s talk
The Solution: Good Thesaurus + Machine Aided Indexing
Building the new Thesaurus with Access Innovations (AI)
The initial implementation at plos.org
MAIstro integration into Publishing workflow
Thesaurus maintenance
The Service:
Content Discovery
Article Analysis
Relative Metrics
6. Starting point
2011 – the old Taxonomy
Inadequate
in content – just over 3100 specific terms
Inflexible
in structure – terms in pre-defined paths
Housed in Editorial Manager
ossified and difficult to update
Author-chosen terms - association with article
7. PLOS delivered to Access Innovations….
A copy of the old PLOS Taxonomy
Over 2,000 suggested changes
“Research analysis and methods” branch request
Use cases:
Subject Area-based searches
Hierarchy-based exploration of our corpus
Email Alerts based on Subject Area searches
RSS Feeds based on Subject Areas
8. Access Innovations added:
STEM vocabulary
Broader/Narrower term relationships
Rules for the Machine Aided Indexing
Synonyms
Analysis with respect to the PLOS corpus
… to and fro with PLOS …
Result:
Vastly improved ANSI/NISO Z39.19-compliant thesaurus
10. Top-level Terms
1. Biology and life sciences
2. Computer and information sciences
3. Earth sciences
4. Engineering and technology
5. Environmental sciences and ecology
6. Medicine and health sciences
7. Physical sciences
8. Research and analysis methods
9. Science policy
10. Social sciences
11. Infrastructure
PLOS Taxonomy server:
Thesaurus – plos2012thes
Data Harmony Thesaurus Master and
MAI Rule Builder
Corpus fed to the Taxonomy Server for MAI
Article by article
Initial implementation:
Title – Abstract - Results – Methods
Top 8 hits selected
13. Learning curve – teething troubles
Not all articles had Subject Area terms – why not?
Initial implementation – text to index:
Title + Abstract* + Results + Methods
Upon consideration – text to index:
Full Text (though not references)
Implementation of “all paths”
Polyhierarchy implications
14. Consider “White blood cells”
Biology and life sciences > Immunology > Immune cells > White blood cells
Medicine and health sciences > Immunology > Immune cells > White blood cells
Biology and life sciences > Cell biology > Cellular types > Animal cells > Blood cells > White blood cells
Biology and life sciences > Cell biology > Cellular types > Animal cells > Immune cells > White blood cells
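The "all paths" idea can be illustrated with a minimal sketch. It assumes the thesaurus is available as a mapping from each term to its broader terms; the `BROADER` dictionary and the `all_paths` function are hypothetical names for illustration, not part of MAIstro or Data Harmony. The links are read off the "White blood cells" paths on this slide.

```python
# Hypothetical broader-term links, transcribed from the "White blood cells"
# example. In a polyhierarchy a term may have more than one broader term.
BROADER = {
    "White blood cells": ["Immune cells", "Blood cells"],
    "Immune cells": ["Immunology", "Animal cells"],
    "Blood cells": ["Animal cells"],
    "Immunology": ["Biology and life sciences", "Medicine and health sciences"],
    "Animal cells": ["Cellular types"],
    "Cellular types": ["Cell biology"],
    "Cell biology": ["Biology and life sciences"],
}


def all_paths(term, broader=BROADER):
    """Return every path from a top-level term down to `term`."""
    parents = broader.get(term, [])
    if not parents:
        # No broader term recorded: treat this as a top-level term.
        return [[term]]
    paths = []
    for parent in parents:
        for path in all_paths(parent, broader):
            paths.append(path + [term])
    return paths


for path in all_paths("White blood cells"):
    print(" > ".join(path))
```

Indexing an article with "White blood cells" under "all paths" would then attach every one of these ancestor chains, which is why a single term can surface under more than one top-level branch in search.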
The polyhierarchy and Search
15. Establishing update cycle - articles:
Initial implementation:
Entire back-corpus indexed at once
New Papers:
PLOS submits text to MAIstro at publication
MAI returns terms and term frequencies
PLOS stores terms in search engine
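As a rough sketch of the step between "MAI returns terms and term frequencies" and "PLOS stores terms", the snippet below keeps the most frequent terms for an article. It assumes the indexer's response has already been parsed into a term-to-frequency dictionary; that format, the function name `select_top_terms`, and the sample counts are all assumptions for illustration. The cut-off of 8 comes from the "Top 8 hits selected" note earlier in the talk.

```python
TOP_N = 8  # the talk mentions selecting the top 8 hits


def select_top_terms(term_frequencies, top_n=TOP_N):
    """Keep the top_n most frequent terms; ties are broken alphabetically."""
    ranked = sorted(term_frequencies.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _count in ranked[:top_n]]


# Made-up frequencies for a single article (illustrative only).
example = {
    "White blood cells": 12,
    "Immune cells": 9,
    "T cells": 9,
    "Obesity": 1,
}
print(select_top_terms(example, top_n=3))
```

Breaking ties alphabetically is a design choice made here only so that repeated indexing runs give the same result; the source does not say how ties are handled.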
16. Establishing update cycle - thesauri:
Separate instances (nerves):
Production server – plosthes.2013-6
Working version – plosthes.2013-7
When ready to release a new version:
Load onto test server – MAI corpus - Index
Test: new/changed/deleted terms
rule changes
structural changes
any implementation changes
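The "new/changed/deleted terms" check can be sketched as a simple comparison of two thesaurus versions. This assumes each version has been exported as a mapping from term to its set of broader terms; the export format, the `diff_thesauri` function, and the sample terms and relationships are hypothetical. Rule changes and implementation changes are not covered by this sketch.

```python
def diff_thesauri(old, new):
    """Compare two versions given as {term: set of broader terms}."""
    old_terms, new_terms = set(old), set(new)
    return {
        "new": sorted(new_terms - old_terms),
        "deleted": sorted(old_terms - new_terms),
        # A term present in both versions counts as changed when its
        # broader terms differ (a structural change).
        "changed": sorted(t for t in old_terms & new_terms if old[t] != new[t]),
    }


# Made-up extracts standing in for plosthes.2013-6 and plosthes.2013-7.
production = {
    "T cells": {"Immune cells"},
    "Obesity": {"Medicine and health sciences"},
    "Leucocytes": {"Blood cells"},
}
working = {
    "T cells": {"Immune cells", "White blood cells"},
    "Memory T cells": {"T cells"},
    "Obesity": {"Medicine and health sciences"},
}

report = diff_thesauri(production, working)
for kind, terms in report.items():
    print(kind, terms)
```

A report like this gives a checklist of terms to spot-check on the test server after the corpus has been re-indexed.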
26. Thesaurus updates – prioritisation?
Mis-hits and missed-term reports:
Ourselves:
article pages
Our readers:
in email
complaints on Twitter
in correspondence with our editorial staff
via Journal and Saved Search alerts
via article pages – Flagged Term reports
28. Things we learned – Thesaurus editorial
Tension:
strict and rigorous taxonomy/ontology construction
vs
user utility
Abbreviations and Synonyms
Issues that continue to exercise us:
T cells/Memory T cells
Obesity/Childhood obesity
When should we make both explicit?
Rule work – working to top 8
50. Relative Metrics:
Defining a Paper’s Peer Group
1. Group papers by Subject Area
Accommodate multiple topics per paper
2. Group papers by age
Important for comparison of cumulative measures like
total downloads or citations
3. Determine norms for peer group
The average usage of each paper is compared with the
median usage of its peer group
More on Relative Metrics at:
http://www.plosone.org/static/almInfo#relativeMetrics
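The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the PLOS implementation: papers are grouped by Subject Area and publication year, usage is a single view count, and the record layout, function names, and numbers are all made up. A paper with several Subject Areas joins one peer group per area.

```python
from collections import defaultdict
from statistics import median

# Made-up records for illustration only.
papers = [
    {"id": "p1", "subject_areas": ["Immunology"], "year": 2012, "views": 1200},
    {"id": "p2", "subject_areas": ["Immunology", "Cell biology"], "year": 2012, "views": 800},
    {"id": "p3", "subject_areas": ["Immunology"], "year": 2012, "views": 400},
    {"id": "p4", "subject_areas": ["Cell biology"], "year": 2012, "views": 200},
]


def peer_group_medians(papers):
    """Median views for each (Subject Area, year) peer group."""
    groups = defaultdict(list)
    for paper in papers:
        for area in paper["subject_areas"]:
            groups[(area, paper["year"])].append(paper["views"])
    return {key: median(values) for key, values in groups.items()}


def relative_usage(paper, medians):
    """Ratio of a paper's views to the median of each of its peer groups."""
    ratios = {}
    for area in paper["subject_areas"]:
        norm = medians[(area, paper["year"])]
        # Leave the ratio undefined when the peer-group median is zero.
        ratios[area] = paper["views"] / norm if norm else None
    return ratios


medians = peer_group_medians(papers)
print(relative_usage(papers[1], medians))
```

Grouping by age matters because cumulative counts keep growing: comparing a paper only with others from the same year avoids penalising recent papers for having had less time to accumulate views.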
56. The PLOS Thesaurus and Peer Review
Maintaining a copy of the PLOS thesaurus in Editorial Manager helps with editor and reviewer matching
Classifications for People
Classifications for Papers
57. The PLOS Thesaurus and Peer Review
• Authors select Subject Area terms related to their article
submissions
• Editors and Reviewers select terms that represent their
areas of expertise
• Staff and Editors use these terms to help ensure editors
and reviewers are well matched to the submissions they
are handling
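One simple way to picture this matching is to rank candidate reviewers by how many Subject Area terms they share with a submission. The talk does not describe a scoring method, so the shared-term count below is an assumption, and the function name, reviewer labels, and term selections are hypothetical.

```python
def rank_reviewers(submission_terms, reviewer_terms):
    """Rank reviewers by the number of terms shared with the submission."""
    submission = set(submission_terms)
    scored = []
    for name, terms in reviewer_terms.items():
        shared = submission & set(terms)
        if shared:  # skip reviewers with no overlap at all
            scored.append((name, sorted(shared)))
    # Most shared terms first; names break ties for a stable order.
    scored.sort(key=lambda item: (-len(item[1]), item[0]))
    return scored


# Made-up term selections for illustration only.
submission = ["Immunology", "White blood cells", "T cells"]
reviewers = {
    "Reviewer A": ["Immunology", "T cells", "Memory T cells"],
    "Reviewer B": ["Obesity", "Childhood obesity"],
    "Reviewer C": ["White blood cells"],
}

for name, shared in rank_reviewers(submission, reviewers):
    print(name, shared)
```

Returning the shared terms alongside each name lets staff and editors see why a reviewer was suggested, rather than only a score.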
58. Planned Enhancements
• Automate the application of terms associated with
Editors, Reviewers and submitted articles with MAIstro
• Provide Editors and Staff with detailed terms to assist
with reviewer selection and vetting
– Academic disciplines help Editors gauge Subject Area
relevance of potential Reviewers
– Methods, protocols and model organisms help Editors
gauge technical suitability of potential Reviewers
59. Dramatis personae:
Jonas Dupuich, Product Manager
Patrick Polischuk, Product Manager
Sebastian Toomey, Interaction Designer
Jennifer Lin, Senior Product Manager
Martin Fenner, ALM Technical Lead
Kallie Huss, Senior Publications Assistant
John Chodacki, Director - Product Management