One-week project, out of curiosity: this presentation analyzes more than 300,000 abstracts from PubMed to identify common themes and trends in biotech research. The bulk of the analysis was performed using natural language processing (NLP) and machine learning (ML) on the titles and abstract contents.
I found that a paper's abstract alone is strongly predictive of its future impact (as measured by citation count).
1. Fletcher Series. 2016 Aug 26;1(1-10)
Abstracts Matter. But...
How much so?
Rascon CA¹
¹cynthia.alexander@gmail.com, San Francisco CA, 94105, USA.
Abstract
The number of times a scientific paper is cited (citation count) has emerged as a proxy for a paper’s
success within its field. Here, I aim to address how relevant an abstract is to a scientific publication,
and furthermore which features of such abstracts have the largest impact on a paper’s success (as
estimated by citation count).
The data set comprised all abstracts of scientific papers from 22 top biotech journals published in
the period 1995-2016, a total of 310,175 papers. Journal names and the affiliations of the heads of
laboratories were not incorporated in this model, which aimed to be based solely on each paper’s
title and abstract content. Data cleaning and feature engineering, relying largely on NLP metrics (LSA, Tf-idf,
POS tagger), gave good insight into what best predicts citation count across the
2. Biotech papers have a steady
trending curve
Figure 1. Number of citations per paper by year of publication. After cleaning, the corpus
comprises 202,173 abstracts. Each cyan dot represents a single paper
(transparency 0.3).
3. A journal's prestige depends
on its impact factor
Figure 2. Journals used for the data set and the number of citations per paper
published between 1995 and 2010, shown as a violin plot. These differences reflect, to some
extent, each journal's impact factor (the yearly average number of citations).
4. Figure 3. Final set of 134,374 papers (1995-2010). The
total number of citations per paper (target, y) was
binned into two classes: fewer than 10, or 10 or more, total citations
since the paper’s publication date (0 or 1, respectively).
(Left side: example of an abstract and its citation count.)
Abstracts binned in two classes:
0 for 1-9 (25%), or 1 for 10 or more (75%) total citations
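The binning described on this slide can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the function name `bin_citations` and the sample counts are made up for the example, and the threshold of 10 is taken from the slide.

```python
from collections import Counter


def bin_citations(counts, threshold=10):
    """Map each citation count to 0 (below threshold) or 1 (threshold or more)."""
    return [1 if c >= threshold else 0 for c in counts]


# Hypothetical citation counts, for illustration only.
sample_counts = [3, 9, 10, 42, 1, 17, 250, 8]
labels = bin_citations(sample_counts)

# Check the class balance: the slide reports an uneven split
# (about 25% vs. 75%), which matters when reading ROC and
# precision/recall curves later on.
balance = Counter(labels)
print(labels)
print({cls: n / len(labels) for cls, n in sorted(balance.items())})
```

With classes this uneven in the real data, accuracy alone would be a weak yardstick, which is presumably why the later slides report ROC and precision/recall curves.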
5. LSA, Tf-idf, and POS tagging
selected as star features, with Random
Forests as the model of choice
Figure 4. ROC and Precision/Recall curves for the top performing models.
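A pipeline of this general shape (Tf-idf, then LSA, then a Random Forest, scored by ROC AUC and average precision) might look like the sketch below, assuming scikit-learn. It is not the project's actual code: the corpus is synthetic and trivially separable, LSA uses 2 components instead of the 100 mentioned in the deck, and the POS-tag and abstract-length features are left out.

```python
import random

from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic toy corpus (NOT the PubMed data): two made-up vocabularies,
# one per class, so the example runs without any external files.
rng = random.Random(0)
vocab_low = ["protein", "mutant", "mice", "regulation", "rna", "transcript"]
vocab_high = ["genome", "sequencing", "method", "data", "cell", "expression"]

abstracts, labels = [], []
for i in range(60):
    label = i % 2
    vocab = vocab_high if label else vocab_low
    abstracts.append(" ".join(rng.choices(vocab, k=8)))
    labels.append(label)

X_train, X_test, y_train, y_test = train_test_split(
    abstracts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Tf-idf weights -> LSA (truncated SVD of the Tf-idf matrix) -> Random Forest.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lsa", TruncatedSVD(n_components=2, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(X_train, y_train)

# Probability of class 1 (10 or more citations in the deck's setup).
scores = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, scores)
avg_precision = average_precision_score(y_test, scores)
print(f"ROC AUC: {roc_auc:.2f}, average precision: {avg_precision:.2f}")
```

Scores on this toy corpus say nothing about the real task; the point is only how the three stages chain together and how the two curve summaries are computed.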
6. Model over the last 5 years (2005-2009)
to predict the ‘success’ of 2010 papers:
Figure 5. ROC and Precision/Recall curves for the top performing models. This time,
the models were trained on 2005-2009 papers to predict the ‘success’ of 2010 papers.
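The year-based split behind this slide (train on 2005-2009, test on 2010) can be sketched as follows. The helper `temporal_split` and the paper records are hypothetical, written only to show the idea of keeping test-year papers out of training.

```python
def temporal_split(papers, train_years, test_year):
    """Split papers by publication year so no test-year paper leaks into training."""
    train_years = set(train_years)
    train = [p for p in papers if p["year"] in train_years]
    test = [p for p in papers if p["year"] == test_year]
    return train, test


# Hypothetical records, for illustration only.
papers = [
    {"year": 2004, "title": "hypothetical paper A"},
    {"year": 2005, "title": "hypothetical paper B"},
    {"year": 2007, "title": "hypothetical paper C"},
    {"year": 2009, "title": "hypothetical paper D"},
    {"year": 2010, "title": "hypothetical paper E"},
    {"year": 2010, "title": "hypothetical paper F"},
    {"year": 2011, "title": "hypothetical paper G"},
]

train, test = temporal_split(papers, range(2005, 2010), 2010)
print([p["year"] for p in train])
print([p["year"] for p in test])
```

Unlike a random split, this mimics the forecasting setting: the model only ever sees papers published before the ones it is asked to score.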
7. Features identified as important by RF for
predicting the success of the coming year’s papers:
Figure 6. Feature importances as ranked by Random Forests, for a model trained on 2005-2009 and
tested on 2010 papers. Legend: * abstract LSA (100 components); ** abstract LSA on Tf-idf (100 components); *** title LSA.
Features shown on the plot axis, in the order listed: C2 **, C2 *, C4 *, C7 **, C4 **, POS tag ‘:’, C8 **, C5 **, abstract length, C3 **, C1 *, C31 ***, C15 **, C15 *, C14 *, C16 **, C3 *, C6 *, POS tag ‘.’, C29 **
1st – Next Generation Sequencing
sequenc: 0.20, method: 0.17, data: 0.16, genom: 0.16, avail: 0.14
2nd – Cellular regulation / gene expression
cell: 0.71, activ: 0.19, induc: 0.08, regul: 0.08, mice: 0.07
3rd – Cellular models (methods)
cell: 0.28, use: 0.23, data: 0.19, method: 0.17, model: 0.16
4th – Applied genomics (mutants)
genom: 0.25, sequenc: 0.25, protein: 0.19, mutant: 0.12, human: 0.11
5th – Basic research (DNA related)
gene: 0.28, dna: 0.27, rna: 0.20, transcript: 0.20, genom: 0.17
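The term weights listed above look like the largest loadings of individual LSA components; assuming that reading, they could be extracted roughly as sketched below with scikit-learn. The toy documents and the helper `top_terms` are made up for the example, and no stemming is applied here, whereas the slide's terms ("sequenc", "genom") suggest the original pipeline stemmed its tokens.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, for illustration only (not the PubMed corpus).
docs = [
    "genome sequencing method data",
    "sequencing data genome available",
    "cell expression regulation mice",
    "cell regulation induced expression",
    "gene dna rna transcript",
    "dna rna gene genome",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

# Recover terms in column order from the fitted vocabulary mapping.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def top_terms(component, terms, n=5):
    """Return the n terms with the largest absolute loading on one component."""
    order = np.argsort(np.abs(component))[::-1][:n]
    return [(terms[i], round(float(component[i]), 2)) for i in order]


for idx, component in enumerate(svd.components_, start=1):
    print(f"C{idx}:", top_terms(component, terms))
```

Naming a component ("Next Generation Sequencing", "Basic research") is then a manual judgment made from its top terms; the sign of an SVD component is arbitrary, which is why the sketch ranks by absolute loading.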
8. Abstracts matter about:
81%
Need to consider:
Are better scientists simply better communicators?
Or… are great scientists also really good at
communicating?
I did not incorporate a feature to account for
novelty (quite the opposite).
It is circular to say that the more papers exist in a field,
the more likely a paper in that field is to be cited in the future.
However, this suggests that trends exist in
academia. *duh*
9. Abstracts matter about:
81%
Future directions:
Multi-class case
Extend prediction forecast window. 2017??
Examine those abstracts on which the model did
poorly.
Flask app to ‘score’ new abstracts.
Time series, model topic trends over time. Is it too
early or is it too late for a paper to come out?
Editor's notes
The impact factor (IF) of an academic journal is a measure reflecting the yearly average number of citations to recent articles published in that journal.
Took some time to get to this curve, data cleaning