This document analyzes the representativeness of different parts of scientific documents, including abstracts and sections related to the approach, outcome, and background. It finds that summaries created from the approach, outcome, or background better represent the full document and related documents than abstracts, based on measures of internal and external representativeness. Future work will use probabilistic topic models better suited to short texts.
An initial analysis of topic-based similarity among scientific documents based on their rhetorical discourse parts
1. An initial Analysis of
Topic-based Similarity
among Scientific Documents
based on their
Rhetorical Discourse Parts
ocorcho@fi.upm.es
@ocorcho ISWC’17
oeg-upm.net
Carlos Badenes-Olmedo
Jose Luis Redondo-Garcia
Oscar Corcho
Ontology Engineering Group
Universidad Politécnica de Madrid
Spain
2. Motivation
How representative is an abstract?
Scientific Research
Practitioners
Reviewers
3. Motivation
How representative are summaries based on scientific discourse categories?
Scientific Research
Practitioners
Reviewers
approach
challenge
background
outcome
future work
4. Representativeness
Summary vs. Full-Paper
• Internal: describing main ideas
• External: finding related items
5. Probabilistic Topic Models
• Each document is a mixture of corpus-wide topics
• Each topic is a distribution over words
• Each word is drawn from one of those topics
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research,
Latent Dirichlet Allocation (LDA)
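The three bullets above describe LDA's generative story. As a rough illustration only, the following sketch samples a toy document under that story. The two topics, the six-word vocabulary, the `alpha` value and the function names are all assumptions made for the example; none of them come from the slides, and this is not the model used in the experiments.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional symmetric Dirichlet sample by normalising Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, rng):
    """Sample one document following the LDA generative story."""
    n_topics = len(topic_word)
    vocab = range(len(topic_word[0]))
    # Each document is a mixture of corpus-wide topics.
    theta = sample_dirichlet(alpha, n_topics, rng)
    words = []
    for _ in range(n_words):
        # Each word is drawn from one of those topics.
        z = rng.choices(range(n_topics), weights=theta)[0]
        w = rng.choices(vocab, weights=topic_word[z])[0]
        words.append(w)
    return theta, words

# Each topic is a distribution over words (hypothetical toy vocabulary of 6 word ids).
TOPIC_WORD = [
    [0.40, 0.40, 0.10, 0.05, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.10, 0.40, 0.40],
]

rng = random.Random(0)
theta, words = generate_document(TOPIC_WORD, alpha=0.5, n_words=20, rng=rng)
```

Here `theta` plays the role of the topic distribution that later slides compare between a full paper and its summaries.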
6. Representativeness Measure
• Internal: JSD-based similarity between the topic distribution of the Full-Paper [d1,d2,d3,..dn] and that of the Summary [s1,s2,s3,..sn]
• External: precision / recall / f-measure over the sets of related documents found through JSD-based similarity from the Full-Paper and from the Summary
• Feature vectors in Topic Models are topic distributions expressed as vectors of probabilities
• The similarity measure used in our analysis is based on the Jensen-Shannon Divergence (JSD)
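A minimal sketch of a JSD-based similarity between two topic distributions follows. The slides only state that the measure is based on the Jensen-Shannon Divergence; the base-2 logarithm and the `1 - JSD` mapping used here are assumptions chosen so that the score falls in [0, 1], and the function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two probability vectors of equal length."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def jsd_similarity(p, q):
    """Assumed mapping: with base-2 logs JSD lies in [0, 1], so 1 - JSD is a similarity."""
    return 1.0 - jsd(p, q)

# Hypothetical topic distributions of a full paper and of one of its summaries.
full_paper = [0.70, 0.20, 0.10]
summary = [0.60, 0.30, 0.10]
score = jsd_similarity(full_paper, summary)
```

Identical distributions score 1 and distributions with disjoint support score 0 under this mapping.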
7. Evaluation
• Sources (Elsevier API): Advances in Space Research, Procedia Chemistry, Journal of Pharmaceutical Analysis, Journal of Web Semantics
• 1000 papers ( + abstracts)
• Discover rhetorical parts: 1000 papers ( + abstracts, + discourse parts)
• Topic Model: training (only full-papers), then inference
• Network of related papers ( + abstracts + discourse parts)
8. Evaluation
Explore a Corpus
• Advances in Space Research Corpus:
http://librairy.linkeddata.es/resources/domains/aisr
• Procedia Chemistry Corpus:
http://librairy.linkeddata.es/resources/domains/pc
• Journal of Pharmaceutical Analysis Corpus:
http://librairy.linkeddata.es/resources/domains/jopa
• Journal of Web Semantics Corpus:
http://librairy.linkeddata.es/resources/domains/jows
• Test Corpus:
http://librairy.linkeddata.es/resources/domains/group1
• Topics in a Corpus:
http://librairy.linkeddata.es/resources/domains/group1/topics?words=10
• Papers in a Corpus:
http://librairy.linkeddata.es/resources/domains/group1/items?size=10
9. Evaluation
Full-Paper
• Info:
http://librairy.linkeddata.es/resources/items/2-s2.0-84924147106?content=true
• Parts:
http://librairy.linkeddata.es/resources/items/2-s2.0-84924147106/parts
• abstract:
http://librairy.linkeddata.es/resources/parts/adfe85d9634654e4cfd7148be7cd2b29?content=true
• approach:
http://librairy.linkeddata.es/resources/parts/83f2b9722953034d7b6b50cbead4ec6b?content=true
• outcome:
http://librairy.linkeddata.es/resources/parts/61452a5ec420c8926160ae748c12a826?content=true
• challenge:
http://librairy.linkeddata.es/resources/parts/8858ef323fc09efbdcd46b9de45f146c?content=true
• background:
http://librairy.linkeddata.es/resources/parts/d118ef60d5e874d69d92c6b07be68b61?content=true
• future-work:
http://librairy.linkeddata.es/resources/parts/92be5400df5bb331e5f7f692e6b05bca?content=true
• Topic Distribution of Full-Paper:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/topics?words=15
• Topic Distribution of Abstract:
http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/topics?words=15
• Similarity between Full-Paper and Abstract:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&relatedId=adfe85d9634654e4cfd7148be7cd2b29
• Similarity between Full-Paper and Approach content:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&relatedId=83f2b9722953034d7b6b50cbead4ec6b
Internal Representativeness
10. Evaluation
• Similar papers to Full-Paper:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&resourceType=item&size=5
• Similar papers to Abstract:
http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/relations?type=similarity&resourceType=item&size=5
• Similar papers to Approach content:
http://librairy.linkeddata.es/resources/domains/group1/parts/83f2b9722953034d7b6b50cbead4ec6b/relations?type=similarity&resourceType=item&size=5
• Similar summaries to a Full-Paper:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&resourceType=part&size=5
• Similar summaries to an Abstract:
http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/relations?type=similarity&resourceType=part&size=5
• Similar summaries to Approach:
http://librairy.linkeddata.es/resources/domains/group1/parts/83f2b9722953034d7b6b50cbead4ec6b/relations?type=similarity&resourceType=part&size=5
External Representativeness
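The similarity links on the two Evaluation slides follow one URL pattern. The sketch below only rebuilds those URLs from their components; it performs no request, and nothing is assumed about the response format, which the slides do not show. The helper name `relations_url` and its parameters are hypothetical.

```python
from urllib.parse import urlencode

BASE = "http://librairy.linkeddata.es/resources"

def relations_url(domain, resource_kind, resource_id, **params):
    """Build a similarity-relations URL for an item or part inside a domain.

    resource_kind is "items" (full papers) or "parts" (abstracts, discourse parts),
    mirroring the path segments seen in the slide URLs.
    """
    query = urlencode({"type": "similarity", **params})
    return f"{BASE}/domains/{domain}/{resource_kind}/{resource_id}/relations?{query}"

# Internal view: similarity between a full paper and its abstract.
internal = relations_url(
    "group1", "items", "2-s2.0-84924147106",
    relatedId="adfe85d9634654e4cfd7148be7cd2b29",
)

# External view: the five papers most similar to the full paper.
external = relations_url(
    "group1", "items", "2-s2.0-84924147106",
    resourceType="item", size=5,
)
```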
11. Results: Size of Summaries
The approach, background and outcome content of a paper generate more accurate topic distributions than those created from other summaries, such as the abstract.
Since LDA treats documents as bags of words, text length affects the accuracy of the topic distributions inferred by the model.
Figure: Relative size of summaries with respect to the full paper
Figure: Absolute size of summaries (in number of characters)
12. Results: Internal Representativeness
• The Internal Representativeness of a summary measures the similarity of this summary to the original full-text research paper
• This similarity is based on the JSD between the topic distributions of the two texts
• Results suggest that the topic distribution of the text created from the approach content is the most similar to that of the full content of the paper
Figure: internal-representativeness
13. Results: External Representativeness
• The External Representativeness of a summary measures how much the set of related documents obtained from the summary differs from the set derived from the original text
• Similarity thresholds from 0.5 to 0.99 were considered in the experiments
Figures: precision, recall
• In terms of recall, the upward trend followed by the approach, outcome and background content supports the assumption that summaries containing key words allow more similar papers to be discovered than other summaries
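The external measure can be sketched as a set comparison: documents related to the full paper form the reference set, and documents related to the summary form the retrieved set. This is a minimal sketch under assumptions: the similarity scores are invented, "related" is taken to mean a similarity at or above the threshold, and the function names are hypothetical.

```python
def related_set(similarities, threshold):
    """Documents whose similarity score reaches the threshold."""
    return {doc for doc, score in similarities.items() if score >= threshold}

def external_representativeness(full_sims, summary_sims, threshold):
    """Precision, recall and f-measure of the summary's related set
    against the related set of the full paper."""
    reference = related_set(full_sims, threshold)
    retrieved = related_set(summary_sims, threshold)
    hits = len(reference & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical similarity scores to four other papers.
full_sims = {"a": 0.90, "b": 0.80, "c": 0.60, "d": 0.30}
summary_sims = {"a": 0.85, "b": 0.40, "c": 0.70, "e": 0.75}

precision, recall, f_measure = external_representativeness(
    full_sims, summary_sims, threshold=0.5
)
```

Repeating the call over thresholds between 0.5 and 0.99 would give the kind of curves the slide plots.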
14. Results: External Representativeness
Figure: f-measure
• For higher similarity thresholds, i.e. for strongly related papers, the recommendations discovered by using the approach are more precise than those discovered by using the abstract.
15. Conclusions
• We have studied the topic-based similarities among scientific documents, comparing their abstracts with summaries corresponding to their scientific discourse categories.
• Two novel measures have been proposed: (1) internal-representativeness and (2) external-representativeness.
• Results show that summaries created from the approach, outcome or background content of a paper describe its full content more accurately than abstracts, in terms of both overall ideas and related documents.
• To avoid the size of the summaries influencing the accuracy of the results, in future work we plan to use probabilistic topic model algorithms designed for short texts, such as BTM.