An initial Analysis of
Topic-based Similarity
among Scientific Documents
based on their
Rhetorical Discourse Parts
ocorcho@fi.upm.es
@ocorcho ISWC’17
oeg-upm.net
Carlos Badenes-Olmedo
Jose Luis Redondo-Garcia
Oscar Corcho
Ontology Engineering Group
Universidad Politécnica de Madrid
Spain
Motivation
How representative is an abstract?
Scientific Research
Practitioners
Reviewers
Motivation
How representative are summaries based
on scientific discourse categories?
Scientific Research
Practitioners
Reviewers
approach
challenge
background
outcome
future work
Representativeness
Full-Paper
Summary
Internal
External
finding related items
describing main ideas
Probabilistic Topic Models
• Each document is a mixture of corpus-wide topics
• Each topic is a distribution over words
• Each word is drawn from one of those topics
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research,
Latent Dirichlet Allocation (LDA)
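The three statements above describe LDA's generative story. The following is a minimal sketch of that story using only the Python standard library. The vocabulary, the two topics, the Dirichlet parameter `alpha` and the function names are all illustrative assumptions, not values or code from the study.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(topics, vocabulary, alpha, length, rng):
    """Generate one document following the LDA generative process."""
    # Each document is a mixture of corpus-wide topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Each word is drawn from one of those topics...
        z = sample_index(theta, rng)
        # ...and each topic is a distribution over words.
        w = sample_index(topics[z], rng)
        words.append(vocabulary[w])
    return theta, words


# Hypothetical toy corpus settings (assumptions for illustration only).
vocabulary = ["orbit", "satellite", "ontology", "query"]
topics = [
    [0.5, 0.5, 0.0, 0.0],  # a "space" topic
    [0.0, 0.0, 0.5, 0.5],  # a "semantic web" topic
]
rng = random.Random(0)
theta, words = generate_document(topics, vocabulary, [0.5, 0.5], 20, rng)
print(theta, words)
```

Training a topic model inverts this process: given only the words, it estimates the topics and each document's mixture `theta`, which is the vector later compared across papers and summaries.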
Representativeness Measure
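Internal: JSD-based similarity between the Full-Paper topic vector [d1,d2,d3,..dn] and the Summary topic vector [s1,s2,s3,..sn]
External: precision / recall / f-measure over the documents related (by JSD-based similarity) to the Full-Paper and to the Summary, drawn from other topic vectors such as [h1,h2,..hn], [j1,j2,..jn], [k1,k2,..kn], [m1,m2,..mn]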
• Feature vectors in Topic Models are topic distributions expressed as vectors of probabilities
• The similarity measure used in our analysis is based on the Jensen-Shannon Divergence (JSD)
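For reference, the standard definition of the Jensen-Shannon Divergence between two topic distributions P = [p1,..pn] and Q = [q1,..qn] is written out below. The first two lines are the textbook definition. The last line, which turns the divergence into a similarity, is only an assumed transform, since the slide text is cut off before stating the exact formula used.

```latex
\[
\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q)
\]
\[
\mathrm{KL}(P \,\|\, M) = \sum_{i=1}^{n} p_i \log_2 \frac{p_i}{m_i}
\]
\[
\mathrm{sim}(P, Q) = 1 - \mathrm{JSD}(P \,\|\, Q) \quad \text{(assumed transform)}
\]
```

With base-2 logarithms the divergence lies in [0, 1], so the assumed similarity also lies in [0, 1], with 1 meaning identical distributions.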
Evaluation
• Sources (Elsevier API): Advances in Space Research, Procedia Chemistry, Journal of Pharmaceutical Analysis, Journal of Web Semantics
• 1000 papers (+ abstracts) → discover rhetorical parts → 1000 papers (+ abstracts, + discourse parts)
• Topic Model: training (only full-papers), then inference → network of related papers (+ abstracts + discourse parts)
Evaluation
• Advances in Space Research Corpus: http://librairy.linkeddata.es/resources/domains/aisr
• Procedia Chemistry Corpus: http://librairy.linkeddata.es/resources/domains/pc
• Journal of Pharmaceutical Analysis Corpus: http://librairy.linkeddata.es/resources/domains/jopa
• Journal of Web Semantics Corpus: http://librairy.linkeddata.es/resources/domains/jows
• Test Corpus: http://librairy.linkeddata.es/resources/domains/group1
Explore a Corpus
• Topics in a Corpus:
http://librairy.linkeddata.es/resources/domains/group1/topics?words=10
• Papers in a Corpus:
http://librairy.linkeddata.es/resources/domains/group1/items?size=10
Evaluation
Full-Paper
• Info:
http://librairy.linkeddata.es/resources/items/2-s2.0-84924147106?content=true
• Parts:
http://librairy.linkeddata.es/resources/items/2-s2.0-84924147106/parts
• abstract:
http://librairy.linkeddata.es/resources/parts/adfe85d9634654e4cfd7148be7cd2b29?content=true
• approach:
http://librairy.linkeddata.es/resources/parts/83f2b9722953034d7b6b50cbead4ec6b?content=true
• outcome:
http://librairy.linkeddata.es/resources/parts/61452a5ec420c8926160ae748c12a826?content=true
• challenge:
http://librairy.linkeddata.es/resources/parts/8858ef323fc09efbdcd46b9de45f146c?content=true
• background:
http://librairy.linkeddata.es/resources/parts/d118ef60d5e874d69d92c6b07be68b61?content=true
• future-work:
http://librairy.linkeddata.es/resources/parts/92be5400df5bb331e5f7f692e6b05bca?content=true
• Topic Distribution of Full-Paper:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/topics?words=15
• Topic Distribution of Abstract:
http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/topics?words=15
• Similarity between Full-Paper and Abstract:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&relatedId=adfe85d9634654e4cfd7148be7cd2b29
• Similarity between Full-Paper and Approach content:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&relatedId=83f2b9722953034d7b6b50cbead4ec6b
Internal Representativeness
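The resource URLs on this slide follow a regular pattern, so they can be assembled from a domain, an item identifier and a part identifier. The sketch below only builds the strings and makes no network request. The function names are hypothetical helpers, and the pattern is inferred from the URLs listed above rather than from any API documentation.

```python
from urllib.parse import urlencode

# Base address taken from the URLs on the slides.
BASE = "http://librairy.linkeddata.es/resources"


def topics_url(domain, kind, resource_id, words=15):
    """URL of the topic distribution of an item ("items") or a part ("parts")."""
    query = urlencode({"words": words})
    return f"{BASE}/domains/{domain}/{kind}/{resource_id}/topics?{query}"


def similarity_url(domain, item_id, related_id):
    """URL of the similarity between a full paper and one of its parts."""
    query = urlencode({"type": "similarity", "relatedId": related_id})
    return f"{BASE}/domains/{domain}/items/{item_id}/relations?{query}"


# Identifiers copied from the slide.
paper = "2-s2.0-84924147106"
abstract = "adfe85d9634654e4cfd7148be7cd2b29"

print(topics_url("group1", "items", paper))
print(topics_url("group1", "parts", abstract))
print(similarity_url("group1", paper, abstract))
```

Whether the service is still reachable at these addresses is not something the slides guarantee, so any client built on this pattern should handle failed requests.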
Evaluation
• Similar papers to Full-Paper:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&resourceType=item&size=5
• Similar papers to Abstract:
http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/relations?type=similarity&resourceType=item&size=5
• Similar papers to Approach content:
http://librairy.linkeddata.es/resources/domains/group1/parts/83f2b9722953034d7b6b50cbead4ec6b/relations?type=similarity&resourceType=item&size=5
• Similar summaries to a Full-Paper:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&resourceType=part&size=5
• Similar summaries to an Abstract:
http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/relations?type=similarity&resourceType=part&size=5
• Similar summaries to Approach:
http://librairy.linkeddata.es/resources/domains/group1/parts/83f2b9722953034d7b6b50cbead4ec6b/relations?type=similarity&resourceType=part&size=5
External Representativeness
Results: Size of Summaries
The approach, background and outcome content of a paper generate more accurate topic distributions than those created from other kinds of summary, such as the abstract.
Since LDA treats documents as bags of words, text length affects the accuracy of the topic distributions inferred by the model.
Relative size of summaries with respect to the full paper
Absolute size of summaries (in number of characters)
Results: Internal Representativeness
• The Internal Representativeness of a summary measures the similarity of
this summary against the original full-text research paper
• This similarity is based on the JSD between the topic distribution of each
of them
• Results suggest that the topic distribution of the text created from the approach content is the most similar to that of the full content of the paper
internal-representativeness
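As a rough illustration of the measure described above, the sketch below scores each summary by a JSD-based similarity to the full paper and ranks the summaries. The similarity is taken as 1 minus the base-2 JSD, which is an assumption, and the topic distributions are made-up toy values, not results from the study.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon Divergence (base 2, so the value lies in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def internal_representativeness(full_paper, summary):
    """Assumed similarity: 1 - JSD between the two topic distributions."""
    return 1.0 - jsd(full_paper, summary)


# Hypothetical topic distributions over three topics.
full_paper = [0.60, 0.30, 0.10]
summaries = {
    "abstract": [0.20, 0.20, 0.60],
    "approach": [0.55, 0.35, 0.10],
}

scores = {
    name: internal_representativeness(full_paper, dist)
    for name, dist in summaries.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

The ranking here simply reflects the toy numbers chosen; with real data the vectors would come from the topic model's inference over each part.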
Results: External Representativeness
• The External Representativeness of a summary measures how different the set of related documents obtained from it is from the set derived from the original text
• Similarity thresholds from 0.5 to 0.99 were considered in experiments
precision recall
• In terms of recall, the upward trend followed by the approach, outcome and background content supports the assumption that summaries containing key words allow more similar papers to be discovered than others
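One way to read the external measure is as a set comparison: the papers related to the full text at a given similarity threshold are the reference set, and the papers related to the summary at the same threshold are the retrieved set. The sketch below computes precision, recall and f-measure under that reading. The use of `>=` for the threshold, the function names and all similarity values are assumptions for illustration.

```python
def related(similarities, threshold):
    """Documents whose similarity score reaches the threshold."""
    return {doc for doc, score in similarities.items() if score >= threshold}


def external_representativeness(full_sims, summary_sims, threshold):
    """Precision, recall and f-measure of the summary's related set
    measured against the full paper's related set."""
    reference = related(full_sims, threshold)
    retrieved = related(summary_sims, threshold)
    hits = len(reference & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical similarity scores of other papers to the full text
# and to one of its summaries.
full_sims = {"p1": 0.95, "p2": 0.80, "p3": 0.60, "p4": 0.30}
summary_sims = {"p1": 0.90, "p2": 0.55, "p3": 0.70, "p5": 0.85}

for threshold in (0.5, 0.75):
    print(threshold, external_representativeness(full_sims, summary_sims, threshold))
```

Sweeping the threshold over a range such as the 0.5 to 0.99 interval mentioned above gives one precision, recall and f-measure curve per kind of summary.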
Results: External Representativeness
f-measure
• For higher similarity thresholds, i.e. for strongly related papers, the
recommendations discovered by using the approach are more precise
than those discovered by using the abstract.
Conclusions
• We have studied the Topic-based similarities among scientific documents
based on their abstract sections with respect to summaries
corresponding to their scientific discourse categories.
• Two novel measures have been proposed: (1) internal-
representativeness and (2) external-representativeness.
• Results show that summaries created from the approach, outcome or
background content of a paper describe more accurately its full-content in
terms of overall ideas and related documents than abstracts.
• To avoid the size of the summaries influencing the accuracy of the results, in future work we plan to use probabilistic topic model algorithms designed for short texts, such as BTM, to describe texts.
An initial analysis of topic-based similarity among scientific documents based on their rhetorical discourse parts

  • 1. An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts ocorcho@fi.upm.es @ocorcho ISWC’17 oeg-upm.net Carlos Badenes-Olmedo Jose Luis Redondo-Garcia Oscar Corcho Ontology Engineering Group Universidad Politécnica de Madrid Spain
  • 2. An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Motivation 2 How representative is an abstract? Scientific Research Practitioners Reviewers
  • 3. An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Motivation 3 How representative are summaries based on scientific discourse categories? Scientific Research Practitioners Reviewers approach challenge background outcome future work
  • 4. An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Representativeness 4 Full-Paper Summary Internal External finding related items describing main ideas
  • 5. An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Probabilistic Topic Models 5 • Each document is a mixture of corpus-wide topics • Each topic is a distribution over words • Each word is drawn from one of those topics Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, Latent Dirichlet Allocation (LDA)
  • 6. An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Representativeness Measure 6 Internal External precision / recall / f-measure JSD-based similarity [d1,d2,d3,..dn] [s1,s2,s3,..sn] [h1,h2,..hn] [j1,j2,..jn] [j1,j2,..jn] [k1,k2,..kn] [m1,m2,..mn] Full-Paper Summary JSD-based similarity JSD-based similarity • Feature vectors in Topic Models are topic distributions expressed as vectors of probabi • The similarity measure used in our analysis is based on the Jensen Shannon-Divergen
  • 7. An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Evaluation 7 Advances in Space Research Procedia Chemistry Journal of Pharmaceutical Analysis Journal of Web Semantics Elsevier API 1000 papers ( + abstracts) Topic Model discover rhetorical parts training (only full-papers) inference 1000 papers ( + abstracts, + discourse parts) network of related papers ( + abstracts + discourse parts)
  • 8. An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Evaluation 8 Advances in Space Research Corpus Procedia Chemistry Corpus Journal of Pharmaceutical Analysis Corpus Journal of Web Semantics Corpus • http://librairy.linkeddata.es/resources/domains/aisr Test Corpus • http://librairy.linkeddata.es/resources/domains/pc • http://librairy.linkeddata.es/resources/domains/jopa • http://librairy.linkeddata.es/resources/domains/jows • http://librairy.linkeddata.es/resources/domains/group1 • Topics in a Corpus: http://librairy.linkeddata.es/resources/domains/group1/topics?words=10 • Papers in a Corpus: http://librairy.linkeddata.es/resources/domains/group1/items?size=10 Explore a Corpus
  • 9. An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Evaluation 9 Full-Paper • Info: http://librairy.linkeddata.es/resources/items/2-s2.0-84924147106?content=true • Parts: http://librairy.linkeddata.es/resources/items/2-s2.0-84924147106/parts • abstract: http://librairy.linkeddata.es/resources/parts/adfe85d9634654e4cfd7148be7cd2b29?content=true • approach: http://librairy.linkeddata.es/resources/parts/83f2b9722953034d7b6b50cbead4ec6b?content=true • outcome: http://librairy.linkeddata.es/resources/parts/61452a5ec420c8926160ae748c12a826?content=true • challenge: http://librairy.linkeddata.es/resources/parts/8858ef323fc09efbdcd46b9de45f146c?content=true • background: http://librairy.linkeddata.es/resources/parts/d118ef60d5e874d69d92c6b07be68b61?content=true • future-work: http://librairy.linkeddata.es/resources/parts/92be5400df5bb331e5f7f692e6b05bca?content=true • Topic Distribution of Full-Paper: http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/topics?words=15 • Topic Distribution of Abstract: http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/topics?words=15 • Similarity between Full-Paper and Abstract: http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0- 84924147106/relations?type=similarity&relatedId=adfe85d9634654e4cfd7148be7cd2b29 • Similarity between Full-Paper and Approach content: http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0- 84924147106/relations?type=similarity&relatedId=83f2b9722953034d7b6b50cbead4ec6b Internal Representativeness
  • 10. Evaluation: External Representativeness
      • Similar papers to Full-Paper: http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&resourceType=item&size=5
      • Similar papers to Abstract: http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/relations?type=similarity&resourceType=item&size=5
      • Similar papers to Approach content: http://librairy.linkeddata.es/resources/domains/group1/parts/83f2b9722953034d7b6b50cbead4ec6b/relations?type=similarity&resourceType=item&size=5
      • Similar summaries to a Full-Paper: http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&resourceType=part&size=5
      • Similar summaries to an Abstract: http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/relations?type=similarity&resourceType=part&size=5
      • Similar summaries to Approach: http://librairy.linkeddata.es/resources/domains/group1/parts/83f2b9722953034d7b6b50cbead4ec6b/relations?type=similarity&resourceType=part&size=5
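The similarity links on the evaluation slides share one URL pattern. The sketch below only rebuilds those strings from their parts; it makes no network request and assumes nothing about the service beyond the paths and query parameters (`type`, `relatedId`, `resourceType`, `size`) visible in the slides. The function name `relations_url` and the constant `BASE` are hypothetical helpers introduced for illustration.

```python
from urllib.parse import urlencode

# Base path copied from the URLs shown in the slides.
BASE = "http://librairy.linkeddata.es/resources"


def relations_url(domain, resource_kind, resource_id, **params):
    """Build a similarity-relations URL following the pattern in the slides.

    resource_kind is "items" (full papers) or "parts" (summaries).
    Extra keyword arguments become query parameters after type=similarity.
    """
    query = urlencode({"type": "similarity", **params})
    return f"{BASE}/domains/{domain}/{resource_kind}/{resource_id}/relations?{query}"


# Similarity between a full paper and its abstract (internal view).
internal = relations_url(
    "group1", "items", "2-s2.0-84924147106",
    relatedId="adfe85d9634654e4cfd7148be7cd2b29",
)

# Five papers most similar to the abstract (external view).
external = relations_url(
    "group1", "parts", "adfe85d9634654e4cfd7148be7cd2b29",
    resourceType="item", size=5,
)
```

Switching `resourceType` between `item` and `part` selects similar papers or similar summaries, as in the six links listed on the slide.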
  • 11. Results: Size of Summaries
      • The approach, background and outcome content of a paper generate more accurate topic distributions than those created from other summaries, such as the abstract.
      • Since LDA treats documents as bags of words, text length affects the accuracy of the topic distributions inferred by the model.
      [Figure: relative size of summaries with respect to the full paper]
      [Figure: absolute size of summaries (in number of characters)]
  • 12. Results: Internal Representativeness
      • The Internal Representativeness of a summary measures the similarity of the summary to the original full-text research paper.
      • This similarity is based on the JSD between the topic distributions of the two texts.
      • Results suggest that the topic distribution of the text created from the approach content is the most similar to that of the full content of the paper.
      [Figure: internal-representativeness]
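A minimal sketch of a JSD-based similarity between two topic distributions follows. The slides do not spell out the exact formula, so this assumes similarity = 1 − JSD with base-2 logarithms (which bounds JSD to [0, 1]); the topic vectors are made-up illustrations, not values from the corpus.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two probability vectors."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # Mixture distribution; m_i > 0 whenever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def jsd_similarity(p, q):
    """Assumed similarity: 1 - JSD (1 for identical, 0 for disjoint)."""
    return 1.0 - jsd(p, q)


# Hypothetical topic distributions over three topics.
full_paper = [0.7, 0.2, 0.1]
abstract = [0.5, 0.3, 0.2]
approach = [0.65, 0.25, 0.1]

print("abstract:", jsd_similarity(full_paper, abstract))
print("approach:", jsd_similarity(full_paper, approach))
```

Under this assumption, a summary whose topic distribution matches the full paper exactly scores 1, and one that shares no topic mass with it scores 0.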
  • 13. Results: External Representativeness
      • The External Representativeness of a summary measures how much the set of related documents obtained from the summary differs from the set derived from the original text.
      • Similarity thresholds from 0.5 to 0.99 were considered in the experiments.
      [Figures: precision, recall]
      • In terms of recall, the upward trend followed by the approach, outcome and background content supports the assumption that summaries containing key words allow more similar papers to be discovered than other summaries.
  • 14. Results: External Representativeness
      [Figure: f-measure]
      • For higher similarity thresholds, i.e., for strongly related papers, the recommendations discovered by using the approach are more precise than those discovered by using the abstract.
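The precision, recall and f-measure behind the external representativeness can be sketched as a set comparison at a given similarity threshold. This is an illustration under stated assumptions: the documents related to the full paper are treated as the reference set, a document counts as related when its similarity is greater than or equal to the threshold, and all ids and scores below are invented.

```python
def related(similarities, threshold):
    """Ids of documents whose similarity reaches the threshold (>= is an assumption)."""
    return {doc_id for doc_id, score in similarities.items() if score >= threshold}


def external_representativeness(full_paper_sims, summary_sims, threshold):
    """Return (precision, recall, f_measure) of the summary's related set
    measured against the full paper's related set."""
    relevant = related(full_paper_sims, threshold)
    retrieved = related(summary_sims, threshold)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical similarity scores of other papers to the full paper and to a summary.
full_paper_sims = {"a": 0.9, "b": 0.8, "c": 0.6, "d": 0.4}
summary_sims = {"a": 0.95, "b": 0.55, "c": 0.7, "e": 0.8}

for threshold in (0.5, 0.75):
    print(threshold, external_representativeness(full_paper_sims, summary_sims, threshold))
```

Sweeping the threshold over the range used in the experiments (0.5 to 0.99) would give one precision, recall and f-measure value per threshold for each kind of summary.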
  • 15. Conclusions
      • We have studied the topic-based similarities among scientific documents, comparing their abstracts with summaries built from their scientific discourse categories.
      • Two novel measures have been proposed: (1) internal-representativeness and (2) external-representativeness.
      • Results show that summaries created from the approach, outcome or background content of a paper describe its full content, in terms of overall ideas and related documents, more accurately than abstracts.
      • To prevent the size of the summaries from influencing the accuracy of the results, in future work we plan to describe texts with probabilistic topic model algorithms designed for short texts, such as BTM.
  • 16. Carlos Badenes-Olmedo, Jose Luis Redondo-Garcia, Oscar Corcho. Ontology Engineering Group, Universidad Politécnica de Madrid, Spain. ocorcho@fi.upm.es, @ocorcho, ISWC’17, oeg-upm.net