Jennifer D’Souza and Sören Auer
http://orkg.org | @orkg_org
Technische Informationsbibliothek (TIB)
Welfengarten 1B // 30167 Hannover
Pattern-based Acquisition of Scientific Entities
from Scholarly Article Titles
● Given scholarly article text, which may include one or more of the title, the abstract,
or the full text, the task is to extract meaningful entities that are semantically valid
scientific terms and to type them with relevant semantic concepts.
○ The terms may directly pertain to the actual work proposed in a paper or may be a reference
to other research that contributed to the paper idea.
● Example
○ Exploiting Headword Dependency and Predictive Clustering for Language Modeling [1]
References
1. Gupta, Sonal, and Christopher D. Manning. "Analyzing the dynamics of research by extracting key aspects of scientific papers." Proceedings of 5th international joint conference
on natural language processing. 2011.
Scientific Entity Extraction
Annotation labels on the example title: Technique, Technique, Focus and Domain
● Challenging because
○ no standardized set of concept types is available yet, even for Computer Science
References
1. Gupta, Sonal, and Christopher D. Manning. "Analyzing the dynamics of research by extracting key aspects of scientific papers." Proceedings of 5th international joint conference
on natural language processing. 2011.
2. D’Souza, Jennifer, et al. "The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources."
Proceedings of the 12th Language Resources and Evaluation Conference. 2020.
3. Kabongo, Salomon, Jennifer D'Souza, and Sören Auer. "Automated Mining of Leaderboards for Empirical AI Research." arXiv preprint arXiv:2109.13089 (2021).
Corpus name | Domains | Coverage | Semantic concepts
FTD [1] | Computational Linguistics | titles, abstracts | focus, domain, technique
STEM-ECR [2] | 10 STEM disciplines | abstracts | data, material, method, process
ORKG-TDM [3] | AI | titles, abstracts, full text | task, dataset, metric
● We seek to overcome this limitation by formulating a very precise extraction
objective which in turn limits the semantic possibilities of concept types.
Scientific Entity Extraction: Our Work
● A rule-based approach for the automatic extraction of salient scientific entities
from Computational Linguistics (CL) scholarly article titles.
○ Salient: Those entities that constitute the original contribution of a work
● Motivation
○ Align the extraction objective with existing digital libraries like the Open Research
Knowledge Graph (ORKG) where our system can be integrated.
■ The ORKG is a digital library for machine-actionable knowledge about scholarly
contributions communicated in scholarly articles. https://www.orkg.org/orkg/
■ Benefit: With intelligent analytics over KGs, researchers can easily track research
progress without the cognitive overhead that reading dozens of articles imposes.
A typical dilemma in building such a KG is deciding what type of information to represent. In
other words, what are the candidate information constituents for a KG that reflects the overview?
● Why titles?
○ They are succinct formulations of salient aspects of the contribution of a research work
● Why a rule-based approach?
○ Titles are written with lexico-syntactic pattern regularities that are generalizable as a set of
extraction heuristics.
○ It is lightweight and works out-of-the-box without the need for sophisticated computational
resources while being nonetheless effective in satisfying the extraction objective.
■ Note that supervised machine learning models in the present age of neural models rely on
sophisticated computational hardware in terms of GPUs, RAM, etc.
● To the best of our knowledge, a corpus consisting only of article titles has not yet
been comprehensively explored as a resource for scholarly knowledge graph (SKG)
building. Our work thus sheds new light on constructing SKGs that
represent research overviews with a rule-based system.
Plan for the Talk
● Raw Dataset
● Pattern-based approach to Scientific Entities Extraction
● Evaluations
● We downloaded all the article titles in the ACL Anthology as the “Full Anthology
as BibTeX” file dated 1-02-2021.
○ See https://aclanthology.org/anthology.bib.gz
● From a total of 60,621 titles, the evaluation corpus comprised 50,237 titles after
eliminating duplicates and invalid titles.
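The clean-up step above can be sketched roughly as follows. This is a minimal illustration, not the authors' preprocessing code: it assumes each BibTeX entry stores its title on one line as title = "...", and it treats only empty titles as invalid because the talk does not spell out its criteria. The names TITLE_RX, normalize, and unique_titles, and the sample entries, are hypothetical.

```python
import re
from typing import List

# Assumption: the title field sits on a single line as  title = "...",
TITLE_RX = re.compile(r'^\s*title\s*=\s*"(.*)",?\s*$', re.MULTILINE)


def normalize(title: str) -> str:
    """Drop BibTeX case-protection braces and collapse whitespace."""
    return " ".join(title.replace("{", "").replace("}", "").split())


def unique_titles(bibtex_text: str) -> List[str]:
    """Return titles in file order, skipping empty ones and case-insensitive repeats."""
    seen = set()
    kept = []
    for raw in TITLE_RX.findall(bibtex_text):
        title = normalize(raw)
        key = title.lower()
        if not title or key in seen:  # hypothetical notion of "invalid": empty only
            continue
        seen.add(key)
        kept.append(title)
    return kept


# Hypothetical miniature BibTeX input built from titles quoted in the talk.
sample = '''
@inproceedings{a,
    title = "{SNOPAR}: A Grammar Testing System",
    booktitle = "Some Proceedings",
}
@inproceedings{b,
    title = "SNOPAR: A Grammar Testing System",
}
@inproceedings{c,
    title = "{GRAFON}: A Grapheme-to-Phoneme Conversion System for {D}utch",
}
'''
titles = unique_titles(sample)
```

Comparing on a lowercased key keeps the first spelling of a title while still catching repeats that differ only in brace protection or capitalization.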
Raw Dataset
● Definitions of Scientific Concept Types considered in our work
1. Research problem: The theme of the investigation. E.g., “Natural Language Inference”
2. Resource: Names of existing data and other references to utilities like the Web, Encyclopedia,
etc., used to address the research problem or used in the solution. E.g., “Using Encyclopedic
Knowledge for Automatic Topic Identification.”
3. Tool: A tool can be seen as a type of a resource and specifically software. E.g., BERT.
4. Solution: A novel contribution of a work that solves the research problem. E.g., from the title
“PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation,” the terms
“PHINC” and “A Parallel Hinglish Social Media Code-Mixed Corpus” are solutions for the problem
“Machine Translation.”
5. Language: The natural language focus of a work. E.g., Breton, Lakota, etc.
6. Method: Existing protocols used to support the solution; found by asking “How?”
Pattern-based Approach to Scientific Entities Extraction
● Formalism
Every CL title t_i can be expressed in terms of one or more of the following six elements, te_i =
<rp_i, res_i, tool_i, lang_i, sol_i, meth_i>, representing the research problem, resource, tool,
language, solution, and method concepts, respectively. A title can contain terms for zero
or more of any of the concepts.
The goal of CL-Titles-Parser is, for every title t_i, to annotate its title expression te_i,
which involves scientific term extraction and term concept typing.
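One way to picture a title expression is as a record with six optional slots. The sketch below is only an illustration of the formalism; the class name TitleExpression and its field names are assumptions and are not taken from the CL-Titles-Parser code. The example instance is the PHINC title from the concept definitions, typed by hand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TitleExpression:
    """Six concept slots of a title expression; any slot may stay empty."""
    research_problem: List[str] = field(default_factory=list)
    resource: List[str] = field(default_factory=list)
    tool: List[str] = field(default_factory=list)
    language: List[str] = field(default_factory=list)
    solution: List[str] = field(default_factory=list)
    method: List[str] = field(default_factory=list)


# "PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation"
phinc = TitleExpression(
    solution=["PHINC", "A Parallel Hinglish Social Media Code-Mixed Corpus"],
    research_problem=["Machine Translation"],
)
```

Lists are used for every slot because a title can contribute zero or more terms per concept.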
● CL-Titles-Parser operates in a two-step workflow.
○ First, it aggregates titles as eight main template types with a default ninth category
for titles that could not be clustered by any of the eight templates.
○ Second, within each group, heuristics are applied to phrase-chunk and concept-type
the titles based on group-specific lexico-syntactic patterns.
○ Step 1: Titles are clustered based on commonly shared lexico-syntactic patterns.
Template “hasSpecialCaseWord()”
applies to titles written in two parts separated by a colon: a one-word solution name and an
elaboration of the solution name.
E.g., “SNOPAR: A Grammar Testing System” consists of the one-word solution name “SNOPAR”
and its elaboration “A Grammar Testing System.”
There are other instances of titles belonging to this template type that are complex
sentences, i.e. titles with additional prepositional or verb phrases, where mentions of the
research problem, tool, method, language domain etc. are also included in the latter part of
the title.
E.g., “GRAFON: A Grapheme-to-Phoneme Conversion System for Dutch” is a complex title
with a prepositional phrase triggered by “for” specifying the language domain “Dutch.”
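A template test of this kind could look roughly like the sketch below. It is not the parser's actual hasSpecialCaseWord() implementation; it assumes that a “one-word solution name” is a single whitespace-free token before the first colon, and the names SPECIAL_CASE_RX, has_special_case_word, and split_special_case are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption: one whitespace-free token, a colon, then the elaboration.
SPECIAL_CASE_RX = re.compile(r"^(?P<name>\S+)\s*:\s+(?P<elaboration>.+)$")


def has_special_case_word(title: str) -> bool:
    """True if the title looks like '<OneWordName>: <elaboration>'."""
    return SPECIAL_CASE_RX.match(title.strip()) is not None


def split_special_case(title: str) -> Optional[Tuple[str, str]]:
    """Return (solution name, elaboration), or None if the template does not apply."""
    m = SPECIAL_CASE_RX.match(title.strip())
    return (m.group("name"), m.group("elaboration")) if m else None


simple = split_special_case("SNOPAR: A Grammar Testing System")
complex_title = split_special_case(
    "GRAFON: A Grapheme-to-Phoneme Conversion System for Dutch"
)
```

For the complex GRAFON title, the elaboration still contains the prepositional phrase “for Dutch”, which is left for the later typing rules to handle.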
○ Step 2: Precedence-ordered scientific term extraction and typing rules are applied.
Works in two steps
1. Determining the “connector” positions within a title. The title is chunked at these
“connector” positions.
Our connectors are a collection of 11 prepositions and 1 verb, defined as:
connectors_rx = (to|of|on|for|from|with|by|via|through|using|in|as)
2. Based on the number of connectors, the title is routed through a precedence-ordered
workflow of heuristics for concept typing.
E.g., if a title has one connector, it first enters the OneConnectorHeu() branch.
There, the first step is to determine which connector is in the phrase. Then, based on the
connector, a separate set of concept-typing precedence rules applies. E.g., if the connector is
“from”, the title subphrases are typed based on the following pattern: solution from resource.
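The two steps of Step 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the parser's code: connectors are matched only as whitespace-delimited words (so the “to” inside “Grapheme-to-Phoneme” is not a connector), only the “from” rule described above is implemented, and the title “Lexicon Induction from Wikipedia” is a made-up example. The names chunk_at_connectors and one_connector_heu are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# The connector list from the talk; lookarounds require whitespace (or the
# string boundary) on both sides, which is an assumption of this sketch.
CONNECTORS_RX = re.compile(
    r"(?<!\S)(to|of|on|for|from|with|by|via|through|using|in|as)(?!\S)",
    re.IGNORECASE,
)


def chunk_at_connectors(title: str) -> Tuple[List[str], List[str]]:
    """Split a title at connector positions; return (phrases, connectors)."""
    parts = CONNECTORS_RX.split(title)
    phrases = [p.strip() for p in parts[0::2]]     # text between connectors
    connectors = [c.lower() for c in parts[1::2]]  # the captured connectors
    return phrases, connectors


def one_connector_heu(title: str) -> Dict[str, str]:
    """Type the two subphrases of a title that has exactly one connector."""
    phrases, connectors = chunk_at_connectors(title)
    if len(connectors) != 1:
        raise ValueError("expected exactly one connector")
    left, right = phrases
    if connectors[0] == "from":
        # Pattern from the talk: solution from resource
        return {"solution": left, "resource": right}
    return {}  # rules for the other connectors are not sketched here


chunks = chunk_at_connectors("A Grapheme-to-Phoneme Conversion System for Dutch")
typed = one_connector_heu("Lexicon Induction from Wikipedia")  # hypothetical title
```

Because the regex has one capturing group, re.split returns phrases and connectors interleaved, which is why the even and odd positions are separated afterwards.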
● When applied to the 50,237 titles, CL-Titles-Parser extracted 19,799 research problem,
18,111 solution, 20,033 resource, 1,059 language, 6,878 tool, and 21,687
method terms. These scientific concept lists were then evaluated for extraction
precision.
● Precision = total correctly extracted concepts / total extracted concepts
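As a small worked instance of the precision formula, the sketch below scores a list of extracted terms against a set of terms judged correct. The function name extraction_precision and the toy terms are hypothetical and are not taken from the evaluation data.

```python
from typing import Iterable, Set


def extraction_precision(extracted: Iterable[str], correct: Set[str]) -> float:
    """Precision = total correctly extracted concepts / total extracted concepts."""
    extracted = list(extracted)
    if not extracted:
        raise ValueError("nothing was extracted")
    n_correct = sum(1 for term in extracted if term in correct)
    return n_correct / len(extracted)


# Hypothetical toy data: 3 of the 4 extracted terms are judged correct.
extracted_terms = ["machine translation", "parsing", "a case study", "language modeling"]
judged_correct = {"machine translation", "parsing", "language modeling"}
score = extraction_precision(extracted_terms, judged_correct)
print(f"{score:.2%}")  # prints 75.00%
```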
Evaluations: Experimental Setup
Evaluations: Extraction Precision Results
Concept Type Precision
language 95.12%
resource 86.96%
tool 83.40%
solution 80.77%
method 77.29%
research problem 58.09%
The extraction heuristics for language were the most precise. They rely on a regex list of
languages, so recall is in a sense limited by that list.
This is characteristic of rule-based systems, though our list is quite large and covers
various obscure languages. A zero-shot machine learning approach would be
an alternative worth experimenting with.
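The list-bound behaviour can be illustrated with a deliberately tiny sketch. The three-language list below is hypothetical (the names come from examples in this talk) and is far smaller than the parser's actual list; LANGUAGE_RX and find_languages are assumed names.

```python
import re
from typing import List

# Hypothetical miniature language list; the real list is much larger.
LANGUAGES = ["Breton", "Lakota", "Dutch"]
LANGUAGE_RX = re.compile(r"\b(" + "|".join(map(re.escape, LANGUAGES)) + r")\b")


def find_languages(title: str) -> List[str]:
    """Return every listed language name that occurs in the title."""
    return LANGUAGE_RX.findall(title)


hit = find_languages("GRAFON: A Grapheme-to-Phoneme Conversion System for Dutch")
miss = find_languages(
    "PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation"
)
```

The second title yields nothing because “Hinglish” is absent from this toy list: a match is almost always right, which favours precision, but anything outside the list is missed.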
The gold-standard list curation was biased toward already familiar research problems or
their derivations. Thus we estimate that at least 20% of the terms were pruned in the gold data
because they were relatively new rather than incorrect.
● A qualitative analysis of the terms extracted from titles written in the 20th vs. the 21st
centuries was performed, and the most frequently used entities were
indeed indicative of the times.
○ E.g., social media channels like Twitter, the web, or online encyclopedias like
Wikipedia are predominant resources in the 21st century. This is in contrast to text,
discourse, dialogues, and parse trees leveraged as resources in the 20th century.
● We proposed an incremental step toward the larger goal of generating
contributions-focused SKGs.
○ The absence of inter-annotator agreement scores to determine the reliability with
which the concepts can be selected will also be addressed in future work.
● Our code is publicly available on GitHub: https://github.com/jd-coderepos/cl-titles-parser/
Conclusion: Takeaways
Happy to take questions
Thank you for your attention!
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 

Pattern-based Acquisition of Scientific Entities from Scholarly Article Titles @ ICADL 2021

  • 1. Pattern-based Acquisition of Scientific Entities from Scholarly Article Titles
      Jennifer D’Souza and Sören Auer
      Technische Informationsbibliothek (TIB), Welfengarten 1B // 30167 Hannover
      http://orkg.org | @orkg_org
  • 2. Scientific Entity Extraction
      ● Task: given scholarly article text, which may include one or more of the title, abstract, or full text, extract meaningful entities that are semantically valid scientific terms and type them with relevant semantic concepts.
        ○ The terms may pertain directly to the work proposed in a paper or may refer to other research that contributed to the paper's idea.
      ● Example
        ○ Exploiting Headword Dependency and Predictive Clustering for Language Modeling [1]
        ○ Annotation labels shown on the slide: Technique, Technique, Focus and Domain
      References
      1. Gupta, Sonal, and Christopher D. Manning. "Analyzing the dynamics of research by extracting key aspects of scientific papers." Proceedings of 5th international joint conference on natural language processing. 2011.
  • 3. Scientific Entity Extraction
      ● Challenging because
        ○ no standardized set of concept types is available yet, even for Computer Science

      Corpus name     Domains                     Coverage                       Semantic concepts
      FTD [1]         Computational Linguistics   titles, abstracts              focus, domain, technique
      STEM-ECR [2]    10 STEM disciplines         abstracts                      data, material, method, process
      ORKG-TDM [3]    AI                          titles, abstracts, full text   task, dataset, metric

      References
      1. Gupta, Sonal, and Christopher D. Manning. "Analyzing the dynamics of research by extracting key aspects of scientific papers." Proceedings of 5th international joint conference on natural language processing. 2011.
      2. D’Souza, Jennifer, et al. "The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources." Proceedings of the 12th Language Resources and Evaluation Conference. 2020.
      3. Kabongo, Salomon, Jennifer D'Souza, and Sören Auer. "Automated Mining of Leaderboards for Empirical AI Research." arXiv preprint arXiv:2109.13089 (2021).
  • 4. Scientific Entity Extraction: Our Work
      ● We seek to overcome this limitation by formulating a very precise extraction objective, which in turn limits the semantic possibilities of concept types.
  • 5. Scientific Entity Extraction: Our Work
      ● A rule-based approach for the automatic extraction of salient scientific entities from Computational Linguistics (CL) scholarly article titles.
        ○ Salient: those entities that constitute the original contribution of a work
      ● Motivation
        ○ Align the extraction objective with existing digital libraries like the Open Research Knowledge Graph (ORKG), where our system can be integrated.
          ■ The ORKG is a digital library for machine-actionable knowledge about scholarly contributions communicated in scholarly articles. https://www.orkg.org/orkg/
          ■ Benefit: with intelligent analytics over KGs, researchers can easily track research progress without the cognitive overhead that reading dozens of articles imposes. A typical dilemma in building such a KG is deciding the type of information to be represented. In other words, what would be the candidate information constituents for a KG that reflects the research overview?
  • 6. Scientific Entity Extraction: Our Work
      ● Why titles?
        ○ They are succinct formulations of the salient aspects of the contribution of a research work.
  • 7. Scientific Entity Extraction: Our Work
      ● Why a rule-based approach?
        ○ Titles are written with lexico-syntactic pattern regularities that are generalizable as a set of extraction heuristics.
        ○ It is lightweight and works out of the box without the need for sophisticated computational resources, while still being effective in satisfying the extraction objective.
          ■ Note that supervised machine learning models in the present age of neural models rely on sophisticated computational hardware in terms of GPUs, RAM, etc.
  • 8. Scientific Entity Extraction: Our Work
      ● To the best of our knowledge, a corpus consisting only of article titles has not yet been comprehensively explored as a resource for scholarly knowledge graph (SKG) building. Thus, our work sheds a unique and novel light on constructing SKGs that represent research overviews with a rule-based system.
  • 9. Plan for the Talk
      ● Raw Dataset
      ● Pattern-based approach to Scientific Entities Extraction
      ● Evaluations
  • 11. Raw Dataset
      ● We downloaded all the article titles in the ACL Anthology as the "Full Anthology as BibTeX" file dated 1-02-2021.
        ○ See https://aclanthology.org/anthology.bib.gz
      ● From a total of 60,621 titles, the evaluation corpus comprised 50,237 titles after eliminating duplicates and invalid titles.
  • 14. Pattern-based Approach to Scientific Entities Extraction
      ● Definitions of the scientific concept types considered in our work
        1. Research problem: The theme of the investigation. E.g., “Natural Language Inference.”
        2. Resource: Names of existing data and other references to utilities like the Web, Encyclopedia, etc., used to address the research problem or used in the solution. E.g., “Using Encyclopedic Knowledge for Automatic Topic Identification.”
        3. Tool: A tool can be seen as a type of resource, specifically software. E.g., BERT.
        4. Solution: A novel contribution of a work that solves the research problem. E.g., from the title “PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation,” the terms “PHINC” and “A Parallel Hinglish Social Media Code-Mixed Corpus” are solutions for the problem “Machine Translation.”
        5. Language: The natural language focus of a work. E.g., Breton, Lakota, etc.
        6. Method: Existing protocols used to support the solution; found by asking “How?”
  • 15. Pattern-based Approach to Scientific Entities Extraction
      ● Formalism
        Every CL title $t_i$ can be expressed in terms of one or more of the following six elements: $te_i = \langle rp_i, res_i, tool_i, lang_i, sol_i, meth_i \rangle$, representing the research problem, resource, tool, language, solution, and method concepts, respectively. A title can contain terms for zero or more of any of the concepts. The goal of CL-Titles-Parser is, for every title $t_i$, to annotate its title expression $te_i$, which involves scientific term extraction and term concept typing.
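The title expression above can be pictured as a small record with one slot per concept type. The following is only an illustrative sketch: the class name `TitleExpression` and its field names are hypothetical and are not taken from the CL-Titles-Parser code. Each slot is a list, because a title may contain zero or more terms for any concept. The example values come from the PHINC title discussed on slide 14.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TitleExpression:
    """Hypothetical container for te_i: one list of terms per concept type."""
    research_problem: List[str] = field(default_factory=list)
    resource: List[str] = field(default_factory=list)
    tool: List[str] = field(default_factory=list)
    language: List[str] = field(default_factory=list)
    solution: List[str] = field(default_factory=list)
    method: List[str] = field(default_factory=list)


# Annotation of "PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus
# for Machine Translation" as described on slide 14.
phinc = TitleExpression(
    research_problem=["Machine Translation"],
    solution=["PHINC", "A Parallel Hinglish Social Media Code-Mixed Corpus"],
)
```

Concepts that do not occur in a title simply stay as empty lists.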
  • 16. Pattern-based Approach to Scientific Entities Extraction
      ● CL-Titles-Parser operates in a two-step workflow.
        ○ First, it aggregates titles into eight main template types, with a default ninth category for titles that could not be clustered by any of the eight templates.
        ○ Second, within each group, heuristics are applied to phrase-chunk and concept-type the titles based on group-specific lexico-syntactic patterns.
  • 17. Pattern-based Approach to Scientific Entities Extraction
      ● CL-Titles-Parser operates in a two-step workflow.
        ○ Step 1: Titles are clustered based on commonly shared lexico-syntactic patterns.
          Template “hasSpecialCaseWord()” applies to titles written in two parts separated by a colon: a one-word solution name, followed by an elaboration of the solution name. E.g., “SNOPAR: A Grammar Testing System” consists of the one-word solution name “SNOPAR” and its elaboration “A Grammar Testing System.”
  • 18. Pattern-based Approach to Scientific Entities Extraction
      ● Step 1, continued: template “hasSpecialCaseWord()”
        Other titles belonging to this template type are complex, i.e. they contain additional prepositional or verb phrases, and the latter part of the title also mentions the research problem, tool, method, language domain, etc. E.g., “GRAFON: A Grapheme-to-Phoneme Conversion System for Dutch” is a complex title with a prepositional phrase triggered by “for” that specifies the language domain “Dutch.”
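As a rough illustration of the “hasSpecialCaseWord()” template, a title can be tested for the shape "one-word name, colon, elaboration" with a single regular expression. This is a minimal sketch under assumptions: the pattern and the function name `split_special_case_title` are hypothetical, and the actual template check in CL-Titles-Parser may apply additional conditions.

```python
import re
from typing import Optional, Tuple

# Assumed shape: one whitespace-free token, a colon, then the elaboration.
SPECIAL_CASE_RX = re.compile(r"^(?P<name>\S+)\s*:\s*(?P<elaboration>.+)$")


def split_special_case_title(title: str) -> Optional[Tuple[str, str]]:
    """Return (solution name, elaboration) if the title fits the template, else None."""
    match = SPECIAL_CASE_RX.match(title.strip())
    if match is None:
        return None
    return match.group("name"), match.group("elaboration")


simple = split_special_case_title("SNOPAR: A Grammar Testing System")
complex_title = split_special_case_title(
    "GRAFON: A Grapheme-to-Phoneme Conversion System for Dutch"
)
no_match = split_special_case_title(
    "Using Encyclopedic Knowledge for Automatic Topic Identification"
)
```

For complex titles such as the GRAFON example, the elaboration would still need to be chunked further to pull out the language domain.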
  • 19. Pattern-based Approach to Scientific Entities Extraction
      ● CL-Titles-Parser operates in a two-step workflow.
        ○ Step 2: Precedence-ordered scientific term extraction and typing rules are applied. This works in two steps.
          1. Determining the “connector” positions within a title. The title is chunked at these “connector” positions. Our connectors are a collection of 11 prepositions and 1 verb defined as:
             connectors_rx = (to|of|on|for|from|with|by|via|through|using|in|as)
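The chunking at connector positions can be sketched with Python's `re.split`. The connector list is the one given on the slide; the word-boundary anchors, the case-insensitive flag, and the function name `chunk_at_connectors` are assumptions made for this illustration and are not confirmed by the slides.

```python
import re
from typing import List, Tuple

# Connector list from the slide; \b anchors and IGNORECASE are assumptions.
CONNECTORS_RX = re.compile(
    r"\b(to|of|on|for|from|with|by|via|through|using|in|as)\b", re.IGNORECASE
)


def chunk_at_connectors(title: str) -> Tuple[List[str], List[str]]:
    """Split a title at connectors; return (phrase chunks, connectors found)."""
    # With one capturing group, re.split returns [chunk, connector, chunk, ...].
    parts = CONNECTORS_RX.split(title)
    chunks = [part.strip() for part in parts[0::2]]
    connectors = [part.lower() for part in parts[1::2]]
    return chunks, connectors


chunks, connectors = chunk_at_connectors(
    "Exploiting Headword Dependency and Predictive Clustering for Language Modeling"
)
```

One caveat of this simple sketch: a hyphen counts as a word boundary, so a hyphenated term such as “Grapheme-to-Phoneme” would be split at “to” unless hyphenated tokens are protected first.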
  • 20. Pattern-based Approach to Scientific Entities Extraction
      ● Step 2, continued
          2. Based on the number of connectors, the title is processed by a precedence-ordered workflow of heuristics for concept typing. E.g., if a title has one connector, it first enters the OneConnectorHeu() branch. There, the first step is determining which connector is in the phrase. Then, depending on the connector, a separate set of concept-typing precedence rules applies. E.g., if the connector is “from,” the title subphrases are typed based on the following pattern: solution from resource.
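The one-connector branch can be sketched as a lookup from the connector to a pair of concept types. Only the "from" rule (solution from resource) is stated in the slides, so it is the only entry in the table below; the names `ONE_CONNECTOR_RULES` and `one_connector_heuristic` are hypothetical, and the example title is made up for illustration rather than taken from the evaluation corpus.

```python
import re
from typing import Dict, Optional

CONNECTORS_RX = re.compile(
    r"\b(to|of|on|for|from|with|by|via|through|using|in|as)\b", re.IGNORECASE
)

# Only the "from" rule is given in the slides; rules for other connectors
# are deliberately left out rather than guessed.
ONE_CONNECTOR_RULES = {"from": ("solution", "resource")}


def one_connector_heuristic(title: str) -> Optional[Dict[str, str]]:
    """Type the two subphrases of a title that contains exactly one connector."""
    parts = CONNECTORS_RX.split(title)
    if len(parts) != 3:  # not exactly one connector
        return None
    left, connector, right = (part.strip() for part in parts)
    rule = ONE_CONNECTOR_RULES.get(connector.lower())
    if rule is None or not left or not right:
        return None
    return {rule[0]: left, rule[1]: right}


# Made-up title used purely for illustration.
typed = one_connector_heuristic("Bilingual Lexicon Extraction from Comparable Corpora")
```

Titles with a different number of connectors, or with a connector that has no rule in the table, return `None` here and would fall through to other branches in a fuller system.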
  • 23. Evaluations: Experimental Setup
      ● When applied to the 50,237 titles, CL-Titles-Parser extracted 19,799 research problem, 18,111 solution, 20,033 resource, 1,059 language, 6,878 tool, and 21,687 method terms. These scientific concept lists were then evaluated for extraction precision.
      ● Precision = total correctly extracted concepts / total extracted concepts
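The precision formula above amounts to a simple ratio over the extracted list. The sketch below assumes that correctness is decided by membership in a curated gold list; the function name `extraction_precision` and the toy term lists are illustrative and are not data from the evaluation.

```python
from typing import Iterable


def extraction_precision(extracted: Iterable[str], gold_correct: Iterable[str]) -> float:
    """Precision = correctly extracted concepts / total extracted concepts."""
    extracted_set = set(extracted)
    if not extracted_set:
        return 0.0
    correct = extracted_set & set(gold_correct)
    return len(correct) / len(extracted_set)


# Toy lists for illustration only.
toy_extracted = ["machine translation", "parsing", "a", "natural language inference"]
toy_gold = ["machine translation", "parsing", "natural language inference"]
toy_precision = extraction_precision(toy_extracted, toy_gold)
```

In this toy case three of the four extracted terms are in the gold list, so the precision is 0.75.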
  • 24. Evaluations: Extraction Precision Results

      Concept Type       Precision
      language           95.12%
      resource           86.96%
      tool               83.40%
      solution           80.77%
      method             77.29%
      research problem   58.09%
  • 25. Evaluations: Extraction Precision Results (same precision table as slide 24)
      The extraction heuristics for language were the most precise. They rely on a regex list of languages, so recall is in a sense limited by that list, which is characteristic of rule-based systems. Our list is quite large and covers various obscure languages. A zero-shot machine learning approach would be an alternative to experiment with.
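A list-driven language matcher of the kind described above might look like the following. The three language names are ones mentioned elsewhere in the slides; the real list is described as much larger, and the names `LANGUAGES`, `LANGUAGE_RX`, and `find_languages` are hypothetical.

```python
import re
from typing import List

# Tiny stand-in for the much larger language list described on the slide.
LANGUAGES = ["Breton", "Lakota", "Dutch"]

# One alternation over the escaped names, anchored at word boundaries.
LANGUAGE_RX = re.compile(r"\b(" + "|".join(re.escape(name) for name in LANGUAGES) + r")\b")


def find_languages(title: str) -> List[str]:
    """Return every listed language name that occurs in the title."""
    return LANGUAGE_RX.findall(title)


found = find_languages("GRAFON: A Grapheme-to-Phoneme Conversion System for Dutch")
missed = find_languages("A Grammar Testing System for Frisian")
```

The second call shows the recall limitation noted on the slide: a language that is absent from the list is never extracted.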
  • 26. Evaluations: Extraction Precision Results (same precision table as slide 24)
      The gold-standard list curation was biased toward already familiar research problems or their derivations. Thus, we estimate that at least 20% of the terms were pruned from the gold data because they were relatively new, not because they were incorrect.
  • 28. Conclusion: Takeaways
      ● A qualitative analysis of the terms extracted from titles written in the 20th vs. the 21st century was performed, and the most frequently used entities were indeed indicative of the times.
        ○ E.g., social media channels like Twitter, the web, or online encyclopedias like Wikipedia are predominant resources in the 21st century. This is in contrast to text, discourse, dialogues, and parse trees leveraged as resources in the 20th century.
      ● We proposed an incremental step toward the larger goal of generating contributions-focused SKGs.
        ○ The absence of inter-annotator agreement scores to determine the reliability with which the concepts can be selected will also be addressed in future work.
      ● Our code is publicly available on GitHub: https://github.com/jd-coderepos/cl-titles-parser/
  • 29. Thank you for your attention! Happy to take questions.