We describe a rule-based approach for the automatic acquisition of salient scientific entities from Computational Linguistics (CL) scholarly article titles. Two observations motivated the approach: (i) authors note salient aspects of an article’s contribution in its title; and (ii) the salient terms follow pattern regularities that can be expressed as a set of rules. Only lexico-syntactic patterns that were easily recognizable, occurred frequently, and positionally indicated a scientific entity type were selected. The rules were developed on a collection of 50,237 CL titles covering all articles in the ACL Anthology. In total, 19,799 research problems, 18,111 solutions, 20,033 resources, 1,059 languages, 6,878 tools, and 21,687 methods were extracted at an average precision of 75%.
Pattern-based Acquisition of Scientific Entities from Scholarly Article Titles @ ICADL 2021
1. Jennifer D’Souza and Sören Auer
http://orkg.org | @orkg_org
Technische Informationsbibliothek (TIB)
Welfengarten 1B // 30167 Hannover
2. ● Task: given scholarly article text, which may include one or more of the
title, abstract, or full text, extract meaningful entities that are semantically
valid scientific terms and type them with relevant semantic concepts.
○ The terms may directly pertain to the actual work proposed in a paper or may be a reference
to other research that contributed to the paper idea.
● Example
○ Exploiting Headword Dependency and Predictive Clustering for Language Modeling [1]
References
1. Gupta, Sonal, and Christopher D. Manning. "Analyzing the dynamics of research by extracting key aspects of scientific papers." Proceedings of 5th International Joint Conference on Natural Language Processing. 2011.
Scientific Entity Extraction
[Figure: the example title annotated with the labels Technique, Technique, and Focus and Domain]
3. ● Challenging because
○ no standardized set of concept types available yet even for Computer Science
References
1. Gupta, Sonal, and Christopher D. Manning. "Analyzing the dynamics of research by extracting key aspects of scientific papers." Proceedings of 5th International Joint Conference on Natural Language Processing. 2011.
2. D’Souza, Jennifer, et al. "The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources." Proceedings of the 12th Language Resources and Evaluation Conference. 2020.
3. Kabongo, Salomon, Jennifer D'Souza, and Sören Auer. "Automated Mining of Leaderboards for Empirical AI Research." arXiv preprint arXiv:2109.13089 (2021).
Scientific Entity Extraction
Corpus name | Domains | Coverage | Semantic concepts
FTD [1] | Computational Linguistics | titles, abstracts | focus, domain, technique
STEM-ECR [2] | 10 STEM disciplines | abstracts | data, material, method, process
ORKG-TDM [3] | AI | titles, abstracts, full text | task, dataset, metric
4. ● We seek to overcome this limitation by formulating a very precise extraction
objective which in turn limits the semantic possibilities of concept types.
Scientific Entity Extraction: Our Work
5. ● A rule-based approach for the automatic extraction of salient scientific entities
from Computational Linguistics (CL) scholarly article titles.
○ Salient: Those entities that constitute the original contribution of a work
● Motivation
○ Align the extraction objective with existing digital libraries like the Open Research
Knowledge Graph (ORKG) where our system can be integrated.
■ The ORKG is a digital library for machine-actionable knowledge about scholarly
contributions communicated in scholarly articles. https://www.orkg.org/orkg/
■ Benefit: With intelligent analytics over KGs, researchers can easily track research
progress without the cognitive overhead that reading dozens of articles imposes.
A typical dilemma in building such a KG is deciding the type of information to be represented. In
other words, what would be the information constituent candidates for a KG that reflects the overview?
Scientific Entity Extraction: Our Work
6. ● Why titles?
○ They are succinct formulations of salient aspects of the contribution of a research work
Scientific Entity Extraction: Our Work
7. ● Why a rule-based approach?
○ Titles are written with lexico-syntactic pattern regularities that are generalizable as a set of
extraction heuristics.
○ It is lightweight and works out-of-the-box without the need for sophisticated computational
resources while being nonetheless effective in satisfying the extraction objective.
■ Note that supervised machine learning models, in the present age of neural models, rely on
sophisticated computational hardware such as GPUs and large amounts of RAM.
Scientific Entity Extraction: Our Work
8. ● To the best of our knowledge, a corpus of only article titles remains as yet
comprehensively unexplored as a resource for scholarly knowledge graph (SKG)
building. Thus, our work sheds a novel light on SKG construction
representing research overviews with a rule-based system.
Scientific Entity Extraction: Our Work
9. Plan for the Talk
● Raw Dataset
● Pattern-based approach to Scientific Entities Extraction
● Evaluations
11. ● We downloaded all the article titles in the ACL Anthology as the “Full Anthology
as BibTeX” file dated 1-02-2021.
○ See https://aclanthology.org/anthology.bib.gz
● From a total of 60,621 titles, the evaluation corpus comprised 50,237 titles after
eliminating duplicates and invalid titles.
Raw Dataset
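The title-cleaning step above could be sketched roughly as follows. This is only an illustration, not the authors' preprocessing: it assumes each BibTeX title sits on one line in double quotes, treats "invalid" as simply "empty", and compares duplicates case-insensitively. The names `TITLE_FIELD`, `normalize`, and `unique_titles` are hypothetical.

```python
import re

# Assumption: titles look like   title = "Some Title",   on a single line.
# Brace-delimited or multi-line titles would need a real BibTeX parser.
TITLE_FIELD = re.compile(r'^\s*title\s*=\s*"(.*)",?\s*$', re.MULTILINE)


def normalize(title: str) -> str:
    """Strip BibTeX case-protection braces and collapse runs of whitespace."""
    title = title.replace("{", "").replace("}", "")
    return " ".join(title.split())


def unique_titles(bibtex_text: str) -> list[str]:
    """Return titles in first-seen order, dropping empty and duplicate ones."""
    seen = set()
    result = []
    for raw in TITLE_FIELD.findall(bibtex_text):
        title = normalize(raw)
        key = title.lower()  # duplicates are compared case-insensitively
        if not title or key in seen:
            continue
        seen.add(key)
        result.append(title)
    return result


sample = '''
@inproceedings{a,
    title = "{SNOPAR}: A Grammar Testing System",
}
@inproceedings{b,
    title = "{SNOPAR}: A Grammar  Testing System",
}
@inproceedings{c,
    title = "",
}
'''
titles = unique_titles(sample)
print(titles)
```

In this toy input, the second entry differs from the first only in whitespace and the third is empty, so a single title survives.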
12. Plan for the Talk
● Raw Dataset
● Pattern-based approach to Scientific Entities Extraction
● Evaluations
14. ● Definitions of Scientific Concept Types considered in our work
1. Research problem: The theme of the investigation. E.g., “Natural Language Inference”
2. Resource: Names of existing data and other references to utilities like the Web, Encyclopedia,
etc., used to address the research problem or used in the solution. E.g., “Using Encyclopedic
Knowledge for Automatic Topic Identification.”
3. Tool: A type of resource, specifically software. E.g., BERT.
4. Solution: A novel contribution of a work that solves the research problem. E.g., from the title
“PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation,” the terms
“PHINC” and “A Parallel Hinglish Social Media Code-Mixed Corpus” are solutions for the problem
“Machine Translation.”
5. Language: The natural language focus of a work. E.g., Breton, Lakota, etc.
6. Method: Existing protocols used to support the solution; found by asking “How?”
Pattern-based Approach to Scientific Entities Extraction
15. ● Formalism
Every CL title $T$ can be expressed as one or more of the following six elements:
$te_i = \langle rp_i, res_i, tool_i, lang_i, sol_i, meth_i \rangle$, representing the research
problem, resource, tool, language, solution, and method concepts, respectively. A title can
contain terms for zero or more of any of the concepts.
The goal of CL-Titles-Parser is, for every title $t_i$, to annotate its title expression $te_i$,
which involves scientific term extraction and term concept typing.
Pattern-based Approach to Scientific Entities Extraction
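One way to picture the title expression is as a record with six list-valued slots. The sketch below is an assumption about representation, not the data structure used in CL-Titles-Parser; the class name `TitleExpression` and its field names are hypothetical. The example values come from the PHINC title discussed in the concept definitions.

```python
from dataclasses import dataclass, field


@dataclass
class TitleExpression:
    """Typed terms extracted from one title; every slot may stay empty."""
    research_problem: list[str] = field(default_factory=list)
    resource: list[str] = field(default_factory=list)
    tool: list[str] = field(default_factory=list)
    language: list[str] = field(default_factory=list)
    solution: list[str] = field(default_factory=list)
    method: list[str] = field(default_factory=list)

    def is_empty(self) -> bool:
        """True when no slot holds any term."""
        return not any(vars(self).values())


# "PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation"
phinc = TitleExpression(
    research_problem=["Machine Translation"],
    solution=["PHINC", "A Parallel Hinglish Social Media Code-Mixed Corpus"],
)
print(phinc)
```

Lists are used for every slot because a title can contribute zero or more terms per concept.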
16. ● CL-Titles-Parser operates in a two-step workflow.
○ First, it aggregates titles as eight main template types with a default ninth category
for titles that could not be clustered by any of the eight templates.
○ Second, within each group, heuristics are applied to phrase-chunk and concept-type
the titles based on group-specific lexico-syntactic patterns.
Pattern-based Approach to Scientific Entities Extraction
17. ● CL-Titles-Parser operates in a two-step workflow.
○ Step 1: Titles are clustered based on commonly shared lexico-syntactic patterns.
Template “hasSpecialCaseWord()”
applies to titles written in two parts separated by a colon: a one-word solution name and an
elaboration of the solution name.
E.g., “SNOPAR: A Grammar Testing System” consisting of the one word “SNOPAR” solution
name and its elaboration “A Grammar Testing System.”
Pattern-based Approach to Scientific Entities Extraction
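A template test of this kind can be approximated with a single regular expression. The sketch below is a guess at the idea behind hasSpecialCaseWord(), not its actual implementation: the pattern `SPECIAL_CASE` and the function `has_special_case_word` are hypothetical, and "one word" is assumed to mean one whitespace-free token before the colon.

```python
import re

# Assumed shape: <one token without spaces> ":" <elaboration>
SPECIAL_CASE = re.compile(r"^(?P<name>\S+)\s*:\s+(?P<elaboration>\S.*)$")


def has_special_case_word(title: str):
    """Return (solution_name, elaboration) if the title fits the template, else None."""
    match = SPECIAL_CASE.match(title.strip())
    if match is None:
        return None
    return match.group("name"), match.group("elaboration")


print(has_special_case_word("SNOPAR: A Grammar Testing System"))
print(has_special_case_word("A Grammar Testing System"))
```

Titles with a multi-word prefix before the colon do not match, so they would fall through to the other templates.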
18. ● Template “hasSpecialCaseWord()” (continued)
There are other instances of titles belonging to this template type that are complex
sentences, i.e. titles with additional prepositional or verb phrases, where mentions of the
research problem, tool, method, language domain etc. are also included in the latter part of
the title.
E.g., “GRAFON: A Grapheme-to-Phoneme Conversion System for Dutch” is a complex title
with a prepositional phrase triggered by “for” specifying the language domain “Dutch.”
Pattern-based Approach to Scientific Entities Extraction
19. ● CL-Titles-Parser operates in a two-step workflow.
○ Step 2: Precedence-ordered scientific term extraction and typing rules are applied.
Works in two steps
1. Determining the “connector” positions within a title. The title is chunked at these
“connector” positions.
Our connectors are a collection of 11 prepositions and 1 verb, defined as:
connectors_rx = (to|of|on|for|from|with|by|via|through|using|in|as)
Pattern-based Approach to Scientific Entities Extraction
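Chunking a title at connector positions can be sketched with `re.split`. The connector list is the one given above; everything else is an assumption made for illustration: the function name `chunk_title`, case-insensitive matching, and the lookarounds that stop a connector from matching inside hyphenated compounds such as "Grapheme-to-Phoneme".

```python
import re

# Connector list from the talk; the lookarounds are an added assumption so that
# "to" inside "Grapheme-to-Phoneme" is not treated as a connector.
CONNECTORS_RX = re.compile(
    r"(?<![\w-])(to|of|on|for|from|with|by|via|through|using|in|as)(?![\w-])",
    re.IGNORECASE,
)


def chunk_title(title: str):
    """Split a title at connectors; return (phrases, connectors)."""
    parts = CONNECTORS_RX.split(title)
    # With one capturing group, re.split alternates text and matched connector.
    phrases = [p.strip() for p in parts[0::2]]
    connectors = [c.lower() for c in parts[1::2]]
    return phrases, connectors


print(chunk_title("Exploiting Headword Dependency and Predictive Clustering for Language Modeling"))
print(chunk_title("A Grapheme-to-Phoneme Conversion System for Dutch"))
```

Phrase `i` precedes connector `i`, so a title that starts with a connector yields an empty first phrase; keeping it preserves that alignment.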
20. ● Step 2 (continued)
2. Based on the number of connectors, the title is routed through a precedence-ordered
workflow of concept-typing heuristics.
E.g., if a title has one connector, it first enters the OneConnectorHeu() branch.
There, the first step is determining which connector is in the phrase. Then, depending on the
connector, a separate set of concept-typing precedence rules applies. E.g., if the connector is
“from”, the title subphrases are typed based on the following pattern: solution from resource.
Pattern-based Approach to Scientific Entities Extraction
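The one-connector branch might look roughly like the sketch below. Only the “from” rule (solution from resource) is taken from the talk; the rule table `ONE_CONNECTOR_RULES`, the function `one_connector_heu`, and the hyphen-aware lookarounds are hypothetical, and other connectors are deliberately left unhandled rather than guessed. The example title is chosen purely for illustration.

```python
import re

CONNECTORS_RX = re.compile(
    r"(?<![\w-])(to|of|on|for|from|with|by|via|through|using|in|as)(?![\w-])",
    re.IGNORECASE,
)

# connector -> (type of left phrase, type of right phrase).
# Only "from" is stated in the talk; the real system has rules for the others.
ONE_CONNECTOR_RULES = {"from": ("solution", "resource")}


def one_connector_heu(title: str):
    """Type the two phrases of a single-connector title, or return None."""
    parts = CONNECTORS_RX.split(title)
    if len(parts) != 3:  # zero connectors or more than one
        return None
    left, connector, right = (p.strip() for p in parts)
    rule = ONE_CONNECTOR_RULES.get(connector.lower())
    if rule is None or not left or not right:
        return None
    return {rule[0]: left, rule[1]: right}


print(one_connector_heu("Learning Bilingual Lexicons from Monolingual Corpora"))
```

Returning `None` for unknown connectors makes it explicit where further precedence rules would plug in.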
21. Plan for the Talk
● Raw Dataset
● Pattern-based approach to Scientific Entities Extraction
● Evaluations
23. ● When applied to the 50,237 titles, CL-Titles-Parser extracted 19,799 research
problems, 18,111 solutions, 20,033 resources, 1,059 languages, 6,878 tools, and
21,687 methods. These scientific concept lists were then evaluated for extraction
precision.
● Precision = total correctly extracted concepts / total extracted concepts
Evaluations: Experimental Setup
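The precision formula above translates directly into a few lines of code. The sketch assumes extracted terms are compared to a gold list by exact string match, which may differ from how the evaluation was actually carried out; the term lists are toy data for illustration, not results from the talk.

```python
def precision(extracted, gold) -> float:
    """Precision = correctly extracted concepts / all extracted concepts."""
    extracted = set(extracted)
    if not extracted:
        return 0.0  # convention chosen here for an empty extraction
    return len(extracted & set(gold)) / len(extracted)


# Toy data only: three of the four extracted terms appear in the gold list.
extracted_languages = ["Breton", "Lakota", "Dutch", "Corpus"]
gold_languages = {"Breton", "Lakota", "Dutch"}
print(precision(extracted_languages, gold_languages))
```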
24. Evaluations: Extraction Precision Results
Concept Type | Precision
language | 95.12%
resource | 86.96%
tool | 83.40%
solution | 80.77%
method | 77.29%
research problem | 58.09%
25. Evaluations: Extraction Precision Results
Extraction heuristics for language were the most precise. They rely on a regex list of
languages, so recall is limited by the list.
This is characteristic of rule-based systems. Our list is quite large and covers
various obscure languages. A zero-shot machine learning approach would be
an alternative to experiment with.
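A list-driven language matcher of this kind can be sketched as follows. The three languages are ones named in this talk, used here as a tiny stand-in for the much larger list; `LANGUAGES`, `LANGUAGE_RX`, and `find_languages` are hypothetical names, and the last title is made up to show the recall limitation.

```python
import re

# Tiny illustrative subset; the actual list is described as quite large.
LANGUAGES = ["Breton", "Lakota", "Dutch"]
LANGUAGE_RX = re.compile(r"\b(" + "|".join(map(re.escape, LANGUAGES)) + r")\b")


def find_languages(title: str) -> list[str]:
    """Return every listed language name that occurs in the title."""
    return LANGUAGE_RX.findall(title)


print(find_languages("GRAFON: A Grapheme-to-Phoneme Conversion System for Dutch"))
# A language missing from the list is never found, however clear the title is.
print(find_languages("A Morphological Analyzer for Klingon"))
```

Matching against a closed list explains both the high precision and the recall ceiling: a hit is almost always a language, but unlisted languages are silently skipped.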
26. Evaluations: Extraction Precision Results
Research problem precision is likely underestimated: curation of the gold-standard list was
biased toward already familiar research problems or their derivations. We estimate that at
least 20% of the terms were pruned from the gold data because they were relatively new, not
because they were incorrect.
27. Plan for the Talk
● Raw Dataset
● Pattern-based approach to Scientific Entities Extraction
● Evaluations
28. ● A qualitative analysis of the extracted terms for titles written in the 20th vs. the 21st
centuries was performed, and the most frequently used entities were indeed
indicative of the times.
○ E.g., social media channels like Twitter, the web, or online encyclopedias like
Wikipedia are predominant resources in the 21st century. This is in contrast to the text,
discourse, dialogues, and parse trees leveraged as resources in the 20th century.
● We proposed an incremental step toward the larger goal of generating
contributions-focused SKGs.
○ The absence of inter-annotator agreement scores to determine the reliability with
which the concepts can be selected will also be addressed in future work.
● Our code is publicly available on GitHub: https://github.com/jd-coderepos/cl-titles-parser/
Conclusion: Takeaways
29. Happy to take questions
Thank you for your attention!