The document describes a semantic publishing platform prototype that was developed to extract knowledge from mathematical scholarly papers and integrate it into the Linked Open Data cloud. Key aspects of the prototype include developing ontologies to represent the mathematical domain and logical structure of papers, and methods for extracting metadata, structure, terminology, and formulas from papers in Russian. The prototype was applied to a collection of over 1,300 mathematical publications, and its performance was evaluated. Use cases demonstrate how the published semantic data can enable searches over theoretical findings and linking mathematical entities to DBpedia. Future work involves integrating the modules into a toolkit and expanding the approach to other domains.
1. Bringing Math to LOD:
A Semantic Publishing Platform Prototype for
Scientific Collections in Mathematics
Olga Nevzorova, Nikita Zhiltsov, Danila Zaikin, Olga Zhibrik,
Alexander Kirillovich, Vladimir Nevzorov, Evgeniy Birialtsev
Kazan Federal University
Russia
October 23, 2013
1 / 29
3. Our Contribution
Our prototype builds a semantic graph of
mathematical knowledge objects that
is extracted from a collection of mathematical
scholarly papers, and
is integrated into the LOD cloud
4. Research Output
IVM Data Set
LOD representation of 1 330 scholarly publications of
the «Izvestiya Vuzov. Matematika» (IVM) journal
Covers the semantics of:
article metadata
elements of the logical structure
terminology
formulas
Aligned with DBpedia, CORDIS
More than 850 000 RDF triples
SPARQL endpoint:
http://cll.niimm.ksu.ru:8890/sparql-auth∗
∗ The SPARQL endpoint is secured. Please email the authors for credentials.
5. Related Work
Domain-specific languages: OMDoc, MathLang
Domain models: Cambridge Mathematical
Thesaurus, DBpedia (math-related part),
ScienceWISE Ontology
Math-related NLP: mArachna; linguistic modules of
arXMLiv
7. Key Research Contributions
a thorough ontological model of the mathematical
domain
an ontology-based language-independent method for
extraction of logical structure elements in papers
an ontology-based method for extraction of
mathematical named entities from texts in Russian
a method that connects mathematical named entities
to symbolic expressions
10. Ontology of Structural Elements (1)
http://cll.niimm.ksu.ru/ontologies/mocassin
Covers 15 common structural elements
Defines 9 object properties and 4 datatype properties
11. Ontology of Structural Elements (2)
http://cll.niimm.ksu.ru/ontologies/mocassin
3 cardinality axioms, e.g.
Proof ∧ (= 1 proves ProvableStatement†)
2 transitivity axioms for the hasPart and dependsOn
properties
DL expressivity: SRIN(D)
† i.e., Claim ∨ Corollary ∨ Lemma ∨ Proposition ∨ Theorem
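Read as description-logic axioms, the constraints on this slide can plausibly be written as below. This is an interpretive sketch, not the ontology's actual source: the subsumption symbol, the qualified form of the cardinality restriction, the equivalence for ProvableStatement, and the use of a class union in place of the slide's ∨ are all assumptions.

```latex
\begin{align*}
\mathsf{ProvableStatement} &\equiv \mathsf{Claim} \sqcup \mathsf{Corollary} \sqcup \mathsf{Lemma}
                               \sqcup \mathsf{Proposition} \sqcup \mathsf{Theorem} \\
\mathsf{Proof} &\sqsubseteq (= 1\ \mathsf{proves}.\mathsf{ProvableStatement}) \\
&\mathrm{Trans}(\mathsf{hasPart}), \quad \mathrm{Trans}(\mathsf{dependsOn})
\end{align*}
```

In words: every proof proves exactly one provable statement, and both hasPart and dependsOn propagate along chains of segments.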
12. Ontology of Mathematical Concepts (1)
http://cll.niimm.ksu.ru/ontologies/mathematics
Covers 3 450 mathematical concepts
Defines commonly used terms as well as terms from
the emerging professional vocabulary (e.g.
Bitsadze-Samarsky problem)
Supports Russian/English labels
13. Ontology of Mathematical Concepts (2)
http://cll.niimm.ksu.ru/ontologies/mathematics
Includes two taxonomies:
taxonomy of mathematical theories‡:
number theory, set theory, algebra, analysis, geometry,
mathematical logic, discrete mathematics, theory of
computation, differential equations, numerical analysis,
probability theory and statistics
taxonomy of mathematical objects
Covers common scientific concepts, such as Problem,
Method, Statement, Formula, etc.
DL expressivity: ALCHI
‡ The taxonomy covers only a part of mathematical knowledge.
14. Ontology of Mathematical Concepts (3)
Object properties
belongsTo/contains, e.g.
Barycentric Coordinates belongsTo Metric Geometry
defines/isDefinedBy, e.g.
Christoffel Symbol isDefinedBy Connectedness
seeAlso, e.g.
Chebyshev Iterative Method seeAlso Numerical Solution of
Linear Equation Systems
15. Ontology of Mathematical Concepts (4)
Stats
3 450 classes
27% of classes are mapped onto DBpedia
3 630 subclass-of property instances
1 140 other object property instances
General facts about the development:
lasted for 4 months
7 professional mathematicians participated as domain experts,
guided by the authors
WebProtege was used as a collaborative tool
17. NLP Annotation
Relies on the OntoIntegrator facilities
Solves some of the conventional linguistic tasks, such
as:
tokenization
sentence splitting (~98% F-measure§)
morphological analysis
NP extraction (88% precision)
Special handling of math symbols, abbreviations, and
math expressions as parts of NPs
Currently supports only the Russian language
§ The metrics were evaluated on real math texts with the help of domain experts.
18. Mining the Logical Structure
Supports our ontology of structural elements:
elements in real texts are instances of the ontology classes
Recognizing types of structural elements:
A string-similarity-based method gives 89%-100%
F-measure, depending on the class
Recognizing semantic relations between them:
A decision tree learner gives 61%-95% F-measure
depending on the relation
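The type-recognition step can be illustrated with a minimal sketch. The slides do not say which string similarity measure or which cue words the authors use, so the `difflib` ratio, the `CUES` table, and the `0.8` threshold below are all assumptions made for illustration; the real system works on Russian text.

```python
from difflib import SequenceMatcher

# Hypothetical cue words per structural class (English stand-ins).
# The actual ontology covers 15 classes; only a few are shown here.
CUES = {
    "Theorem": ["theorem"],
    "Lemma": ["lemma"],
    "Proof": ["proof"],
    "Definition": ["definition"],
    "Corollary": ["corollary"],
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (difflib's ratio, an assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_heading(heading: str, threshold: float = 0.8):
    """Return the best-matching class for a segment heading, or None."""
    words = heading.split()
    first_word = words[0].strip(".:") if words else ""
    best_class, best_score = None, 0.0
    for cls, cues in CUES.items():
        for cue in cues:
            score = similarity(first_word, cue)
            if score > best_score:
                best_class, best_score = cls, score
    return best_class if best_score >= threshold else None


print(classify_heading("Theorem 2."))
print(classify_heading("Theorm 3"))      # misspelled heading still matches
print(classify_heading("Introduction"))  # no structural class is close enough
```

A similarity threshold, rather than exact matching, tolerates misspellings and inflected forms in headings, which is presumably why a similarity-based method suits this task.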
19. Mathematical Named Entity Extraction
Supports our ontology of mathematical concepts:
assigned NPs are instances of the ontology classes
Our method employs annotations of the NP structure
and Jaccard similarity
The method achieves 86% F-measure with parameters
tuned for a precision/recall trade-off
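The role of Jaccard similarity can be sketched as follows. This is a simplified illustration, not the authors' method: it compares plain token sets, whereas the slides say the real method also uses annotations of the NP structure. The `LABELS` table, the `ex:` identifiers, and the `0.5` threshold are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical concept labels; the real ontology has 3 450 classes
# with Russian/English labels.
LABELS = {
    "ex:FiniteGroup": "finite group",
    "ex:LieAlgebra": "Lie algebra",
    "ex:BoundaryValueProblem": "boundary value problem",
}


def link_noun_phrase(noun_phrase: str, labels: dict, threshold: float = 0.5):
    """Return the concept whose label best overlaps the noun phrase, or None."""
    np_tokens = set(noun_phrase.lower().split())
    best_id, best_score = None, 0.0
    for concept_id, label in labels.items():
        score = jaccard(np_tokens, set(label.lower().split()))
        if score > best_score:
            best_id, best_score = concept_id, score
    return best_id if best_score >= threshold else None


print(link_noun_phrase("finite simple group", LABELS))
print(link_noun_phrase("the main result", LABELS))
```

Raising the threshold favors precision and lowering it favors recall, which is one way to realize the trade-off mentioned above.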
21. Connecting Named Entities to Formulas
Parsing mathematical expressions
Detection of variables
Proximity-based matching of mathematical variables
with noun phrases at 68% accuracy
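A minimal sketch of proximity-based matching, assuming the text is already tokenized and that noun phrases and variables carry token positions. The `Span` type, the distance definition, and the `max_distance` cut-off are assumptions for illustration; the slides do not describe the authors' exact procedure.

```python
from typing import Dict, List, NamedTuple, Optional


class Span(NamedTuple):
    start: int  # index of the first token of the noun phrase
    end: int    # index one past the last token
    text: str


def distance(var_pos: int, np: Span) -> int:
    """Number of token steps between a variable and the nearest edge of a noun phrase."""
    if var_pos < np.start:
        return np.start - var_pos
    if var_pos >= np.end:
        return var_pos - np.end + 1
    return 0  # the variable lies inside the noun phrase


def match_variables(
    var_positions: Dict[str, int],
    noun_phrases: List[Span],
    max_distance: int = 3,
) -> Dict[str, Optional[str]]:
    """Assign each variable the closest noun phrase within max_distance tokens."""
    result: Dict[str, Optional[str]] = {}
    for var, pos in var_positions.items():
        best = min(noun_phrases, key=lambda np: distance(pos, np), default=None)
        if best is not None and distance(pos, best) <= max_distance:
            result[var] = best.text
        else:
            result[var] = None
    return result


# Tokens: the(0) group(1) G(2) acts(3) on(4) the(5) set(6) X(7)
phrases = [Span(0, 2, "the group"), Span(5, 7, "the set")]
print(match_variables({"G": 2, "X": 7}, phrases))
```

Ties between equally distant noun phrases are resolved here by list order, which is arbitrary; ambiguity of this kind is one reason a purely proximity-based heuristic can mislabel variables.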
23. Other Supported Features
Article metadata extraction (title, author names,
publication year, etc.) according to the AKT Portal
schema
Semi-manual interlinking¶ with existing LOD data
sets: DBpedia, CORDIS
Publishing the extracted data as an LOD-compliant
RDF data set
¶ by leveraging the Silk app
26. Semantic Search of Theoretical Findings
Finding articles with theorems about finite groups
PREFIX moc: <http://cll.niimm.ksu.ru/ontologies/mocassin#>
PREFIX math: <http://cll.niimm.ksu.ru/ontologies/mathematics#>
SELECT ?article WHERE {
  ?article moc:hasSegment ?theorem .
  ?theorem moc:mentions ?entity ; a moc:Theorem .
  ?entity a math:E2183
}
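To make the graph pattern concrete, the sketch below evaluates the same join over a handful of in-memory triples in plain Python. The property and class names are taken from the query; every `ex:` identifier and the toy data are hypothetical, and no subclass inference is performed.

```python
# Toy data set: triples are (subject, predicate, object) tuples.
# All "ex:" resources are invented for illustration.
TRIPLES = {
    ("ex:article1", "moc:hasSegment", "ex:thm1"),
    ("ex:thm1", "rdf:type", "moc:Theorem"),
    ("ex:thm1", "moc:mentions", "ex:np1"),
    ("ex:np1", "rdf:type", "math:E2183"),
    # article2 mentions the same concept, but only inside a lemma
    ("ex:article2", "moc:hasSegment", "ex:lem1"),
    ("ex:lem1", "rdf:type", "moc:Lemma"),
    ("ex:lem1", "moc:mentions", "ex:np2"),
    ("ex:np2", "rdf:type", "math:E2183"),
}


def articles_with_theorems_about(concept: str, triples: set) -> list:
    """Mirror the query: article -> segment typed Theorem -> entity typed `concept`."""

    def objects(subject: str, predicate: str) -> set:
        return {o for (s, p, o) in triples if s == subject and p == predicate}

    found = set()
    for article, predicate, segment in triples:
        if predicate != "moc:hasSegment":
            continue
        if "moc:Theorem" not in objects(segment, "rdf:type"):
            continue
        for entity in objects(segment, "moc:mentions"):
            if concept in objects(entity, "rdf:type"):
                found.add(article)
    return sorted(found)


print(articles_with_theorems_about("math:E2183", TRIPLES))
```

Only the first article is returned, because the second one mentions the concept in a lemma rather than a theorem; this is the distinction that a keyword search over full text cannot make.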
27. Conclusion
We have developed a holistic approach for mining
an LOD representation of scholarly papers in
mathematics
We applied the prototype to a collection of over
1 300 real math papers
We conducted a thorough evaluation of the proposed
methods with the help of domain experts
We provided several use cases to illustrate the utility
of the published data
28. Future Work
Integrate all the modules into a full-fledged toolkit
Add English support to the NLP module
Extend our approach to texts in other natural
science domains