Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
Steven Moran & Martin Brümmer
University of Zurich & University of Leipzig
Steven Moran & Martin Brümmer Lemon-aid: using lemon for QuantHistLing
Goals
Convert dictionary and wordlist data into the Lexicon Model
for Ontologies, aka lemon
Leverage Linked Data (LD) to combine disparate lexical
resources (50+) from the QuantHistLing research unit
Resulting LD resources provide researchers with:
more linguistic data in the Linguistic Linked Open Data Cloud
(LLOD)
a translation graph to query across the underlying lexicons and
dictionaries to extract semantically-aligned wordlists
Talk map
Data description & conversion
Ontological model
Application and examples
QuantHistLing project
The Quantitative Historical Linguistics (QuantHistLing) research
unit aims to uncover and clarify phylogenetic relationships
between native South American languages using quantitative
methods
http://quanthistling.info/
There are two main objectives of the project:
Digitization of lexical resources on South American
languages
Development of computer-assisted methods and
algorithms to quantitatively analyze the digitized data
QuantHistLing data
QuantHistLing aims to digitize around 500 works, most of
which are currently only available in print and many of which
are the only resources available for the languages that they
describe
http://quanthistling.info/index.php?id=resources
Simple data output format that contains metadata (prefixed
with “@”) and tab-delimited lexical output
QuantHistLing data format
The first row following the metadata contains the data header
with the fields: QLCID, HEAD, HEAD DOCULECT,
TRANSLATION, TRANSLATION DOCULECT
They correspond respectively to the internal QLC unique
identifier, the headword in the dictionary, the doculect of the
headword (or in other words the language which this
particular document describes), the translation for the given
headword, and the doculect that the translation is given in
For each resource a data dump with the same format is
provided by the project
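A minimal sketch of reading such a dump in Python. The metadata layout ("@key: value"), the underscores in the header names, and the sample row are assumptions made for illustration; only the "@" prefix, the tab delimiter, and the five field names come from the format description above.

```python
import csv
import io


def parse_qlc(text):
    """Split a QLC-style dump into (metadata dict, list of row dicts)."""
    metadata = {}
    data_lines = []
    for line in text.splitlines():
        if line.startswith("@"):
            # Assumed layout: "@key: value"
            key, _, value = line[1:].partition(":")
            metadata[key.strip()] = value.strip()
        elif line.strip():
            data_lines.append(line)
    # The first non-metadata line is the header; the rest are tab-delimited rows.
    reader = csv.DictReader(
        io.StringIO("\n".join(data_lines)),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    return metadata, list(reader)


# Invented sample, not taken from any QuantHistLing resource.
sample = (
    "@source: example_dictionary\n"
    "QLCID\tHEAD\tHEAD_DOCULECT\tTRANSLATION\tTRANSLATION_DOCULECT\n"
    "ex-1\tformA\tdoculect_a\tcasa\tspanish\n"
)
metadata, rows = parse_qlc(sample)
```

Because every resource is dumped in the same format, one reader of this kind could in principle be reused across all of them.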
Conversion from QHL-to-LD
We convert the QLC data into Linked Data that conforms to
the Lemon model with a simple Python script
Lemon is an ontological model for modeling lexicons and
machine-readable dictionaries for linking to the Semantic Web
and the Linked Data cloud
http://lemon-model.net/
Lemon developers are also active in the W3C Ontology-Lexica
Community Group
Its goal is to “develop models for the representation of lexica (and
machine readable dictionaries) relative to ontologies”
http://www.w3.org/community/ontolex/
Implementation of QHL data in Lemon
According to the Lemon model, senses should link to an
ontological entity such as a DBpedia resource
However, we only have the strings representing the word forms,
which is not enough data to reliably link to other knowledge
bases
We argue that it is linguistically correct to link the senses of
the entries (instead of their word forms) to their respective
translations
So the ‘sense’ resources serve a purpose, even if they do not
link to the meaning of the entry
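One way to picture this design is the sketch below, which turns a single dictionary row into N-Triples lines: the entry gets a form and a sense, and the sense (not the form) points at the translation. This is not the project's conversion script. The base URI, the URI layout, and the linking property `TRANSLATION` are hypothetical, and the lemon namespace is assumed to be http://lemon-model.net/lemon#.

```python
from urllib.parse import quote

LEMON = "http://lemon-model.net/lemon#"   # assumed lemon namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/qhl/"          # hypothetical base URI
TRANSLATION = BASE + "vocab#translation"  # hypothetical linking property


def uri(*parts):
    """Build a URI under BASE from percent-encoded path segments."""
    return BASE + "/".join(quote(p, safe="") for p in parts)


def literal(text):
    """Quote a string as a plain N-Triples literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def row_to_triples(row):
    entry = uri("entry", row["QLCID"])
    form = entry + "/form"
    sense = entry + "/sense"
    target = uri("translation", row["TRANSLATION_DOCULECT"], row["TRANSLATION"])
    return [
        f"<{entry}> <{RDF_TYPE}> <{LEMON}LexicalEntry> .",
        f"<{entry}> <{LEMON}language> {literal(row['HEAD_DOCULECT'])} .",
        f"<{entry}> <{LEMON}canonicalForm> <{form}> .",
        f"<{form}> <{LEMON}writtenRep> {literal(row['HEAD'])} .",
        f"<{entry}> <{LEMON}sense> <{sense}> .",
        f"<{sense}> <{RDF_TYPE}> <{LEMON}LexicalSense> .",
        # The sense, rather than the word form, carries the link to the translation.
        f"<{sense}> <{TRANSLATION}> <{target}> .",
    ]


# Invented row for illustration only.
row = {
    "QLCID": "ex-1",
    "HEAD": "formA",
    "HEAD_DOCULECT": "doculect_a",
    "TRANSLATION": "casa",
    "TRANSLATION_DOCULECT": "spanish",
}
triples = row_to_triples(row)
```

Under this layout, entries from different dictionaries that share a translation end up pointing at the same translation node, which is what lets the senses act as join points.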
Application
A major goal in historical-comparative linguistics is the
identification of cognates, i.e. sets of words in genealogically
related languages that have been derived from a common
word or root (e.g. English ‘is’, German ‘ist’, Latin ‘est’, from
Indo-European ‘esti’).
Modeling dictionaries and lexicons in a pivot ontology using
overlaps in translations is one way to merge several resources
into one RDF graph for querying and extracting
semantically-aligned wordlists, which can then be used as
input into computational historical linguistics tools such as
LingPy (List and Moran, 2013).
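The pivot idea can be sketched without any RDF tooling: index every headword by its translation, then read off the forms that share a translation. The sketch below works on plain dictionaries with the field names of the QLC header (underscored here by assumption). The rows are invented, and the matching is plain case-insensitive string equality, which is far cruder than what real data would need.

```python
from collections import defaultdict


def build_pivot(rows):
    """Map (translation doculect, translation) to {head doculect: set of head forms}."""
    pivot = defaultdict(lambda: defaultdict(set))
    for row in rows:
        key = (row["TRANSLATION_DOCULECT"], row["TRANSLATION"].strip().lower())
        pivot[key][row["HEAD_DOCULECT"]].add(row["HEAD"])
    return pivot


def aligned_wordlist(pivot, concepts, pivot_doculect):
    """Return {concept: {doculect: sorted forms}} for the concepts found in the pivot."""
    wordlist = {}
    for concept in concepts:
        key = (pivot_doculect, concept)
        if key in pivot:
            wordlist[concept] = {d: sorted(forms) for d, forms in pivot[key].items()}
    return wordlist


# Invented rows standing in for two different dictionaries.
rows = [
    {"HEAD": "formA", "HEAD_DOCULECT": "doculect_a",
     "TRANSLATION": "casa", "TRANSLATION_DOCULECT": "spanish"},
    {"HEAD": "formB", "HEAD_DOCULECT": "doculect_b",
     "TRANSLATION": "Casa", "TRANSLATION_DOCULECT": "spanish"},
    {"HEAD": "formC", "HEAD_DOCULECT": "doculect_b",
     "TRANSLATION": "agua", "TRANSLATION_DOCULECT": "spanish"},
]
wordlist = aligned_wordlist(build_pivot(rows), ["casa", "agua"], "spanish")
```

Each concept row of the result lines up forms from different doculects under one meaning, which is the shape a semantically aligned wordlist takes.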
As a first step, we have converted the QHL data into RDF,
and it is available online through a SPARQL endpoint:
http://linked-data.org/sparql/ (preliminary)
http://quanthistlist.info/lod/ (coming soon)
A dump is available at http://linked-data.org/datasets/
Querying the combined dictionaries and lexicons is
straightforward
Return all triples
returns over 3.8 million triples
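The query shown on the slide is not preserved in this text. A plausible form is the standard match-everything pattern, sketched below together with a helper that builds a GET request URL in the way the SPARQL protocol describes (a `query` parameter). The endpoint is the preliminary one named in the talk; whether it is reachable, and the helper names, are assumptions. Nothing here sends a request.

```python
from urllib.parse import urlencode

ENDPOINT = "http://linked-data.org/sparql/"  # preliminary endpoint named in the talk

# Match every triple in the store.
ALL_TRIPLES = "SELECT * WHERE { ?s ?p ?o }"

# Count the triples instead of fetching millions of rows.
COUNT_TRIPLES = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"


def with_limit(query, limit):
    """Append a LIMIT clause so that a test run stays small."""
    return f"{query} LIMIT {int(limit)}"


def sparql_get_url(endpoint, query):
    """Build a SPARQL-protocol GET URL carrying the query string."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, with_limit(ALL_TRIPLES, 10))
```

With a result set in the millions, the counting variant or a LIMIT clause is usually the safer first request.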
Pairs of languages in the translation graph that contain
written forms for the lexical sense “casa”
Languages in the translation graph that contain written
forms for the lexical sense “casa”
66 language pairs
marked entry shows word forms are not normalized and
contain data that can be analyzed further
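As a small worked illustration of how a pair count relates to a set of languages: if each unordered pair of doculects with a written form for the sense is counted once, then n doculects give n(n-1)/2 pairs, and 12 doculects would give 66. Whether the figure above arose this way is an assumption, and the hits below are invented placeholders.

```python
from itertools import combinations


def language_pairs(hits):
    """hits: iterable of (doculect, written_form) for one sense.

    Returns every unordered pair of distinct doculects, each pair once.
    """
    doculects = sorted({doculect for doculect, _ in hits})
    return list(combinations(doculects, 2))


# Twelve invented doculects, each with one placeholder form.
hits = [(f"doculect_{i:02d}", f"form_{i:02d}") for i in range(12)]
pairs = language_pairs(hits)
```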
Queries
Can be easily extended to incorporate entire wordlists, such as
the Swadesh list (Swadesh, 1952) or Leipzig-Jakarta list
(Tadmor et al., 2010)
The combination of disparate data from many dictionaries and
lexicons is a first step in a computational historical linguistics
pipeline
Results are given in the source documents’ orthographic
representations
They must be normalized into an interlingual pivot, such as
the International Phonetic Alphabet, if phonetic or phonemic
analysis is to be applied to the data
This normalization is the next step before producing phonetic
alignments and cognate judgements based on metrics and
algorithms for calculating lexical similarity
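One common way to do such a normalization is a per-source mapping from graphemes to IPA, applied with greedy longest-match so that digraphs win over single letters. The sketch below is illustrative only: the mapping table is hypothetical and belongs to no actual source, and a real profile would have to be written for each dictionary's orthography.

```python
def to_ipa(form, profile):
    """Greedy longest-match transliteration of an orthographic form.

    Returns (converted string, list of characters that had no mapping).
    Unmapped characters are copied through unchanged so they can be reviewed.
    """
    keys = sorted(profile, key=len, reverse=True)
    out, unknown, i = [], [], 0
    while i < len(form):
        for key in keys:
            if form.startswith(key, i):
                out.append(profile[key])
                i += len(key)
                break
        else:
            out.append(form[i])
            unknown.append(form[i])
            i += 1
    return "".join(out), unknown


# Hypothetical profile for one Spanish-based orthography.
PROFILE = {
    "ch": "tʃ", "sh": "ʃ", "ñ": "ɲ",
    "a": "a", "i": "i", "u": "u", "c": "k", "h": "h",
}

converted, unmapped = to_ipa("chaña", PROFILE)
```

Returning the unmapped characters alongside the result makes gaps in a profile visible instead of silently passing raw orthography into the phonetic analysis.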
Conclusion
From data being digitized and extracted from print resources,
we are creating machine-readable lexicons that are
interoperable both with each other (we link semantic senses using
the Lemon ontology model) and with other linguistic sources
We also interlink the resulting dictionary resources with other
language resources in the LLOD via ISO 639-3 codes
Future work - simplify!
Build algorithms that identify semantically similar
translation-pairs from terse translations
Identify that doculect translations like
“coarsely grind”
“grind up, crush well”
“grind lightly (chili pepper, millet for a quick snack)”
“grind lightly (groundnuts) with stones”
for different languages can be mapped to a simpler form such
as “to crush/grind” for initial comparative analysis
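A deliberately naive baseline for this, not the algorithm the future work calls for: drop parenthetical remarks, remove a few function words, and pick the content word shared by the most glosses. The stopword list is a tiny illustrative assumption, and the input must be a non-empty list of glosses.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a curated resource).
STOPWORDS = {"a", "for", "the", "to", "up", "well", "with"}


def content_words(gloss):
    """Lowercased content words of a gloss, ignoring parenthetical remarks."""
    gloss = re.sub(r"\([^)]*\)", " ", gloss)
    return {w for w in re.findall(r"[a-z]+", gloss.lower()) if w not in STOPWORDS}


def shared_core(glosses):
    """Content words that occur in the largest number of glosses."""
    counts = Counter(w for gloss in glosses for w in content_words(gloss))
    top = max(counts.values())
    return sorted(w for w, n in counts.items() if n == top)


GLOSSES = [
    "coarsely grind",
    "grind up, crush well",
    "grind lightly (chili pepper, millet for a quick snack)",
    "grind lightly (groundnuts) with stones",
]
core = shared_core(GLOSSES)
```

On these four glosses the shared word is "grind". A real method would also need to handle synonyms such as "crush", which simple token overlap cannot relate.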
Future work - annotate!
Use NLP Interchange Format (Hellmann et al., 2012) to keep
track of where information in the dictionaries comes from - or
in other words, use NIF combined with Lemon to annotate the
QHL data sources for provenance
Future work - link!
Link to further resources that contain linguistic and
non-linguistic information
Typological data and geographic variables that may provide
useful information for determining the genealogical and
geographical relatedness of languages
Many thanks!
Organizers and participants of LDL-2013
QuantHistLing Research Unit - University of Marburg
(Michael Cysouw, PI)
University of Zurich, University of Marburg and University of
Leipzig