Slides of the paper Automatic Reconstruction of Emperor Itineraries from the Regesta Imperii by Juri Opitz, Leo Born, Vivi Nastase and Yannick Pultar at the 3rd Edition of the DATeCH2019 International Conference
1. Automatic Reconstruction of Emperor Itineraries
from the Regesta Imperii
Juri Opitz, Leo Born, Vivi Nastase, Yannick Pultar
2. The RI corpus
● more than 150,000 “regests”
○ abstracts of charters issued by the Holy Roman Emperors
■ and also events (battles, births, etc.)
○ reference time span: almost 1,000 years
○ starting from the Carolingian dynasty
○ … to Maximilian I
9. To sum up...
● It’s not so easy to map regest place names from a time span of almost 1k years onto maps
● To address this, we engage two main problems:
○ place name prediction: many place names are unknown or return zero candidates
○ coordinate prediction: place name queries return large candidate sets and the correct point must be chosen
10. Place name prediction
● Experiments with Logistic regression
○ features: last known place name, text unigrams, emperor
● baseline 1: most-frequent-place name
● baseline 2: last known place name
○ closest possible anterior regest, time-wise
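The two baselines above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the `(date, place)` tuple layout and the toy itinerary are assumptions made for the example.

```python
from collections import Counter

def most_frequent_baseline(known_places):
    """Baseline 1: always predict the most frequent known place name."""
    counts = Counter(p for p in known_places if p is not None)
    return counts.most_common(1)[0][0]

def last_known_baseline(regests, index):
    """Baseline 2: place name of the closest earlier regest with a known place.

    `regests` is assumed to be sorted by date; unknown places are None.
    """
    for _date, place in reversed(regests[:index]):
        if place is not None:
            return place
    return None  # no earlier regest has a known place

# Toy, made-up itinerary (hypothetical data, sorted by date).
regests = [
    ("1155-01-10", "Pavia"),
    ("1155-03-02", "Pavia"),
    ("1155-06-18", "Rom"),
    ("1155-07-01", None),  # place name unknown, to be predicted
]

print(most_frequent_baseline([place for _, place in regests]))
print(last_known_baseline(regests, 3))
```

A logistic regression model would use the output of `last_known_baseline` as one feature next to the text unigrams and the emperor.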
11. Place name prediction results
[Chart panels: % of issuers where each method performed best; % of correct choices; % of correct choices, mean over all issuers]
12. For every place name...
● the place name predictions ensure that we have a non-empty set of candidate points
○ problem: sometimes the correct point is not contained in the candidate set (Future work)
13. Coordinate prediction
● we model the itinerary of an emperor as a DAG (directed acyclic graph)
● Assumption: lowest-cost path approximates the true itinerary
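A lowest-cost path through such a graph can be found with dynamic programming. The sketch below assumes a layered DAG: one layer of candidate points per regest in chronological order, with edges only between consecutive layers. That layout, the function name `lowest_cost_path` and the toy cost function are assumptions for illustration; the slides do not spell out the exact graph construction.

```python
def lowest_cost_path(layers, edge_cost):
    """Return (total_cost, path) for the cheapest path through a layered DAG.

    layers:    list of candidate lists, one per regest, in chronological order
    edge_cost: function (previous_candidate, candidate) -> non-negative cost
    """
    # best[c] = (cost of the cheapest path ending in candidate c, that path)
    best = {c: (0.0, [c]) for c in layers[0]}
    for layer in layers[1:]:
        new_best = {}
        for c in layer:
            # Extend the cheapest path from any candidate of the previous layer.
            cost, path = min(
                ((prev_cost + edge_cost(p, c), prev_path)
                 for p, (prev_cost, prev_path) in best.items()),
                key=lambda t: t[0],
            )
            new_best[c] = (cost, path + [c])
        best = new_best
    return min(best.values(), key=lambda t: t[0])

# Toy example: candidates are (x, y) points, cost is Manhattan distance.
toy_layers = [[(0, 0)], [(1, 0), (10, 0)], [(2, 0)]]
toy_cost = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])

cost, path = lowest_cost_path(toy_layers, toy_cost)
print(cost, path)
```

In the toy example the far-away candidate `(10, 0)` is rejected because the detour makes the whole path more expensive, which is the intuition behind choosing one point from each candidate set.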
15. Edge cost heuristic
● bias towards crowded places (many places of medieval significance are still crowded today, e.g. Rome, Nuremberg, etc.)
● straight line distance
● bias towards high ranked results
● bias towards exact name matches (we want to keep inexact matches, e.g. Franckfurt -> autocorrect -> Frankfurt)
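One way the four ingredients could be combined into a single edge cost is sketched below. The multiplicative form, the weights `w_rank` and `w_inexact`, the `Candidate` fields and the use of population as a proxy for "crowded" are all assumptions for illustration and are not the formula from the paper. Coordinates and population figures are rough values.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    lat: float
    lng: float
    population: int    # assumed proxy for "crowded"
    rank: int          # position in the query result list, 0 = top result
    exact_match: bool  # False if the name was matched only after autocorrection

def haversine_km(a, b):
    """Straight-line (great-circle) distance between two candidates in km."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lng - a.lng)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

def edge_cost(a, b, w_rank=0.1, w_inexact=0.5):
    """Hypothetical cost of travelling from candidate a to candidate b."""
    distance = haversine_km(a, b)
    crowd_discount = 1.0 / (1.0 + math.log1p(b.population))   # crowded is cheaper
    rank_penalty = 1.0 + w_rank * b.rank                      # low rank costs more
    match_penalty = 1.0 if b.exact_match else 1.0 + w_inexact  # inexact kept, but costlier
    return distance * crowd_discount * rank_penalty * match_penalty

mainz = Candidate(50.00, 8.27, 220_000, 0, True)
frankfurt_main = Candidate(50.11, 8.68, 750_000, 0, True)
frankfurt_oder = Candidate(52.34, 14.55, 57_000, 1, True)

print(edge_cost(mainz, frankfurt_main) < edge_cost(mainz, frankfurt_oder))
```

Inexact matches receive a penalty rather than being dropped, so an autocorrected Franckfurt can still win if it fits the rest of the itinerary.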
16. Shortest path selection enables us to obtain...
● for every regest/event a tuple of predicted lat-lng coordinates
● additionally we compute centroids
○ i.e. for every place name we compute the most centered coordinate
○ many and frequent place names have unequivocal points of reference (Rome, Nuremberg, etc.)
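The "most centered coordinate" can be read as a medoid: among all coordinates predicted for one place name, pick the one with the smallest summed distance to the others. That reading is an assumption, as are the function name and the toy coordinates; distances here are plain Euclidean distances on (lat, lng) pairs, which is only a rough approximation.

```python
import math

def most_centered(points):
    """Return the point with the smallest summed distance to all points (medoid)."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Hypothetical predictions for regests mentioning "Frankfurt":
# three agree on one point, one was resolved to a far-away namesake.
frankfurt_predictions = [
    (50.11, 8.68),
    (50.11, 8.68),
    (50.11, 8.68),
    (52.34, 14.55),
]

print(most_centered(frankfurt_predictions))
```

Unlike an arithmetic mean, the medoid is always one of the predicted points, so a single outlier cannot drag the reference point to a location where no candidate exists.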
17. Gold standard
● Gold standard
○ approx. 10k place names manually resolved by HiWi interns on a place name level
■ this means that the gold standard cannot account for the case where a king visited two places of the same name but different locations
○ our resolutions: event-level
○ nevertheless, it’s the best we have to evaluate against
18. Results of different path searches vs. time
[Plot: distance to gold over time, lower = better; annotations: "very hard even for historians", "Staufer's Italian travels"]
22. Conclusions
● the optimal path yields better predictions than greedy search and much better ones than random
○ evidence that our edge cost heuristic formula contains some useful information
● method can capture human annotation errors
● in some time periods, places are much harder to resolve than in others
23. Future work
● improve place name prediction
○ try time-series prediction models which model geo-spatial-temporal context better
○ place name normalization (Franckfurt, Vrankenforde, Franckenfurt → Frankfurt a. Main)
● improve coordinate prediction
○ improve cost heuristic
○ try historian place gazetteers instead of modern geo databases
■ caveat: how well will they generalize across Europe and over almost 1k years?
● mine and resolve the rich place names and place name references inside the texts
○ difficult but yields new large-scale resources and options for statistical historic itinerary research!
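The place name normalization item above could be prototyped with fuzzy string matching from the Python standard library. This is only a sketch of one possible approach, not something the slides propose: the function name, the tiny gazetteer and the cutoff of 0.8 are assumptions.

```python
import difflib

def normalize_place_name(name, gazetteer, cutoff=0.8):
    """Map a historical spelling to the closest gazetteer entry, or None."""
    matches = difflib.get_close_matches(name, gazetteer, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical mini gazetteer of normalized names.
gazetteer = ["Frankfurt", "Nürnberg", "Rom"]

print(normalize_place_name("Franckfurt", gazetteer))
```

Spellings that diverge strongly from the modern form, such as Vrankenforde, may fall below a purely character-based cutoff, which is one reason this remains future work.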