Slides of the presentation of the paper "An approach to unsupervised historical text normalisation" by Petar Mitankin, Stefan Gerdjikov and Stoyan Mihov at DATeCH 2014.
1. An approach to unsupervised historical text normalisation
Petar Mitankin, Sofia University, FMI
Stefan Gerdjikov, Sofia University, FMI
Stoyan Mihov, Bulgarian Academy of Sciences, IICT
DATeCH 2014, Maye → May 19 - 20, Madrid, Spain
3. Contents
● Supervised Text Normalisation
– CULTURA
– REBELS Translation Model
– Functional Automata
● Unsupervised Text Normalisation
– Unsupervised REBELS
– Experimental Results
– Future Improvements
4. Co-funded under the 7th Framework Programme of the European Commission
● Maye - 34 occurrences in the 1641 Depositions, 8022 documents, 17th-century Early Modern English
● CULTURA: CULTivating Understanding and Research through Adaptivity
● Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA UNIVERSITY ST KLIMENT OHRIDSKI
6. Supervised Text Normalisation
● Manually created ground truth
– 500 documents from the 1641 Depositions
– All words: 205 291
– Normalised words: 51 133
● Statistical Machine Translation from the historical language to the modern language combines:
– a translation model
– a language model
8. REgularities Based Embedding of Language Structures (REBELS)
Automatic extraction of historical spelling variations.
[Diagram: the historical word "shee" is fed into the REBELS Translation Model, which outputs scored candidates: he / -1.89, se / -1.69, she / -9.75, shea / -10.04]
9. Training of the REBELS Translation Model
● Training pairs from the ground truth:
(shee, she), (maye, may), (she, she), (tyme, time), (saith, says), (have, have), (tho:, thomas), ...
10. Training of the REBELS Translation Model
● A deterministic structure (DAWG) of all historical/modern subwords
● Each word has several hierarchical decompositions in the DAWG:
[Diagram: hierarchical decomposition of each historical word and of each modern word]
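The decomposition step can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: a simple set of subwords stands in for the deterministic DAWG, and the inventory below is invented for the example.

```python
def decompositions(word, subwords):
    """Yield every split of `word` into subwords from the inventory."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in subwords:  # extend each decomposition of the remainder
            for rest in decompositions(word[i:], subwords):
                yield [prefix] + rest

# Illustrative inventory; the real system derives subwords from the corpus.
subwords = {"kno", "wth", "know", "th"}
print(sorted(decompositions("knowth", subwords)))
# -> [['kno', 'wth'], ['know', 'th']]
```

A word with a rich inventory can thus have several competing decompositions, which is exactly what the hierarchical structure on this slide captures.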
11. Training of the REBELS Translation Model
● For each training pair (knowth, knows) we find a mapping between the decompositions
● We collect statistics about historical subword -> modern subword correspondences
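The statistics collection might look like the following sketch, where `aligned_pairs` is a hypothetical alignment between the historical and modern decompositions of one training pair (the alignments shown are invented for illustration):

```python
from collections import Counter

subword_counts = Counter()  # (historical subword, modern subword) -> frequency

def collect(aligned_pairs):
    """Count each historical -> modern subword correspondence."""
    for hist_sub, mod_sub in aligned_pairs:
        subword_counts[(hist_sub, mod_sub)] += 1

# Hypothetical alignments for (knowth, knows) and (shee, she):
collect([("kno", "kno"), ("wth", "ws")])
collect([("she", "she"), ("e", "")])
```

These raw frequencies would then be normalised into the translation-model scores that REBELS assigns to its candidates.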
12. REgularities Based Embedding of Language Structures (REBELS)
REBELS generates normalisation candidates for unseen historical words.
[Diagram: the historical word "shee" is fed into the REBELS Translation Model, which outputs scored candidates: he / -1.89, se / -1.69, she / -9.75, shea / -10.04]
14. Combination of REBELS with a Statistical Bigram Language Model
● relevance score (he knuth my) = REBELS TM (he knuth my) * C_tm + Statistical Language Model (he knuth my) * C_lm
● Bigram statistical model
– Smoothing: absolute discounting, backing off
– Gutenberg English language corpus
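The combination formula can be sketched as a weighted sum of log-probabilities. The toy probability tables and the interpolation weights `C_TM` / `C_LM` below are invented for illustration, not the paper's trained values:

```python
C_TM, C_LM = 1.0, 0.5  # hypothetical interpolation weights

def tm_logprob(hist_words, modern_words, table):
    """Sum of per-word REBELS translation-model log-probabilities."""
    return sum(table[(h, m)] for h, m in zip(hist_words, modern_words))

def lm_logprob(words, bigram_logprobs):
    """Bigram language-model log-probability of the candidate sentence."""
    padded = ["<s>"] + words
    return sum(bigram_logprobs[pair] for pair in zip(padded, padded[1:]))

def relevance_score(hist, cand, table, bigrams):
    """Weighted combination of translation-model and language-model scores."""
    return C_TM * tm_logprob(hist, cand, table) + C_LM * lm_logprob(cand, bigrams)
```

Scoring a candidate such as "he knows my" for the historical "he knuth my" then reduces to two dictionary look-up sums, one per model.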
19. Unsupervised Generation of the Training Pairs
● We use similarity search to generate training pairs:
– For each historical word H:
● If H is a modern word, then generate (H, H); otherwise
● find each modern word M at Levenshtein distance 1 from H and generate (H, M); if no modern words are found, then
● find each modern word M at distance 2 from H and generate (H, M); if no modern words are found, then
● find each modern word M at distance 3 from H and generate (H, M).
● If more than 6 modern words were generated for H, do not use the corresponding pairs for training.
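The generation scheme above can be sketched with a plain dynamic-programming edit distance. The authors use an efficient similarity search over the lexicon; the brute-force scan and the candidate cap of 6 below follow the slide, not their implementation:

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def training_pairs(h, modern_lexicon, max_candidates=6):
    """Generate (historical, modern) pairs, widening the distance only on failure."""
    if h in modern_lexicon:
        return [(h, h)]
    for d in (1, 2, 3):
        matches = [m for m in modern_lexicon if levenshtein(h, m) <= d]
        if matches:
            # Too many candidates: the word is too ambiguous to train on.
            return [] if len(matches) > max_candidates else [(h, m) for m in matches]
    return []
```

For example, with a modern lexicon containing "she" and "see", the historical word "shee" yields the pairs (shee, she) and (shee, see), both at distance 1.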
24. Normalisation of the 1641 Depositions: Experimental Results

| Method | Generation of REBELS Training Pairs | Spelling Probabilities | Language Model | Accuracy | BLEU |
|--------|-------------------------------------|------------------------|--------------------|----------|-------|
| 1 | ---- | ---- | ---- | 75.59 | 50.31 |
| 2 | Unsupervised | NO | YES | 67.84 | 45.52 |
| 3 | Unsupervised | YES | NO | 79.18 | 56.55 |
| 4 | Unsupervised | YES | YES | 81.79 | 61.88 |
| 5 | Unsupervised | Supervised trained | Supervised trained | 84.82 | 68.78 |
| 6 | Supervised | Supervised trained | Supervised trained | 93.96 | 87.30 |
26. Thank You!
Comments / Questions?
ACKNOWLEDGEMENTS
The reported research work is supported by the project CULTURA, grant 269973, and the project AComIn, grant 316087, both funded by the FP7 Programme.