An approach to unsupervised
historical text normalisation
Petar Mitankin
Sofia University
FMI
Stefan Gerdjikov
Sofia University
FMI
Stoyan Mihov
Bulgarian Academy
of Sciences
IICT
DATeCH 2014, May 19-20, Madrid, Spain
Contents
● Supervised Text Normalisation
– CULTURA
– REBELS Translation Model
– Functional Automata
● Unsupervised Text Normalisation
– Unsupervised REBELS
– Experimental Results
– Future Improvements
Co-funded under the 7th Framework Programme of the European Commission
● Maye - 34 occurrences in the 1641 Depositions
(8022 documents, 17th-century Early Modern English)
● CULTURA: CULTivating Understanding and
Research through Adaptivity
● Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND
TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA
DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA
UNIVERSITY ST KLIMENT OHRIDSKI
Supervised Text Normalisation
● Manually created ground truth
– 500 documents from the 1641 Depositions
– All words: 205 291
– Normalised words: 51 133
● Statistical Machine Translation from historical
language to modern language combines:
– Translation model
– Language model
REgularities Based Embedding of
Language Structures
[Diagram] shee → REBELS Translation Model → he / -1.89, se / -1.69, she / -9.75, shea / -10.04
Automatic Extraction of
Historical Spelling Variations
Training of
The REBELS Translation Model
● Training pairs from the ground truth:
(shee, she), (maye, may), (she, she),
(tyme, time), (saith, says), (have, have),
(tho:, thomas), ...
Training of
The REBELS Translation Model
● Deterministic structure of all historical/modern
subwords
● Each word has several hierarchical
decompositions in the DAWG:
[Diagram] Hierarchical decomposition of each historical word; hierarchical decomposition of each modern word
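As a rough illustration of what "several decompositions" of a word means, the sketch below enumerates all segmentations of a word over a small subword inventory. This is a flat stand-in for the paper's hierarchical, DAWG-based decompositions, and the inventory `subs` is hypothetical.

```python
from functools import lru_cache

def decompositions(word, subwords):
    """Enumerate all segmentations of `word` into known subwords --
    a flat stand-in for the hierarchical decompositions in the DAWG."""
    subwords = frozenset(subwords)

    @lru_cache(maxsize=None)
    def rec(i):
        # All segmentations of word[i:]; one empty segmentation at the end.
        if i == len(word):
            return [[]]
        out = []
        for j in range(i + 1, len(word) + 1):
            piece = word[i:j]
            if piece in subwords:
                out.extend([piece] + rest for rest in rec(j))
        return out

    return rec(0)

# Toy subword inventory (hypothetical).
subs = {"kn", "ow", "know", "th", "owth"}
print(decompositions("knowth", subs))
```

With this inventory, "knowth" has three decompositions: [kn, ow, th], [kn, owth], and [know, th].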
Training of
The REBELS Translation Model
● For each training pair (knowth, knows) we find a
mapping between the decompositions:
● We collect statistics about
historical subword -> modern subword
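A minimal sketch of the statistics-collection step, assuming a simple character-level diff alignment (via `difflib`) instead of the paper's mapping between hierarchical decompositions:

```python
from collections import Counter
from difflib import SequenceMatcher

def collect_substitutions(pairs):
    """Count historical-subword -> modern-subword replacements.
    Simplified: aligns the pair with a character-level diff, not with
    the REBELS mapping between hierarchical decompositions."""
    stats = Counter()
    for hist, modern in pairs:
        sm = SequenceMatcher(None, hist, modern)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op != "equal":
                # Non-matching span: record the rewrite it implies.
                stats[(hist[i1:i2], modern[j1:j2])] += 1
    return stats

# Training pairs taken from the slide examples.
pairs = [("shee", "she"), ("maye", "may"), ("tyme", "time"), ("knowth", "knows")]
stats = collect_substitutions(pairs)
print(stats)  # e.g. ('e', '') seen twice, ('th', 's') once, ('y', 'i') once
```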
REgularities Based Embedding of
Language Structures
[Diagram] shee → REBELS Translation Model → he / -1.89, se / -1.69, she / -9.75, shea / -10.04
REBELS generates
normalisation candidates for
unseen historical words
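A toy illustration of candidate generation: given a rule table of historical→modern subword rewrites with probabilities (the table here is hypothetical, not from the paper), apply each matching rule and rank the resulting candidates by log-probability, mirroring the scored outputs on the slide.

```python
import math

# Hypothetical rule table: (historical subword, modern subword) -> probability.
rules = {("ee", "e"): 0.6, ("ee", "ea"): 0.1, ("sh", "s"): 0.2, ("sh", "sh"): 0.8}

def candidates(word, rules):
    """Generate normalisation candidates by applying one rule per match,
    scored with log-probabilities (a toy stand-in for the REBELS TM)."""
    out = {}
    for (h, m), p in rules.items():
        i = word.find(h)
        while i != -1:
            cand = word[:i] + m + word[i + len(h):]
            score = math.log(p)
            # Keep the best score if several rules yield the same candidate.
            if cand not in out or score > out[cand]:
                out[cand] = score
            i = word.find(h, i + 1)
    return sorted(out.items(), key=lambda kv: -kv[1])

print(candidates("shee", rules))
```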
[Diagram] Each word of "shee knowth me" is sent through REBELS: shee → REBELS, knowth → REBELS, me → REBELS
relevance score(he knuth my) =
  REBELS TM(he knuth my) * C_tm +
  Statistical Language Model(he knuth my) * C_lm
Combination of REBELS with Statistical
Bigram Language Model
● Bigram Statistical Model
– Smoothing: Absolute Discounting, Backing-off
– Gutenberg English language corpus
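The relevance score on this slide is a linear combination of the two model scores. A minimal sketch, with illustrative log-probabilities that are not taken from the paper:

```python
def relevance_score(tm_logprob, lm_logprob, c_tm, c_lm):
    """Linear combination of translation-model and language-model
    log scores, as on the slide: TM * C_tm + LM * C_lm."""
    return tm_logprob * c_tm + lm_logprob * c_lm

# Toy numbers (illustrative only): two candidate normalisations of
# "shee knowth me". A fluent candidate gets a much better LM score.
cands = {
    "she knows me": relevance_score(-3.1, -8.2, 1.0, 0.5),   # -7.2
    "he knuth my":  relevance_score(-2.0, -15.0, 1.0, 0.5),  # -9.5
}
best = max(cands, key=cands.get)
print(best)  # 'she knows me'
```

The weights C_tm and C_lm are exactly the parameters that the functional-automata training on the next slides optimises.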
Functional Automata
L(C_tm, C_lm) is represented with
Functional Automata
Automatic Construction of
Functional Automaton For The
Partial Derivative w.r.t. x
L(C_tm, C_lm) is optimised with the Conjugate
Gradient method
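A generic sketch of the optimiser named on the slide: Fletcher-Reeves nonlinear conjugate gradient with a crude backtracking line search, run on a toy differentiable stand-in for L(C_tm, C_lm). The paper's functional-automata construction of the partial derivatives is not reproduced here.

```python
def conjugate_gradient(f, grad, x0, iters=200, tol=1e-10):
    """Fletcher-Reeves nonlinear conjugate gradient with backtracking.
    A generic sketch, not the paper's functional-automata machinery."""
    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]
    for _ in range(iters):
        if sum(gi * gi for gi in g) < tol:
            break
        # Backtracking: halve the step until the objective stops increasing.
        t, fx = 1.0, f(x)
        while f([xi + t * di for xi, di in zip(x, d)]) > fx and t > 1e-12:
            t *= 0.5
        x = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x)
        beta = sum(gi * gi for gi in g_new) / sum(gi * gi for gi in g)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        # Restart with steepest descent if d is not a descent direction.
        if sum(gn * di for gn, di in zip(g_new, d)) >= 0:
            d = [-gn for gn in g_new]
        g = g_new
    return x

# Toy objective standing in for L(C_tm, C_lm); its minimum is at (1.0, 0.5).
f = lambda x: (x[0] - 1.0) ** 2 + 2.0 * (x[1] - 0.5) ** 2
grad = lambda x: [2.0 * (x[0] - 1.0), 4.0 * (x[1] - 0.5)]
c_tm, c_lm = conjugate_gradient(f, grad, [0.0, 0.0])
print(c_tm, c_lm)
```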
Supervised Text Normalisation
[Diagram] Ground Truth → Training Module Based on Functional Automata → REBELS Translation Model; Historical text → Search Module Based on Functional Automata (with the REBELS Translation Model) → Normalised text
Unsupervised Text Normalisation
[Diagram] Historical text → Unsupervised Generation of Training Pairs, e.g. (knoweth, knows) → REBELS Translation Model; Historical text → Search Module Based on Functional Automata → Normalised text
Unsupervised Generation of the
Training Pairs
● We use similarity search to generate training pairs:
– For each historical word H:
● If H is a modern word, then generate (H, H); else
● Find each modern word M that is at Levenshtein distance 1 from H and generate (H, M). If no modern words are found, then
● Find each modern word M that is at distance 2 from H and generate (H, M). If no modern words are found, then
● Find each modern word M that is at distance 3 from H and generate (H, M).
● If too many (more than 5) modern words were generated for H, then do not use the corresponding pairs for training.
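The pair-generation procedure above can be sketched directly. This version scans a small modern lexicon by brute force (a real system would more plausibly use Levenshtein automata for the similarity search), and the candidate cap of 5 follows the final version of the slide.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def training_pairs(historical, modern_lexicon, max_candidates=5):
    """Generate (historical, modern) pairs as on the slide: identity if
    H is modern; otherwise modern words at distance 1, then 2, then 3;
    discard H entirely if it yields too many candidates."""
    pairs = []
    for h in historical:
        if h in modern_lexicon:
            pairs.append((h, h))
            continue
        for d in (1, 2, 3):
            cands = [m for m in modern_lexicon if levenshtein(h, m) == d]
            if cands:
                if len(cands) <= max_candidates:
                    pairs.extend((h, m) for m in cands)
                break  # no fallback to larger distances once some are found
    return pairs

# Toy lexicon and historical words (illustrative only).
lexicon = {"she", "see", "may", "time", "knows"}
pairs = training_pairs(["shee", "maye", "she", "tyme"], lexicon)
print(pairs)
```

Note that "shee" yields two candidates ("she" and "see"), both of which are kept as training pairs; the statistics collected over many such noisy pairs are what let the model prefer the systematic rewrites.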
Normalisation of the 1641 Depositions: Experimental results

Method | Generation of REBELS Training Pairs | Spelling Probabilities | Language Model | Accuracy | BLEU
1 | ---- | ---- | ---- | 75.59 | 50.31
2 | Unsupervised | NO | YES | 67.84 | 45.52
3 | Unsupervised | YES | NO | 79.18 | 56.55
4 | Unsupervised | YES | YES | 81.79 | 61.88
5 | Unsupervised | Supervised Trained | Supervised Trained | 84.82 | 68.78
6 | Supervised | Supervised Trained | Supervised Trained | 93.96 | 87.30
Future Improvements
[Diagram] Historical text → Unsupervised Generation of Training Pairs, e.g. (knoweth, knows) with probabilities → MAP Training Module → REBELS Translation Model; Historical text → Search Module Based on Functional Automata → Normalised text
Thank You!
Comments / Questions?
ACKNOWLEDGEMENTS
The reported research work is supported by
the project CULTURA, grant 269973, funded by the FP7 Programme, and
the project AComIn, grant 316087, funded by the FP7 Programme.
Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation