An approach to unsupervised
historical text normalisation
Petar Mitankin
Sofia University
FMI
Stefan Gerdjikov
Sofia University
FMI
Stoyan Mihov
Bulgarian Academy
of Sciences
IICT
DATeCH 2014, May 19-20, Madrid, Spain
Contents
● Supervised Text Normalisation
– CULTURA
– REBELS Translation Model
– Functional Automata
● Unsupervised Text Normalisation
– Unsupervised REBELS
– Experimental Results
– Future Improvements
Co-funded under the 7th Framework Programme of the European Commission
● Maye - 34 occurrences in the 1641 Depositions
(8022 documents, 17th-century Early Modern English)
● CULTURA: CULTivating Understanding and
Research through Adaptivity
● Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND
TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA
DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA
UNIVERSITY ST KLIMENT OHRIDSKI
Supervised Text Normalisation
● Manually created ground truth
– 500 documents from the 1641 Depositions
– All words: 205 291
– Normalised words: 51 133
● Statistical Machine Translation from historical
language to modern language combines:
– Translation model
– Language model
REgularities Based Embedding of
Language Structures
[Diagram] shee → REBELS Translation Model → he / -1.89, se / -1.69, she / -9.75, shea / -10.04
Automatic Extraction of
Historical Spelling Variations
Training of
The REBELS Translation Model
● Training pairs from the ground truth:
(shee, she), (maye, may), (she, she),
(tyme, time), (saith, says), (have, have),
(tho:, thomas), ...
Training of
The REBELS Translation Model
● Deterministic structure of all historical/modern
subwords
● Each word has several hierarchical
decompositions in the DAWG:
[Diagram] Hierarchical decomposition of each historical word; hierarchical decomposition of each modern word
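As a rough illustration of what "several decompositions" of a word means, the sketch below enumerates all segmentations of a word over a small subword inventory. This is a flat stand-in for the paper's hierarchical, DAWG-based decompositions, and the inventory `subs` is hypothetical.

```python
from functools import lru_cache

def decompositions(word, subwords):
    """Enumerate all segmentations of `word` into known subwords --
    a flat stand-in for the hierarchical decompositions in the DAWG."""
    subwords = frozenset(subwords)

    @lru_cache(maxsize=None)
    def rec(i):
        # All segmentations of word[i:]; one empty segmentation at the end.
        if i == len(word):
            return [[]]
        out = []
        for j in range(i + 1, len(word) + 1):
            piece = word[i:j]
            if piece in subwords:
                out.extend([piece] + rest for rest in rec(j))
        return out

    return rec(0)

# Toy subword inventory (hypothetical).
subs = {"kn", "ow", "know", "th", "owth"}
print(decompositions("knowth", subs))
```

With this inventory, "knowth" has three decompositions: [kn, ow, th], [kn, owth], and [know, th].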
Training of
The REBELS Translation Model
● For each training pair (knowth, knows) we find a
mapping between the decompositions:
● We collect statistics about
historical subword -> modern subword
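A minimal sketch of the statistics-collection step, assuming a simple character-level diff alignment (via `difflib`) instead of the paper's mapping between hierarchical decompositions:

```python
from collections import Counter
from difflib import SequenceMatcher

def collect_substitutions(pairs):
    """Count historical-subword -> modern-subword replacements.
    Simplified: aligns the pair with a character-level diff, not with
    the REBELS mapping between hierarchical decompositions."""
    stats = Counter()
    for hist, modern in pairs:
        sm = SequenceMatcher(None, hist, modern)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op != "equal":
                # Non-matching span: record the rewrite it implies.
                stats[(hist[i1:i2], modern[j1:j2])] += 1
    return stats

# Training pairs taken from the slide examples.
pairs = [("shee", "she"), ("maye", "may"), ("tyme", "time"), ("knowth", "knows")]
stats = collect_substitutions(pairs)
print(stats)  # e.g. ('e', '') seen twice, ('th', 's') once, ('y', 'i') once
```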
REgularities Based Embedding of
Language Structures
[Diagram] shee → REBELS Translation Model → he / -1.89, se / -1.69, she / -9.75, shea / -10.04
REBELS generates
normalisation candidates for
unseen historical words
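A toy illustration of candidate generation: given a rule table of historical→modern subword rewrites with probabilities (the table here is hypothetical, not from the paper), apply each matching rule and rank the resulting candidates by log-probability, mirroring the scored outputs on the slide.

```python
import math

# Hypothetical rule table: (historical subword, modern subword) -> probability.
rules = {("ee", "e"): 0.6, ("ee", "ea"): 0.1, ("sh", "s"): 0.2, ("sh", "sh"): 0.8}

def candidates(word, rules):
    """Generate normalisation candidates by applying one rule per match,
    scored with log-probabilities (a toy stand-in for the REBELS TM)."""
    out = {}
    for (h, m), p in rules.items():
        i = word.find(h)
        while i != -1:
            cand = word[:i] + m + word[i + len(h):]
            score = math.log(p)
            # Keep the best score if several rules yield the same candidate.
            if cand not in out or score > out[cand]:
                out[cand] = score
            i = word.find(h, i + 1)
    return sorted(out.items(), key=lambda kv: -kv[1])

print(candidates("shee", rules))
```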
[Diagram] Each word of "shee knowth me" is sent through REBELS: shee → REBELS, knowth → REBELS, me → REBELS
relevance score(he knuth my) =
  REBELS TM(he knuth my) * C_tm +
  Statistical Language Model(he knuth my) * C_lm
Combination of REBELS with Statistical
Bigram Language Model
● Bigram Statistical Model
– Smoothing: Absolute Discounting, Backing-off
– Gutenberg English language corpus
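The relevance score on this slide is a linear combination of the two model scores. A minimal sketch, with illustrative log-probabilities that are not taken from the paper:

```python
def relevance_score(tm_logprob, lm_logprob, c_tm, c_lm):
    """Linear combination of translation-model and language-model
    log scores, as on the slide: TM * C_tm + LM * C_lm."""
    return tm_logprob * c_tm + lm_logprob * c_lm

# Toy numbers (illustrative only): two candidate normalisations of
# "shee knowth me". A fluent candidate gets a much better LM score.
cands = {
    "she knows me": relevance_score(-3.1, -8.2, 1.0, 0.5),   # -7.2
    "he knuth my":  relevance_score(-2.0, -15.0, 1.0, 0.5),  # -9.5
}
best = max(cands, key=cands.get)
print(best)  # 'she knows me'
```

The weights C_tm and C_lm are exactly the parameters that the functional-automata training on the next slides optimises.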
Functional Automata
L(C_tm, C_lm) is represented with
Functional Automata
Automatic Construction of
Functional Automaton For The
Partial Derivative w.r.t. x
L(C_tm, C_lm) is optimised with the Conjugate
Gradient method
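A generic sketch of the optimiser named on the slide: Fletcher-Reeves nonlinear conjugate gradient with a crude backtracking line search, run on a toy differentiable stand-in for L(C_tm, C_lm). The paper's functional-automata construction of the partial derivatives is not reproduced here.

```python
def conjugate_gradient(f, grad, x0, iters=200, tol=1e-10):
    """Fletcher-Reeves nonlinear conjugate gradient with backtracking.
    A generic sketch, not the paper's functional-automata machinery."""
    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]
    for _ in range(iters):
        if sum(gi * gi for gi in g) < tol:
            break
        # Backtracking: halve the step until the objective stops increasing.
        t, fx = 1.0, f(x)
        while f([xi + t * di for xi, di in zip(x, d)]) > fx and t > 1e-12:
            t *= 0.5
        x = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x)
        beta = sum(gi * gi for gi in g_new) / sum(gi * gi for gi in g)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        # Restart with steepest descent if d is not a descent direction.
        if sum(gn * di for gn, di in zip(g_new, d)) >= 0:
            d = [-gn for gn in g_new]
        g = g_new
    return x

# Toy objective standing in for L(C_tm, C_lm); its minimum is at (1.0, 0.5).
f = lambda x: (x[0] - 1.0) ** 2 + 2.0 * (x[1] - 0.5) ** 2
grad = lambda x: [2.0 * (x[0] - 1.0), 4.0 * (x[1] - 0.5)]
c_tm, c_lm = conjugate_gradient(f, grad, [0.0, 0.0])
print(c_tm, c_lm)
```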
Supervised Text Normalisation
[Diagram] Ground Truth → Training Module Based on Functional Automata → REBELS Translation Model; Historical text → Search Module Based on Functional Automata (with the REBELS Translation Model) → Normalised text
Unsupervised Text Normalisation
[Diagram] Historical text → Unsupervised Generation of Training Pairs, e.g. (knoweth, knows) → REBELS Translation Model; Historical text → Search Module Based on Functional Automata → Normalised text
Unsupervised Generation of the
Training Pairs
● We use similarity search to generate training pairs:
– For each historical word H:
● If H is a modern word, then generate (H, H); else
● Find each modern word M that is at Levenshtein distance 1 from H and generate (H, M). If no modern words are found, then
● Find each modern word M that is at distance 2 from H and generate (H, M). If no modern words are found, then
● Find each modern word M that is at distance 3 from H and generate (H, M).
● If too many (more than 5) modern words were generated for H, then do not use the corresponding pairs for training.
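The pair-generation procedure above can be sketched directly. This version scans a small modern lexicon by brute force (a real system would more plausibly use Levenshtein automata for the similarity search), and the candidate cap of 5 follows the final version of the slide.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def training_pairs(historical, modern_lexicon, max_candidates=5):
    """Generate (historical, modern) pairs as on the slide: identity if
    H is modern; otherwise modern words at distance 1, then 2, then 3;
    discard H entirely if it yields too many candidates."""
    pairs = []
    for h in historical:
        if h in modern_lexicon:
            pairs.append((h, h))
            continue
        for d in (1, 2, 3):
            cands = [m for m in modern_lexicon if levenshtein(h, m) == d]
            if cands:
                if len(cands) <= max_candidates:
                    pairs.extend((h, m) for m in cands)
                break  # no fallback to larger distances once some are found
    return pairs

# Toy lexicon and historical words (illustrative only).
lexicon = {"she", "see", "may", "time", "knows"}
pairs = training_pairs(["shee", "maye", "she", "tyme"], lexicon)
print(pairs)
```

Note that "shee" yields two candidates ("she" and "see"), both of which are kept as training pairs; the statistics collected over many such noisy pairs are what let the model prefer the systematic rewrites.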
Normalisation of the 1641 Depositions: Experimental results

Method | Generation of REBELS Training Pairs | Spelling Probabilities | Language Model | Accuracy | BLEU
1 | ---- | ---- | ---- | 75.59 | 50.31
2 | Unsupervised | NO | YES | 67.84 | 45.52
3 | Unsupervised | YES | NO | 79.18 | 56.55
4 | Unsupervised | YES | YES | 81.79 | 61.88
5 | Unsupervised | Supervised Trained | Supervised Trained | 84.82 | 68.78
6 | Supervised | Supervised Trained | Supervised Trained | 93.96 | 87.30
Future Improvements
[Diagram] Historical text → Unsupervised Generation of Training Pairs, e.g. (knoweth, knows) with probabilities → MAP Training Module → REBELS Translation Model; Historical text → Search Module Based on Functional Automata → Normalised text
Thank You!
Comments / Questions?
ACKNOWLEDGEMENTS
The reported research work is supported by
the project CULTURA, grant 269973, funded by the FP7 Programme, and
the project AComIn, grant 316087, funded by the FP7 Programme.
Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation