A very high-level poster explaining my research, presented as part of Manchester Doctoral College's research showcase. Lexical simplification is the process of automatically improving the understandability of a text by identifying difficult words and replacing them with easier alternatives. My research has so far exposed the high error rate in this process and has attempted to mitigate these errors.
Lexical Simplification - University of Manchester Postgraduate Summer Research Showcase
Matthew Shardlow
http://lexicalsimplification.blogspot.com/
Abstract
We live in an information-based society where text is ubiquitous.
However, public information is often too difficult for the intended
audience. Increasingly, information is presented
via digital media. Automatic processes can be used to improve the
readability of a text. Lexical simplification makes text easier to
understand. Difficult words are replaced with easier alternatives.
This can be done before a user ever sees the original difficult text.
This PhD focusses on the errors that arise during simplification.
Novel evaluation measures are introduced. A variety of areas will
benefit from automated simplification.
The Pipeline
Input Text
“The campaigner was arrested”
Complex Word Discovery
The campaigner was . . .
• Difficult words identified.
• Depends on context.
Substitution Generation
Campaigner: Protestor, Activist, Advocate
• Find suitable replacements.
• Thesaurus lookup.
Sense Disambiguation
Campaigner: Protestor, Activist, Advocate
• Decide which words will fit.
• Must consider context.
Substitution Ranking
1) Protestor
2) Activist
• Simplest synonym selected.
• Treated as a ranking task.
Output Text
“The protestor was arrested”
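The four stages above can be sketched in a few lines of Python. This is only an illustration of how the stages chain together, not the system used in this research: the frequency counts, the thesaurus entry, the `FITS_CONTEXT` table (a stand-in for a real sense disambiguator) and the threshold are all invented for the example.

```python
# Toy resources. Every value here is an illustrative assumption.
FREQ = {"the": 1000, "was": 900, "arrested": 120,
        "protestor": 60, "activist": 40, "advocate": 25, "campaigner": 10}
THESAURUS = {"campaigner": ["protestor", "activist", "advocate"]}
# Hypothetical stand-in for sense disambiguation: context words each candidate fits.
FITS_CONTEXT = {"protestor": {"arrested"}, "activist": {"arrested"}, "advocate": {"court"}}


def find_complex_words(tokens, threshold=20):
    """Complex word discovery: flag words rarer than the threshold."""
    return [t for t in tokens if FREQ.get(t.lower(), 0) < threshold]


def generate(word):
    """Substitution generation: thesaurus lookup."""
    return THESAURUS.get(word.lower(), [])


def disambiguate(candidates, tokens):
    """Sense disambiguation: keep candidates that fit the surrounding words."""
    context = {t.lower() for t in tokens}
    return [c for c in candidates if FITS_CONTEXT.get(c, set()) & context]


def rank(candidates):
    """Substitution ranking: most frequent (assumed simplest) first."""
    return sorted(candidates, key=lambda c: FREQ.get(c, 0), reverse=True)


def simplify(sentence):
    tokens = sentence.split()
    complex_words = set(find_complex_words(tokens))
    output = []
    for token in tokens:
        if token in complex_words:
            ranked = rank(disambiguate(generate(token), tokens))
            if ranked:
                token = ranked[0]  # take the top-ranked replacement
        output.append(token)
    return " ".join(output)


print(simplify("The campaigner was arrested"))
```

With these toy resources, "advocate" is dropped at the disambiguation stage and "protestor" is ranked above "activist", mirroring the example in the diagram.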
The Applications
Usage: Description
Language Learners: Easy to read material in target language.
Stroke Victims: Access to easy to read information promotes rehabilitation and self confidence.
Medical Patients: Improved access to medical information improves patient knowledge and care.
Consumers: Better understanding of technical legal language in licence agreements.
Academics: Support when reading material from outside of main discipline.
Public Engagement: Tools to help authors produce jargon free text for a lay audience.
The Problem
• Errors occur in the pipeline, affecting text quality.
• Low text quality results in poor understandability.
• The process can result in text being translated to nonsense.
• My research has categorised the errors as follows:
Type 2: A complex or a simple word may be assigned to the
wrong category.
Type 3: No substitutions which would result in a simplification
of the target word are available.
Type 4: Sense disambiguation error. The meaning of the
sentence has changed significantly.
Type 5: Ranking Error. A replacement which does not simplify
the sentence has been selected.
• In a recent study [2] I found the frequency of each error to be:
[Figure: bar chart of Error Frequency (0 % to 70 %) against Error Code (Type 2, Type 3, Type 4, Type 5). Bar labels as extracted: 65.03%, 42.19%, 29.73%, 26.92%.]
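As a rough illustration of how the error categories relate to the pipeline stages, the sketch below assigns each simplification instance the first error it meets in pipeline order and then tallies the results. The field names and the toy instances are hypothetical, and the poster does not state how frequencies were counted in [2], so the counting rule here is an assumption.

```python
from collections import Counter

# Each error type is paired with a hypothetical per-stage check on an instance.
STAGE_CHECKS = [
    ("Type 2", "identified_correctly"),            # complex/simple label is right
    ("Type 3", "simpler_substitution_available"),  # a simplifying candidate exists
    ("Type 4", "meaning_preserved"),               # sense disambiguation succeeded
    ("Type 5", "replacement_is_simpler"),          # ranking picked a simpler word
]


def classify(instance):
    """Return the first error type whose check fails, or None if no error."""
    for error_type, check in STAGE_CHECKS:
        if not instance[check]:
            return error_type
    return None


def error_frequencies(instances):
    """Fraction of all instances assigned to each error type."""
    counts = Counter(classify(i) for i in instances)
    return {etype: counts[etype] / len(instances) for etype, _ in STAGE_CHECKS}


# Invented toy instances: one clean, three with a single failing stage.
ok = {check: True for _, check in STAGE_CHECKS}
instances = [
    ok,
    {**ok, "identified_correctly": False},
    {**ok, "meaning_preserved": False},
    {**ok, "replacement_is_simpler": False},
]
print(error_frequencies(instances))
```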
The Research
Pipeline and Errors
• A literature survey has identified focus areas [1].
• An error study has highlighted the importance of each area [2].
• Ongoing work will refine the error study.
Complex Word Identification
• The CW corpus has been developed using Simple Wikipedia [3].
• Techniques to identify complex words have been evaluated [4].
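One simple way to identify complex words is to flag any word whose corpus frequency falls below a threshold, then score the result against annotated complex words. The sketch below shows that idea only; it is not necessarily one of the techniques evaluated in [4], and the word list, frequencies, gold labels and thresholds are invented for illustration.

```python
def identify_complex(words, freq, threshold):
    """Flag words whose frequency is below the threshold (unseen words count as 0)."""
    return {w for w in words if freq.get(w, 0) < threshold}


def precision_recall(predicted, gold):
    """Precision and recall of the predicted complex words against gold labels."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data. All values are assumptions made for this example.
words = ["the", "campaigner", "was", "arrested", "ubiquitous", "text"]
freq = {"the": 1000, "campaigner": 10, "was": 900,
        "arrested": 120, "ubiquitous": 5, "text": 300}
gold = {"campaigner", "ubiquitous"}

for threshold in (20, 200):
    predicted = identify_complex(words, freq, threshold)
    print(threshold, sorted(predicted), precision_recall(predicted, gold))
```

Raising the threshold flags more words, which in this toy data keeps recall the same but lowers precision because "arrested" is wrongly marked as complex.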
Substitution Generation
• Initial research has shown problems with traditional thesauri.
• Thesaurus augmentation depends on the specific domain.
Word Sense Disambiguation
• Many systems exist for the task of disambiguation.
• Several top disambiguation systems have been evaluated for simplification.
• Research awaiting submission.
Substitution Ranking
• Depends heavily on the context and the user.
• Research will look at the needs of individual users.
Applications
• Simplification will target academic literature.
• Target audience will be lay readers.
References
[1] Shardlow, M. 2014. A Survey of Automated Text Simplification. IJACSA
Special Issue on Natural Language Processing.
[2] Shardlow, M. 2014. Out in the Open: Finding and Categorising Errors in the
Lexical Simplification Pipeline. LREC, Reykjavik, Iceland, May. ELRA.
[3] Shardlow, M. 2013. The CW Corpus: Evaluating the Identification of Complex
Words. PITR, Sofia, Bulgaria, ACL.
[4] Shardlow, M. 2013. A Comparison of Techniques to Automatically Identify
Complex Words. ACL Student Research Workshop, Sofia, Bulgaria, ACL.