Lexical Simplification
Matthew Shardlow
http://lexicalsimplification.blogspot.com/
Abstract
We live in an information-based society where text is ubiquitous.
However, public information is often too difficult for the intended
audience. Increasingly, information is presented
via digital media. Automatic processes can be used to improve the
readability of a text. Lexical simplification makes text easier to
understand. Difficult words are replaced with easier alternatives.
This can be done before a user ever sees the original difficult text.
This PhD focusses on the errors that arise during simplification.
Novel evaluation measures are introduced. A variety of areas will
benefit from automated simplification.
The Pipeline
Input Text: “The campaigner was arrested”
Complex Word Discovery: The campaigner was. . .
• Difficult words identified.
• Depends on context.
Substitution Generation: Campaigner: Protestor, Activist, Advocate
• Find suitable replacements.
• Thesaurus look up.
Sense Disambiguation: Campaigner: Protestor, Activist, Advocate
• Decide which words will fit.
• Must consider context.
Substitution Ranking: 1) Protestor, 2) Activist
• Simplest synonym selected.
• Treated as a ranking task.
Output Text: “The protestor was arrested”
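The four pipeline stages can be sketched in Python as below. This is only an illustration of the data flow, not the system described on the poster: the frequency table, the thesaurus, the context-fit table, and the threshold are all toy values assumed for this example, and word frequency is used as a stand-in for word simplicity.

```python
# Toy resources. Every value here is an assumption made for illustration;
# none of it comes from the studies cited on the poster.
WORD_FREQUENCY = {
    "the": 1000, "was": 900, "arrested": 40,
    "campaigner": 3, "protestor": 12, "activist": 9, "advocate": 5,
}
THESAURUS = {"campaigner": ["protestor", "activist", "advocate"]}
# Hypothetical sense resource: (target word, context cue) -> synonyms that fit.
CONTEXT_FIT = {("campaigner", "arrested"): {"protestor", "activist"}}
COMPLEXITY_THRESHOLD = 10  # words rarer than this count as complex


def find_complex_words(tokens):
    """Complex word discovery: return positions of low-frequency tokens."""
    return [
        i for i, token in enumerate(tokens)
        if WORD_FREQUENCY.get(token.lower(), 0) < COMPLEXITY_THRESHOLD
    ]


def generate_substitutions(word):
    """Substitution generation: a plain thesaurus look-up."""
    return list(THESAURUS.get(word.lower(), []))


def disambiguate(word, candidates, tokens):
    """Sense disambiguation: keep candidates that fit the sentence context."""
    context = {token.lower() for token in tokens}
    allowed = set()
    for (target, cue), fits in CONTEXT_FIT.items():
        if target == word.lower() and cue in context:
            allowed |= fits
    return [c for c in candidates if c in allowed]


def rank(candidates):
    """Substitution ranking: most frequent (assumed simplest) first."""
    return sorted(candidates, key=lambda c: WORD_FREQUENCY.get(c, 0), reverse=True)


def simplify(sentence):
    """Run the four stages over a whitespace-tokenised sentence."""
    tokens = sentence.split()
    for i in find_complex_words(tokens):
        original = tokens[i]
        ranked = rank(disambiguate(original, generate_substitutions(original), tokens))
        # Substitute only when the best candidate is more frequent than the original.
        if ranked and WORD_FREQUENCY.get(ranked[0], 0) > WORD_FREQUENCY.get(original.lower(), 0):
            tokens[i] = ranked[0]
    return " ".join(tokens)


print(simplify("The campaigner was arrested"))
```

Each stage only sees the output of the previous one, so a mistake made early (for example, a word wrongly flagged as complex) is carried through to the final text.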
The Applications
Usage              Description
Language Learners  Easy to read material in target language.
Stroke Victims     Access to easy to read information promotes rehabilitation and self confidence.
Medical Patients   Improved access to medical information improves patient knowledge and care.
Consumers          Better understanding of technical legal language in licence agreements.
Academics          Support when reading material from outside of main discipline.
Public Engagement  Tools to help authors produce jargon free text for a lay audience.
The Problem
• Errors occur in the pipeline, affecting text quality.
• Low text quality results in poor understandability.
• The process can result in text being translated to nonsense.
• My research has categorised the errors as follows:
Type 2: A complex or a simple word may be assigned to the wrong category.
Type 3: No substitutions which would result in a simplification of the target word are available.
Type 4: Sense disambiguation error. The meaning of the sentence has changed significantly.
Type 5: Ranking Error. A replacement which does not simplify the sentence has been selected.
• In a recent study [2] I found the frequency of each error to be:
[Figure: bar chart of error frequency (0% to 70%) by error code (Type 2 to Type 5). Bar labels as extracted: 65.03%, 42.19%, 29.73%, 26.92%.]
The Research
Pipeline and Errors
• A literature survey has identified focus areas [1].
• An error study has highlighted the importance of each area [2].
• Ongoing work will refine the error study.
Complex Word Identification
• The CW corpus has been developed using Simple Wikipedia [3].
• Techniques to identify complex words have been evaluated [4].
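As an illustration of how an identification technique might be scored against annotated data, the sketch below evaluates a frequency threshold using precision and recall. The poster does not say which techniques were compared in [4]; frequency thresholding is assumed here purely as an example, and the frequency table and gold labels are invented toy data.

```python
def predict_complex(words, frequency, threshold):
    """Label a word as complex when its corpus frequency is below the threshold."""
    return {w for w in words if frequency.get(w.lower(), 0) < threshold}


def precision_recall(predicted, gold):
    """Compare predicted complex words with gold-standard annotations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data, assumed for illustration only.
frequency = {"the": 1000, "campaigner": 3, "was": 900,
             "arrested": 40, "ubiquitous": 2, "text": 300}
words = list(frequency)
gold = {"campaigner", "ubiquitous"}  # hypothetical human annotations

for threshold in (10, 50):
    predicted = predict_complex(words, frequency, threshold)
    print(threshold, precision_recall(predicted, gold))
```

A higher threshold flags more words, which in this toy data keeps recall at 1.0 but lowers precision because "arrested" is wrongly labelled as complex.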
Substitution Generation
• Initial research has shown problems with traditional thesauri.
• Thesaurus augmentation depends on the specific domain.
Word Sense Disambiguation
• Many systems exist for the task of disambiguation.
• Several top disambiguation systems have been evaluated for simplification.
• Research awaiting submission.
Substitution Ranking
• Depends heavily on the context and the user.
• Research will look at the needs of individual users.
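One way ranking could take the user into account is sketched below: candidates the reader already knows are preferred, and general frequency breaks ties. The poster does not describe how user needs will be modelled, so the `known_words` set, the frequency values, and the function name are all hypothetical.

```python
def rank_for_user(candidates, frequency, known_words):
    """Order candidates: words known to the user first, then by descending frequency."""
    return sorted(
        candidates,
        key=lambda c: (c not in known_words, -frequency.get(c, 0)),
    )


# Toy values, assumed for illustration only.
frequency = {"protestor": 12, "activist": 9, "advocate": 5}
candidates = ["advocate", "protestor", "activist"]

print(rank_for_user(candidates, frequency, known_words=set()))
print(rank_for_user(candidates, frequency, known_words={"activist"}))
```

With no user information the ranking falls back to frequency alone; once the reader is known to be familiar with "activist", that word moves to the top even though it is less frequent overall.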
Applications
• Simplification will target academic literature.
• Target audience will be lay readers.
References
[1] Shardlow, M. 2014. A Survey of Automated Text Simplification. IJACSA Special Issue on Natural Language Processing.
[2] Shardlow, M. 2014. Out in the Open: Finding and Categorising Errors in the Lexical Simplification Pipeline. LREC, Reykjavik, Iceland, May. ELRA.
[3] Shardlow, M. 2013. The CW Corpus: Evaluating the Identification of Complex Words. PITR, Sofia, Bulgaria, ACL.
[4] Shardlow, M. 2013. A Comparison of Techniques to Automatically Identify Complex Words. ACL Student Research Workshop, Sofia, Bulgaria, ACL.

Lexical Simplification - University of Manchester Postgraduate Summer Research Showcase
