The CW Corpus
A new resource for evaluating the
identification of complex words
Matthew Shardlow
The University of Manchester

http://lexicalsimplification.blogspot.co.uk

Lexical Simplification
● Complex Word Identification
– He profoundly changed.
● Substitution Generation
– Profoundly: extremely, very, deeply, acutely
● Word Sense Disambiguation
– Profoundly: extremely, very, deeply, acutely
● Synonym Ranking
– #1) deeply
– #2) extremely
– #3) acutely

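The four stages of this slide can be sketched as a small pipeline. This is a toy illustration, not the author's implementation: every function name and lookup table below is hypothetical, and the tables are chosen only so that the output matches the ranking shown on the slide (the slide does not say why "very" is missing from the final ranking; here the disambiguation stage is assumed to drop it).

```python
# Toy lexical simplification pipeline. All tables are made-up stand-ins
# for real resources (a CWI model, a thesaurus, a WSD system, a ranker).
COMPLEX = {"profoundly"}  # hypothetical output of complex word identification
SYNONYMS = {"profoundly": ["extremely", "very", "deeply", "acutely"]}
FITS_CONTEXT = {("profoundly", "changed"): {"extremely", "deeply", "acutely"}}
SIMPLICITY = {"deeply": 3, "extremely": 2, "acutely": 1}  # higher = simpler


def identify_complex(tokens):
    """Stage 1: return the indices of tokens flagged as complex."""
    return [i for i, tok in enumerate(tokens) if tok.lower() in COMPLEX]


def generate_substitutions(word):
    """Stage 2: look up candidate replacements."""
    return SYNONYMS.get(word.lower(), [])


def disambiguate(word, next_word, candidates):
    """Stage 3: keep only candidates that fit the context (toy lookup)."""
    allowed = FITS_CONTEXT.get((word.lower(), next_word.lower()), set())
    return [c for c in candidates if c in allowed]


def rank(candidates):
    """Stage 4: order candidates from simplest to least simple."""
    return sorted(candidates, key=lambda c: SIMPLICITY.get(c, 0), reverse=True)


def simplify(sentence):
    """Run all four stages and substitute the top-ranked candidate."""
    tokens = sentence.rstrip(".").split()
    for i in identify_complex(tokens):
        next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
        candidates = generate_substitutions(tokens[i])
        ranked = rank(disambiguate(tokens[i], next_word, candidates))
        if ranked:
            tokens[i] = ranked[0]
    return " ".join(tokens) + "."


print(simplify("He profoundly changed."))  # He deeply changed.
```

Because each stage consumes the output of the previous one, a mistake in stage 1 cannot be repaired later in the pipeline.
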
Complex Words
● How do we define a Complex Word?
● Manual Definition
– Any word which impedes a reader's comprehension of a text.
● Heuristic Features
– Frequency
– Familiarity
– Length
– Context

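As a rough illustration, two of the heuristic features (frequency and length) can be computed from a reference corpus. The `features` and `looks_complex` functions, the tiny `reference_text`, and the threshold values are all assumptions made for this sketch; familiarity and context need external resources and are left out.

```python
from collections import Counter

# Hypothetical reference text standing in for a large corpus.
reference_text = (
    "he changed very much and he changed very deeply "
    "she changed deeply and he profoundly changed"
)
counts = Counter(reference_text.split())


def features(word):
    """Return (corpus frequency, character length) for a word."""
    w = word.lower()
    return counts[w], len(w)


def looks_complex(word, max_freq=1, min_len=8):
    """Toy rule: words that are both rare and long are flagged as complex."""
    freq, length = features(word)
    return freq <= max_freq and length >= min_len


print(features("profoundly"), looks_complex("profoundly"))  # (1, 10) True
print(features("deeply"), looks_complex("deeply"))          # (2, 6) False
```
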
Complex Word Identification
● Important to get it right: Propagation errors
– Correct: He profoundly changed → He deeply changed
– Incorrect: He profoundly changed → He profoundly turned
● No evaluation data.
● Gold standard data required.

Gold Standard Data
● Criteria for corpus entries:
– Annotated Sentences.
– Coherent English.
– One complex word per sentence.
● Difficult to generate automatically.
● Expensive to generate manually.
● So, we mine Simple Wikipedia Edit Histories.

Simple Wikipedia Edit Histories
● Simple Wikipedia is:
– An online encyclopedia.
– Written in simplified English.
– Collaboratively edited.
– Available to download in XML format.
● Changes to articles recorded in edit histories.
● Some changes are simplifications.

Simple Wikipedia Edit Histories
● Advantages:
– Fully automated
– High throughput
– Cost-effective
● Disadvantages:
– Content quality
– Sparsity of simplifications
– Data exhaustion

Mining – Extract Likely Candidates
● There are 2 stages to the mining process.
● Stage 1:
– 2 adjacent revisions are selected.
– A similarity score (TF-IDF) is calculated at sentence level.
– High scoring pairs passed on.
– All other pairs discarded.

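A minimal sketch of Stage 1, assuming cosine similarity over TF-IDF vectors, a smoothed IDF, and an illustrative threshold of 0.5. The slides do not state the exact weighting or cut-off, and the function names are hypothetical. Skipping unchanged sentences is also an assumption of this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Build one TF-IDF vector (term -> weight) per sentence."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    # Document frequency: number of sentences containing each term.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF so that very common terms keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_pairs(old_rev, new_rev, threshold=0.5):
    """Pair each old sentence with its most similar new sentence."""
    vecs = tfidf_vectors(old_rev + new_rev)
    old_vecs, new_vecs = vecs[:len(old_rev)], vecs[len(old_rev):]
    pairs = []
    if not new_vecs:
        return pairs
    for i, ov in enumerate(old_vecs):
        scores = [(cosine(ov, nv), j) for j, nv in enumerate(new_vecs)]
        score, j = max(scores)
        # Keep high-scoring pairs; identical sentences were not edited.
        if score >= threshold and old_rev[i] != new_rev[j]:
            pairs.append((old_rev[i], new_rev[j]))
    return pairs


old = ["He profoundly changed", "The cat sat on the mat"]
new = ["He deeply changed", "The cat sat on the mat"]
print(candidate_pairs(old, new))
```

On this toy input only the edited sentence pair survives; the unchanged sentence is skipped and the unrelated pairings score zero.
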
Mining – Validate Candidates
● There are 2 stages to the mining process.
● Stage 2: A series of checks
– One word difference.
– Real words. (not: spam / vandalism / nonsense)
– Different stems.
– Synonyms.
– Simplifying.

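The Stage 2 checks can be sketched as a chain of filters applied in order. Only the one-word-difference test is concrete here; the vocabulary, the crude suffix-stripping `stem`, the synonym table, and the frequency table are hypothetical stand-ins for whatever dictionary, stemmer, thesaurus, and simplicity measure a real system would use.

```python
# Hypothetical resources for illustration only.
VOCAB = {"cool", "cooler", "long", "calm", "deeply", "profoundly"}
SYNONYM_PAIRS = {
    frozenset(p) for p in [("calm", "cool"), ("profoundly", "deeply")]
}
FREQ = {"cool": 40, "calm": 50, "deeply": 90, "profoundly": 5}


def one_word_difference(old, new):
    """Return (old_word, new_word) if exactly one token differs, else None."""
    a, b = old.split(), new.split()
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def stem(word):
    """Crude suffix stripper, a placeholder for a real stemmer."""
    for suffix in ("er", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def validate(old, new):
    """Return the name of the first failed check, or 'accepted'."""
    diff = one_word_difference(old, new)
    if diff is None:
        return "one word difference"
    w_old, w_new = (w.lower() for w in diff)
    if w_old not in VOCAB or w_new not in VOCAB:
        return "real words"
    if stem(w_old) == stem(w_new):
        return "different stems"
    if frozenset((w_old, w_new)) not in SYNONYM_PAIRS:
        return "synonyms"
    # Assumed simplicity test: the new word must be more frequent.
    if FREQ.get(w_new, 0) <= FREQ.get(w_old, 0):
        return "simplifying"
    return "accepted"


print(validate("He profoundly changed.", "He deeply changed."))      # accepted
print(validate("It was a cuol evening.", "It was a cool evening."))  # real words
```

Ordering the checks from cheapest to most resource-dependent means most noisy pairs are rejected before any thesaurus or frequency lookup is needed.
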
Analysis
● Six Annotators
● Each given a 70 instance sample.
– 50 examples from the corpus (different for each).
– 20 common examples as a validation set.
● 2 annotators ruled out by validation set.
● Final corpus accuracy of: 97.5%.

Experiments
● Several experiments performed so far.
● Presented at ACL Student Research Workshop.
● 3 techniques for identification were compared.
● Sophisticated strategies gave little or no improvement over a baseline.

Summary
● Identifying Complex Words is important.
● The CW Corpus lets us evaluate methods.
● Preliminary results give little improvement.

References
● Corpus: http://tinyurl.com/cwcorpus

S. Devlin and J. Tait. The use of a psycholinguistic database in the simplification of text for aphasic readers. Linguistic Databases, p 161–173, 1998.
M. Yatskar, B. Pang, C. Danescu-Niculescu-Mizil, and L. Lee. For the sake of simplicity: unsupervised extraction of lexical simplifications from Wikipedia. In HLT ’10 NAACL, p 365–368, Stroudsburg, PA, USA, 2010.

Any Questions?
● Corpus: http://tinyurl.com/cwcorpus

Annotator Agreement

Annotator Index    Kappa    Sample Accuracy
1                  1        98%
2                  1        96%
3                  0.4      70%
4                  1        100%
5                  0.6      84%
6                  1        96%

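The annotator agreement figures and the reported corpus accuracy fit together arithmetically. As a quick check, assume the two excluded annotators are the ones with the lowest kappa (3 and 5) and that the final figure is the mean sample accuracy of the remaining four. The cut-off of 0.8 below is an illustrative assumption; the slides do not state one.

```python
# (annotator index, kappa, sample accuracy in %) as reported in the slides
rows = [
    (1, 1.0, 98),
    (2, 1.0, 96),
    (3, 0.4, 70),
    (4, 1.0, 100),
    (5, 0.6, 84),
    (6, 1.0, 96),
]

KAPPA_CUTOFF = 0.8  # assumed threshold, not given in the slides

# Keep only annotators whose agreement on the validation set is high enough.
kept = [row for row in rows if row[1] >= KAPPA_CUTOFF]
corpus_accuracy = sum(acc for _, _, acc in kept) / len(kept)

print([idx for idx, _, _ in kept])  # [1, 2, 4, 6]
print(corpus_accuracy)              # 97.5
```

Under these assumptions the result matches the 97.5% accuracy reported for the final corpus.
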
Example Discarded Pairs
● It was a _____ evening.
● Nonsense Words (spelling correction)
– Cuol → Cool
● Different Stems (sense correction)
– Cooler → Cool
● Synonymy (meaning change)
– Long → Cool
● Simplifying
– Calm → Cool

The CW Corpus PITR2013