The CW Corpus is a medium-sized resource containing examples of difficult words in context, along with suggested simplifications. It was produced by mining Simple Wikipedia edit histories for instances of simplification. This talk was given at the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR 2013) in Sofia, Bulgaria, in 2013. An associated paper is available at: http://aclweb.org/anthology/W/W13/W13-2908.pdf
The CW Corpus PITR2013
1. The CW Corpus
A new resource for evaluating the identification of complex words
Matthew Shardlow
The University of Manchester
http://lexicalsimplification.blogspot.co.uk
5. Lexical Simplification
● Complex Word Identification
– He profoundly changed.
● Substitution Generation
– Profoundly: extremely, very, deeply, acutely
● Word Sense Disambiguation
– Profoundly: extremely, very, deeply, acutely
● Synonym Ranking
– #1) deeply
– #2) extremely
– #3) acutely
8. Complex Words
● How do we define a Complex Word?
● Manual Definition
– Any word which impedes a reader's comprehension of a text.
● Heuristic Features
– Frequency
– Familiarity
– Length
– Context
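The heuristic features listed on the Complex Words slide can be illustrated with a small sketch. It computes two of them, frequency and length, for each word of a sentence. The toy reference corpus, the helper names `word_features` and `flag_complex`, and both thresholds are assumptions made for illustration, not values from the talk. Familiarity and context need external resources and are left out.

```python
import re
from collections import Counter

# Toy reference corpus (assumption): a real system would count words
# in a large corpus rather than in four short sentences.
REFERENCE_TEXT = (
    "he changed a lot . he was very happy . "
    "the water was very deep . he deeply cared ."
)
FREQ = Counter(re.findall(r"[a-z]+", REFERENCE_TEXT.lower()))


def word_features(word):
    """Return (frequency in the toy corpus, length in characters)."""
    w = word.lower()
    return FREQ[w], len(w)


def flag_complex(sentence, max_freq=0, min_length=8):
    """Flag words that are both rare and long (hypothetical thresholds)."""
    flagged = []
    for word in re.findall(r"[A-Za-z]+", sentence):
        freq, length = word_features(word)
        if freq <= max_freq and length >= min_length:
            flagged.append(word)
    return flagged


print(flag_complex("He profoundly changed."))
```

With this toy corpus only "profoundly" is flagged, since it is unseen and ten characters long. Combining features with a simple threshold is one possible design; a trained classifier is another.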
11. Complex Word Identification
● Important to get it right: Propagation errors
– Correct: He profoundly changed → He deeply changed
– Incorrect: He profoundly changed → He profoundly turned
● No evaluation data.
● Gold standard data required.
14. Gold Standard Data
● Criteria for corpus entries:
– Annotated Sentences.
– Coherent English.
– One complex word per sentence.
● Difficult to generate automatically.
● Expensive to generate manually.
● So, we mine Simple Wikipedia Edit Histories.
16. Simple Wikipedia Edit Histories
● Simple Wikipedia is:
– An online encyclopedia.
– Written in simplified English.
– Collaboratively edited.
– Available to download in XML format.
● Changes to articles recorded in edit histories.
● Some changes are simplifications.
21. Mining – Extract Likely Candidates
● There are 2 stages to the mining process.
● Stage 1:
– 2 adjacent revisions are selected.
– A similarity score (TF-IDF) is calculated at sentence level.
– High scoring pairs passed on.
– All other pairs discarded.
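Stage 1 can be sketched as follows: sentences from two adjacent revisions are turned into TF-IDF vectors, every old/new pair is scored by cosine similarity, and only high-scoring pairs are kept. The slides name TF-IDF but give no further detail, so the tokenisation, the smoothed IDF computed over the sentences of both revisions, the 0.5 threshold, and the decision to skip identical sentences are all assumptions. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def tfidf_vectors(sentences):
    """Build one TF-IDF vector (a dict) per sentence."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (assumption) so widely shared terms keep some weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def likely_candidates(old_revision, new_revision, threshold=0.5):
    """Return (old, new, score) for changed sentence pairs at or above threshold."""
    vectors = tfidf_vectors(old_revision + new_revision)
    old_vecs = vectors[:len(old_revision)]
    new_vecs = vectors[len(old_revision):]
    pairs = []
    for old, ov in zip(old_revision, old_vecs):
        for new, nv in zip(new_revision, new_vecs):
            score = cosine(ov, nv)
            # Identical sentences are not edits, so they are skipped (assumption).
            if old != new and score >= threshold:
                pairs.append((old, new, score))
    return pairs


old_rev = ["He profoundly changed.", "It is an online encyclopedia."]
new_rev = ["He deeply changed.", "It is an online encyclopedia."]
for old, new, score in likely_candidates(old_rev, new_rev):
    print(f"{score:.2f}  {old!r} -> {new!r}")
```

Comparing every old sentence with every new one is quadratic in article length. That is acceptable for a sketch, but a real run over full edit histories would probably need some pruning.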
27. Mining – Validate Candidates
● There are 2 stages to the mining process.
● Stage 2: A series of checks
– One word difference.
– Real words. (not: spam / vandalism / nonsense)
– Different stems.
– Synonyms.
– Simplifying.
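The Stage 2 checks can be sketched as a chain of filters applied to each candidate sentence pair. The slides do not say which dictionary, stemmer, thesaurus or simplicity measure was used, so everything below is a stand-in. The lexicon, synonym set and frequency table are toy assumptions, `crude_stem` is a rough suffix stripper rather than a real stemmer, and "simplifying" is approximated by requiring the replacement to be more frequent.

```python
import re

# Toy resources (assumptions): stand-ins for a dictionary, a thesaurus
# and a frequency list, none of which are named in the slides.
LEXICON = {"he", "profoundly", "deeply", "changed", "changes", "turned"}
SYNONYMS = {frozenset({"profoundly", "deeply"})}
FREQUENCY = {"profoundly": 5, "deeply": 50}


def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def one_word_difference(old, new):
    """Return the (old_word, new_word) pair if exactly one token differs."""
    a, b = tokenize(old), tokenize(new)
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def crude_stem(word):
    """Very rough suffix stripping; a real stemmer would replace this."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def validate(old, new):
    """Apply the five checks in order; return the word pair or None."""
    pair = one_word_difference(old, new)
    if pair is None:
        return None                      # one word difference
    complex_word, simple_word = pair
    if complex_word not in LEXICON or simple_word not in LEXICON:
        return None                      # real words
    if crude_stem(complex_word) == crude_stem(simple_word):
        return None                      # different stems
    if frozenset(pair) not in SYNONYMS:
        return None                      # synonyms
    if FREQUENCY.get(simple_word, 0) <= FREQUENCY.get(complex_word, 0):
        return None                      # simplifying (frequency proxy)
    return pair


print(validate("He profoundly changed.", "He deeply changed."))
print(validate("He profoundly changed.", "He profoundly turned."))
```

The first call passes every check and returns the pair, while the second is rejected at the synonym check. Ordering the checks from cheapest to most expensive keeps the costly lookups for the few pairs that survive.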
30. Analysis
● Six Annotators
● Each given a 70 instance sample.
– 50 examples from the corpus (different for each).
– 20 common examples as a validation set.
● 2 annotators ruled out by validation set.
● Final corpus accuracy of: 97.5%.
33. Experiments
● Several experiments performed so far.
● Presented at ACL Student Research Workshop.
● 3 techniques for identification were compared.
● Sophisticated strategies gave little or no improvement over a baseline.
34. Summary
● Identifying Complex Words is important.
● The CW Corpus lets us evaluate methods.
● Preliminary results show little improvement over a baseline.
35. References
● Corpus: http://tinyurl.com/cwcorpus
S. Devlin and J. Tait. The use of a psycholinguistic database in the simplification of text for aphasic readers. Linguistic Databases, p 161–173, 1998.
M. Yatskar, B. Pang, C. Danescu-Niculescu-Mizil, and L. Lee. For the sake of simplicity: unsupervised extraction of lexical simplifications from Wikipedia. In HLT ’10 NAACL, p 365–368, Stroudsburg, PA, USA, 2010.