The CW Corpus PITR2013

The CW Corpus
A new resource for evaluating the
identification of complex words
Matthew Shardlow
The University of Manchester

http://lexicalsimplification.blogspot.co.uk

Lexical Simplification
● Complex Word Identification
– He profoundly changed.
● Substitution Generation
– Profoundly: extremely, very, deeply, acutely
● Word Sense Disambiguation
– deeply, acutely
● Synonym Ranking
– #1) deeply
– #2) extremely
– #3) acutely

Complex Words
● How do we define a Complex Word?
● Manual Definition
– Any word which impedes a reader's comprehension of a text.
● Heuristic Features
– Frequency
– Familiarity
– Length
– Context
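Two of the heuristic features listed above, frequency and length, are simple to compute. The sketch below is illustrative only: the tiny reference text and the `word_features` helper are assumptions standing in for a real corpus and feature set, and it does not cover familiarity or context.

```python
from collections import Counter

# Toy reference text standing in for a large corpus (an assumption for illustration).
REFERENCE_TEXT = (
    "he changed a lot . she changed very little . "
    "he was deeply moved . it changed him deeply ."
)

def word_features(word, counts):
    """Return the frequency and length features for a single word."""
    w = word.lower()
    # Counter returns 0 for words that never occur in the reference text.
    return {"frequency": counts[w], "length": len(w)}

counts = Counter(REFERENCE_TEXT.split())
for word in ["He", "profoundly", "changed"]:
    print(word, word_features(word, counts))
```

A rare, long word such as "profoundly" gets a low frequency and a high length here, which is the kind of signal these heuristics rely on.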

Complex Word Identification
● Important to get it right: errors propagate to later stages.
– Original: He profoundly changed
– Correct: He deeply changed
– Incorrect: He profoundly turned
● No evaluation data.
● Gold standard data required.

Gold Standard Data
● Criteria for corpus entries:
– Annotated Sentences.
– Coherent English.
– One complex word per sentence.
● Difficult to generate automatically.
● Expensive to generate manually.
● So, we mine Simple Wikipedia Edit Histories.

Simple Wikipedia Edit Histories
● Simple Wikipedia is:
– An online encyclopedia.
– Written in simplified English.
– Collaboratively edited.
– Available to download in XML format.
● Changes to articles are recorded in edit histories.
● Some changes are simplifications.

Simple Wikipedia Edit Histories
● Advantages:
– Fully automated
– High throughput
– Cost-effective
● Disadvantages:
– Content quality
– Sparsity of simplifications
– Data exhaustion

Mining – Extract Likely Candidates
● There are 2 stages to the mining process.
● Stage 1:
– 2 adjacent revisions are selected.
– A similarity score (TF-IDF) is calculated at sentence level.
– High scoring pairs passed on.
– All other pairs discarded.
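Stage 1 can be sketched roughly as follows. This is a minimal illustration, not the pipeline used to build the corpus: the whitespace tokenizer, the smoothed IDF formula, the cosine measure, the 0.5 threshold, and the choice to skip identical sentences are all assumptions.

```python
import math
from collections import Counter

def tokenize(sentence):
    # Naive whitespace tokenizer (an assumption; punctuation stays attached).
    return sentence.lower().split()

def tfidf_vectors(sentences):
    """Build one TF-IDF vector (as a dict) per sentence."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF so terms shared by every sentence keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def candidate_pairs(old_sentences, new_sentences, threshold=0.5):
    """Pair sentences from two adjacent revisions that are similar but not identical."""
    vecs = tfidf_vectors(old_sentences + new_sentences)
    old_vecs = vecs[:len(old_sentences)]
    new_vecs = vecs[len(old_sentences):]
    pairs = []
    for i, u in enumerate(old_vecs):
        for j, v in enumerate(new_vecs):
            score = cosine(u, v)
            # Unchanged sentences carry no edit, so they are skipped.
            if score >= threshold and old_sentences[i] != new_sentences[j]:
                pairs.append((old_sentences[i], new_sentences[j], score))
    return pairs

old_rev = ["He profoundly changed.", "It was a calm evening."]
new_rev = ["He deeply changed.", "It was a calm evening."]
pairs = candidate_pairs(old_rev, new_rev)
for old, new, score in pairs:
    print(old, "->", new)
```

Only the pair that was edited survives: the unchanged sentence is skipped and unrelated sentences share no terms, so they score zero.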

Mining – Validate Candidates
● Stage 2: A series of checks
– One word difference.
– Real words. (not: spam / vandalism / nonsense)
– Different stems.
– Synonyms.
– Simplifying.
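The first four checks can be sketched as a chain of filters. Everything named here is hypothetical: `VOCAB` and `SYNONYMS` are toy stand-ins for a real dictionary and synonym resource, and `crude_stem` is a deliberately simplistic substitute for a proper stemmer. The final "simplifying" check is left out because the slides do not say how it is decided.

```python
# Toy resources (assumptions for illustration only).
VOCAB = {"cool", "cooler", "long", "calm", "profoundly", "deeply"}
SYNONYMS = {frozenset(("calm", "cool")), frozenset(("profoundly", "deeply"))}

def single_word_difference(old, new):
    """Return (old_word, new_word) if the sentences differ in exactly one token."""
    a, b = old.split(), new.split()
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None

def crude_stem(word):
    # Strips a few common suffixes; a real system would use a proper stemmer.
    for suffix in ("ing", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def validate(old, new):
    """Run the checks in order and report the first one that fails."""
    pair = single_word_difference(old, new)
    if pair is None:
        return "rejected: not a one-word difference"
    a, b = (w.lower() for w in pair)
    if a not in VOCAB or b not in VOCAB:
        return "rejected: not real words"
    if crude_stem(a) == crude_stem(b):
        return "rejected: same stem"
    if frozenset((a, b)) not in SYNONYMS:
        return "rejected: not synonyms"
    return "passed"

print(validate("It was a cuol evening.", "It was a cool evening."))
print(validate("It was a cooler evening.", "It was a cool evening."))
print(validate("It was a long evening.", "It was a cool evening."))
print(validate("He profoundly changed", "He deeply changed"))
```

Ordering the checks from cheapest to most expensive means most noisy edits are rejected before any synonym lookup is needed.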

Analysis
● Six Annotators
● Each given a 70-instance sample:
– 50 examples from the corpus (different for each).
– 20 common examples as a validation set.
● 2 annotators ruled out by validation set.
● Final corpus accuracy: 97.5%.

Experiments
● Several experiments performed so far.
● Presented at ACL Student Research Workshop.
● 3 techniques for identification were compared.
● Sophisticated strategies gave little or no improvement over a baseline.

Summary
● Identifying Complex Words is important.
● The CW Corpus lets us evaluate methods.
● Preliminary results show little improvement over a baseline.

References
● Corpus: http://tinyurl.com/cwcorpus
● S. Devlin and J. Tait. The use of a psycholinguistic database in the simplification of text for aphasic readers. Linguistic Databases, p 161–173, 1998.
● M. Yatskar, B. Pang, C. Danescu-Niculescu-Mizil, and L. Lee. For the sake of simplicity: unsupervised extraction of lexical simplifications from Wikipedia. In HLT ’10 NAACL, p 365–368, Stroudsburg, PA, USA, 2010.

Any Questions?
● Corpus: http://tinyurl.com/cwcorpus

Annotator Agreement

Annotator Index    Kappa    Sample Accuracy
1                  1        98%
2                  1        96%
3                  0.4      70%
4                  1        100%
5                  0.6      84%
6                  1        96%


Example Discarded Pairs
● It was a _____ evening.
● Nonsense Words (spelling correction)
– Cuol → Cool
● Different Stems (sense correction)
– Cooler → Cool
● Synonymy (meaning change)
– Long → Cool
● Simplifying
– Calm → Cool
