SlideShare ist ein Scribd-Unternehmen logo
1 von 1
Downloaden Sie, um offline zu lesen
The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings
Tomoya Mizumoto†, Yuta Hayashibe†, Mamoru Komachi†, Masaaki Nagata‡, Yuji Matsumoto†
†Nara Institute of Science and Technology, ‡NTT Communication Science Laboratories
Background
Discussion
System and Corpus
- A lot of previous works deal with one or a few restricted types of learners’ error
- It was not until recently that large scale learner corpora became widely
available for error correction
- little is known about the effect of learner corpus size in ESL error correction
1 http:www.gsk.or.jp/catalog/GSK2012-A/catalog_e.html 2 http://lang-8.com
- Use phrase-based SMT to conduct unrestricted error correction
- Several studies about grammatical error correction using phase-based SMT
- Use data from a language learning SNS Lang-82
- Language learners post their writing on the site to be corrected by native speaker
- Obtain pairs of learner’s and corrected sentence in large scale
- Crawled blog entries found in Lang-8 as of December 2010
- Used writings written by Japanese ESL learners (509,116 sentence pairs)
- Filtered noisy sentences -> 391,699 sentence pairs
Types % Types %
article 19.23 Lexical choice of noun 7.04
noun number 13.88 Lexical choice of verb 6.90
preposition 13.56 pronoun 6.62
tense 8.77 agreement 5.25
Table1.
Distribution of errors on KJ corpus1
preposition [Rozovskaya+], verb form [Lee+], tense [Tajiri+], spelling/article/preposition/word form [Park+]
Main Contribution
- Classify errors into two types
1. Get better correction by increasing corpus size (article, preposition, lexical choice)
2. Have little relationship with corpus size. (noun number, tense, pronoun, agreement)
- To see the effect of corpus size, we compare systems
- Using 1. Lang-8 with different size and 2. KJ (also used test data, 5-fold CV)
- Evaluation metrics (F-measure, precision, recall)
- P and R for each type of errors are calculated from tp, fp, fn based on error tags
learner:He talked to me his life of Kyoto, and he took me Kyoto.
correct:He talked to me about his life in Kyoto, and he took me to Kyoto.
system:He talked me his life on Kyoto, and he took me to Kyoto.
: I like a chocolate very much.
: I like chocolate very much.
: My cycle was injured, but I wasn’t.
: My bicycle was damaged, but I wasn’t.
learner
correct
: I read various type of books.
: I read various types of books.
learner
correct
: There is a big snoopy dools in my room.
: There is a big snoopy doll in my room.
learner
correct
: If I’ll live in Saitama, I must have …
: If I live in Saitama, I must have …
learner
correct
: The weather is very sunny, so we were …
: The weather was very sunny, so we were …
learner
correct
: Flowers is very beautiful.
: Flowers are very beautiful.
learner
correct
: I think, reading comics are not “reading”
: I think, reading comics is not “reading”
learner
correct
e.g. article e.g. lexical choice of noun
e.g. noun number
e.g. tense
e.g. agreement
*Spelling errors are excluded from
target of annotation in KJ corpus
Experiment and Results
learner
correct
0
0.1
0.2
0.3
0.4
0.5
0.6
KJ
2K
10K
20K
100K
200K
300K
390K
noun number tense
pronoun agreement
0
0.1
0.2
0.3
0.4
0.5
0.6
KJ
2K
10K
20K
100K
200K
300K
390K
article preposition
lexical choice of noun lexical choice of verb
[Brockett+ 06], [Mizumoto+ 11], [Ehsan+ 12]
Red circle indicate that the
difference of result using
Lang-8 and KJ is
statistically significant
(p<0.01)
Summary
- Show the effect of learner corpus size on the phrase-based SMT and its
advantages and disadvantage in grammatical error correction
- Increasing the size of learner corpus;
- improvement: article, preposition, lexical choice
- little improvement: noun number, agreement, tense
1. the first attempt to use large scale learner corpus to correct all types error
2. show the effect of learner corpus size on the phrase-based SMT
and its advantages and disadvantage
- Word “dools” is displaced from “a big”
- A proper noun “snoopy” is inserted
between “a big” and “dools”
* “dools” is also a spelling error
- System fails to find tense agreement in
the complex sentence
- Pattern is unseen
- no way to capture the relation between
subject “reading” and “are”
To solve, needs to conduct generalization using POS or consider dependency relation
To solve, involves global context [Tajiri+ 12]
To solve, needs to get the subject-verb relation considering a dependency structure
P=1/2,
R=1/2

Weitere ähnliche Inhalte

Ähnlich wie The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings @COLING2012

AIS EAL workshop notes Aug2011
AIS EAL workshop notes Aug2011AIS EAL workshop notes Aug2011
AIS EAL workshop notes Aug2011aissaigon
 
I would like to discuss my experience developing and implementing .docx
I would like to discuss my experience developing and implementing .docxI would like to discuss my experience developing and implementing .docx
I would like to discuss my experience developing and implementing .docxflorriezhamphrey3065
 
I would like to discuss my experience developing and implementing .docx
I would like to discuss my experience developing and implementing .docxI would like to discuss my experience developing and implementing .docx
I would like to discuss my experience developing and implementing .docxsheronlewthwaite
 
Combining General and Genre-Specific Approaches to L2 Writing Instruction
Combining General and Genre-Specific Approaches to L2 Writing InstructionCombining General and Genre-Specific Approaches to L2 Writing Instruction
Combining General and Genre-Specific Approaches to L2 Writing Instructionguest05424
 
Developing students' reading skills in science education
Developing students' reading skills in science educationDeveloping students' reading skills in science education
Developing students' reading skills in science educationStefaan Vande Walle
 
language and Content objectives
language and Content objectiveslanguage and Content objectives
language and Content objectivesazschnee
 
A Machine Learning Approach To The Effects Of Writing Task Prompts
A Machine Learning Approach To The Effects Of Writing Task PromptsA Machine Learning Approach To The Effects Of Writing Task Prompts
A Machine Learning Approach To The Effects Of Writing Task PromptsKayla Jones
 
Monitoring and feedback in the process of language acquisition analysis and ...
Monitoring and feedback in the process of language acquisition  analysis and ...Monitoring and feedback in the process of language acquisition  analysis and ...
Monitoring and feedback in the process of language acquisition analysis and ...ijnlc
 
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word EmbeddingIRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word EmbeddingIRJET Journal
 
NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...
NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...
NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...IJECEIAES
 
Mc Call Presentation Lancaster.07
Mc Call Presentation Lancaster.07Mc Call Presentation Lancaster.07
Mc Call Presentation Lancaster.07Mats Deutschmann
 
Tutorial - Speech Synthesis System
Tutorial - Speech Synthesis SystemTutorial - Speech Synthesis System
Tutorial - Speech Synthesis SystemIJERA Editor
 
Sentence level sentiment polarity calculation for customer reviews by conside...
Sentence level sentiment polarity calculation for customer reviews by conside...Sentence level sentiment polarity calculation for customer reviews by conside...
Sentence level sentiment polarity calculation for customer reviews by conside...eSAT Publishing House
 
1 PHY 241 Fall 2018 PHY 241 Lab 6- Law of Conservation o.docx
1 PHY 241 Fall 2018 PHY 241 Lab 6- Law of Conservation o.docx1 PHY 241 Fall 2018 PHY 241 Lab 6- Law of Conservation o.docx
1 PHY 241 Fall 2018 PHY 241 Lab 6- Law of Conservation o.docxoswald1horne84988
 
Answer Key Booklet Contents
Answer Key Booklet ContentsAnswer Key Booklet Contents
Answer Key Booklet ContentsAndrew Parish
 
An exploratory research on grammar checking of Bangla sentences using statist...
An exploratory research on grammar checking of Bangla sentences using statist...An exploratory research on grammar checking of Bangla sentences using statist...
An exploratory research on grammar checking of Bangla sentences using statist...IJECEIAES
 
Neural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsNeural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsTae Hwan Jung
 
Guide bem-octobre-2017-exclusively
Guide bem-octobre-2017-exclusivelyGuide bem-octobre-2017-exclusively
Guide bem-octobre-2017-exclusivelyMr Bounab Samir
 

Ähnlich wie The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings @COLING2012 (20)

AIS EAL workshop notes Aug2011
AIS EAL workshop notes Aug2011AIS EAL workshop notes Aug2011
AIS EAL workshop notes Aug2011
 
I would like to discuss my experience developing and implementing .docx
I would like to discuss my experience developing and implementing .docxI would like to discuss my experience developing and implementing .docx
I would like to discuss my experience developing and implementing .docx
 
I would like to discuss my experience developing and implementing .docx
I would like to discuss my experience developing and implementing .docxI would like to discuss my experience developing and implementing .docx
I would like to discuss my experience developing and implementing .docx
 
Combining General and Genre-Specific Approaches to L2 Writing Instruction
Combining General and Genre-Specific Approaches to L2 Writing InstructionCombining General and Genre-Specific Approaches to L2 Writing Instruction
Combining General and Genre-Specific Approaches to L2 Writing Instruction
 
Developing students' reading skills in science education
Developing students' reading skills in science educationDeveloping students' reading skills in science education
Developing students' reading skills in science education
 
language and Content objectives
language and Content objectiveslanguage and Content objectives
language and Content objectives
 
A Machine Learning Approach To The Effects Of Writing Task Prompts
A Machine Learning Approach To The Effects Of Writing Task PromptsA Machine Learning Approach To The Effects Of Writing Task Prompts
A Machine Learning Approach To The Effects Of Writing Task Prompts
 
Monitoring and feedback in the process of language acquisition analysis and ...
Monitoring and feedback in the process of language acquisition  analysis and ...Monitoring and feedback in the process of language acquisition  analysis and ...
Monitoring and feedback in the process of language acquisition analysis and ...
 
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word EmbeddingIRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
 
NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...
NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...
NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...
 
Mc Call Presentation Lancaster.07
Mc Call Presentation Lancaster.07Mc Call Presentation Lancaster.07
Mc Call Presentation Lancaster.07
 
Examining reading
Examining readingExamining reading
Examining reading
 
Tutorial - Speech Synthesis System
Tutorial - Speech Synthesis SystemTutorial - Speech Synthesis System
Tutorial - Speech Synthesis System
 
Sentence level sentiment polarity calculation for customer reviews by conside...
Sentence level sentiment polarity calculation for customer reviews by conside...Sentence level sentiment polarity calculation for customer reviews by conside...
Sentence level sentiment polarity calculation for customer reviews by conside...
 
1 PHY 241 Fall 2018 PHY 241 Lab 6- Law of Conservation o.docx
1 PHY 241 Fall 2018 PHY 241 Lab 6- Law of Conservation o.docx1 PHY 241 Fall 2018 PHY 241 Lab 6- Law of Conservation o.docx
1 PHY 241 Fall 2018 PHY 241 Lab 6- Law of Conservation o.docx
 
Answer Key Booklet Contents
Answer Key Booklet ContentsAnswer Key Booklet Contents
Answer Key Booklet Contents
 
An exploratory research on grammar checking of Bangla sentences using statist...
An exploratory research on grammar checking of Bangla sentences using statist...An exploratory research on grammar checking of Bangla sentences using statist...
An exploratory research on grammar checking of Bangla sentences using statist...
 
Neural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsNeural machine translation of rare words with subword units
Neural machine translation of rare words with subword units
 
Vědecké publikování v anglickém jazyce
Vědecké publikování v anglickém jazyceVědecké publikování v anglickém jazyce
Vědecké publikování v anglickém jazyce
 
Guide bem-octobre-2017-exclusively
Guide bem-octobre-2017-exclusivelyGuide bem-octobre-2017-exclusively
Guide bem-octobre-2017-exclusively
 

Kürzlich hochgeladen

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

Kürzlich hochgeladen (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings @COLING2012

  • 1. The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings Tomoya Mizumoto†, Yuta Hayashibe†, Mamoru Komachi†, Masaaki Nagata‡, Yuji Matsumoto† †Nara Institute of Science and Technology, ‡NTT Communication Science Laboratories Background Discussion System and Corpus - A lot of previous works deal with one or a few restricted types of learners’ error - It was not until recently that large scale learner corpora became widely available for error correction - little is known about the effect of learner corpus size in ESL error correction 1 http:www.gsk.or.jp/catalog/GSK2012-A/catalog_e.html 2 http://lang-8.com - Use phrase-based SMT to conduct unrestricted error correction - Several studies about grammatical error correction using phase-based SMT - Use data from a language learning SNS Lang-82 - Language learners post their writing on the site to be corrected by native speaker - Obtain pairs of learner’s and corrected sentence in large scale - Crawled blog entries found in Lang-8 as of December 2010 - Used writings written by Japanese ESL learners (509,116 sentence pairs) - Filtered noisy sentences -> 391,699 sentence pairs Types % Types % article 19.23 Lexical choice of noun 7.04 noun number 13.88 Lexical choice of verb 6.90 preposition 13.56 pronoun 6.62 tense 8.77 agreement 5.25 Table1. Distribution of errors on KJ corpus1 preposition [Rozovskaya+], verb form [Lee+], tense [Tajiri+], spelling/article/preposition/word form [Park+] Main Contribution - Classify errors into two types 1. Get better correction by increasing corpus size (article, preposition, lexical choice) 2. Have little relationship with corpus size. (noun number, tense, pronoun, agreement) - To see the effect of corpus size, we compare systems - Using 1. Lang-8 with different size and 2. KJ (also used test data, 5-fold CV) - Evaluation metrics (F-measure, precision, recall) - P and R for each type of errors are calculated from tp, fp, fn based on error tags learner:He talked to me his life of Kyoto, and he took me Kyoto. correct:He talked to me about his life in Kyoto, and he took me to Kyoto. system:He talked me his life on Kyoto, and he took me to Kyoto. : I like a chocolate very much. : I like chocolate very much. : My cycle was injured, but I wasn’t. : My bicycle was damaged, but I wasn’t. learner correct : I read various type of books. : I read various types of books. learner correct : There is a big snoopy dools in my room. : There is a big snoopy doll in my room. learner correct : If I’ll live in Saitama, I must have … : If I live in Saitama, I must have … learner correct : The weather is very sunny, so we were … : The weather was very sunny, so we were … learner correct : Flowers is very beautiful. : Flowers are very beautiful. learner correct : I think, reading comics are not “reading” : I think, reading comics is not “reading” learner correct e.g. article e.g. lexical choice of noun e.g. noun number e.g. tense e.g. agreement *Spelling errors are excluded from target of annotation in KJ corpus Experiment and Results learner correct 0 0.1 0.2 0.3 0.4 0.5 0.6 KJ 2K 10K 20K 100K 200K 300K 390K noun number tense pronoun agreement 0 0.1 0.2 0.3 0.4 0.5 0.6 KJ 2K 10K 20K 100K 200K 300K 390K article preposition lexical choice of noun lexical choice of verb [Brockett+ 06], [Mizumoto+ 11], [Ehsan+ 12] Red circle indicate that the difference of result using Lang-8 and KJ is statistically significant (p<0.01) Summary - Show the effect of learner corpus size on the phrase-based SMT and its advantages and disadvantage in grammatical error correction - Increasing the size of learner corpus; - improvement: article, preposition, lexical choice - little improvement: noun number, agreement, tense 1. the first attempt to use large scale learner corpus to correct all types error 2. show the effect of learner corpus size on the phrase-based SMT and its advantages and disadvantage - Word “dools” is displaced from “a big” - A proper noun “snoopy” is inserted between “a big” and “dools” * “dools” is also a spelling error - System fails to find tense agreement in the complex sentence - Pattern is unseen - no way to capture the relation between subject “reading” and “are” To solve, needs to conduct generalization using POS or consider dependency relation To solve, involves global context [Tajiri+ 12] To solve, needs to get the subject-verb relation considering a dependency structure P=1/2, R=1/2