Assessing Writing 19 (2014) 51–65
The effects of computer-generated feedback on the quality of writing
Marie Stevenson∗, Aek Phakiti
University of Sydney, Australia
Article info
Article history:
Available online 17 December 2013
Keywords:
Automated writing evaluation (AWE)
Computer-generated feedback
Effects on writing quality
Critical review
Abstract
This study provides a critical review of research into the effects of computer-generated feedback, known as
automated writing evaluation (AWE), on the quality of students’ writing. An initial research survey revealed
that only a relatively small number of studies have been carried out and that most of these studies have
examined the effects of AWE feedback on measures of written production such as scores and error frequencies.
The critical review of the findings for written production measures suggested that there is modest evidence
that AWE feedback has a positive effect on the quality of the texts that students produce using AWE, and that
as yet there is little evidence that the effects of AWE transfer to more general improvements in writing
proficiency. Paucity of research, the mixed nature of research findings, heterogeneity of participants, contexts
and designs, and methodological issues in some of the existing research were identified as factors that limit
our ability to draw firm conclusions concerning the effectiveness of AWE feedback. The study provides
recommendations for further AWE research, and in particular calls for more research that places emphasis on
how AWE can be integrated effectively in the classroom to support writing instruction.
© 2013 Elsevier Ltd. All rights reserved.

∗ Corresponding author.
E-mail addresses: marie.stevenson@sydney.edu.au (M. Stevenson), aek.phakiti@sydney.edu.au (A. Phakiti).
1075-2935/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.asw.2013.11.007
1. Introduction
This study provides a critical review of literature on the pedagogical effectiveness of computer-
based educational technology for providing students with feedback on their writing that is commonly
known as Automated Writing Evaluation (AWE).1 AWE software provides computer-generated feed-
back on the quality of written texts. A central component of AWE software is a scoring engine that
generates automated scores based on techniques such as artificial intelligence, natural language
processing and latent semantic analysis (See Dikli, 2006; Philips, 2007; Shermis & Burstein, 2003;
Yang, Buckendahl, Juszkiewicz, & Bhola, 2002). AWE software that is used for pedagogical purposes
also provides written feedback in the form of general comments, specific comments and/or corrections.
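As a rough illustration of how latent semantic analysis can be used to compare texts by content rather than exact wording, the minimal Python sketch below scores an invented student sentence against two invented reference texts using standard scikit-learn components. It is a hypothetical toy example, not the implementation of any of the scoring engines or programs discussed in this article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Invented texts for illustration only.
reference_texts = [
    "Plants convert sunlight into chemical energy through photosynthesis.",
    "Photosynthesis uses light, water and carbon dioxide to produce glucose.",
]
student_text = "Leaves use light and water to make sugar for the plant."

# Build a term-document matrix and project it into a low-dimensional
# "semantic space"; texts with similar content end up close together
# even when their surface wording differs.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(reference_texts + [student_text])
semantic_space = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# A crude content score: the student text's average cosine similarity
# to the reference texts in the reduced space.
score = cosine_similarity(semantic_space[-1:], semantic_space[:-1]).mean()
print(f"Content similarity score: {score:.2f}")
```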
Originally, AWE was primarily used in high-stakes testing situations to generate summative scores
to be used for assessment purposes. Widely used, commercially available scoring engines are Project
Essay GraderTM (PEG), e-rater®, Intelligent Essay AssessorTM (IEA), and IntelliMetricTM. In recent
years, the use of AWE for the provision of formative feedback in the writing classroom has steadily
increased, particularly in classrooms in the United States. AWE programs are currently being used
in many elementary, high school, college and university classrooms with a range of writers from
diverse backgrounds. Examples of commercially available AWE programs designed for classroom use
are: Criterion (Educational Testing Service); MY Access! (Vantage Learning); Write to Learn and Summary
Street (Pearson Knowledge Technologies); and Writing Roadmap (McGraw Hill). These programs
sometimes incorporate the same scoring engine as used in summative programs. For example, Crite-
rion incorporates the e-rater scoring engine and MY Access! incorporates the IntellimetricTM scoring
engine.
Common to all AWE programs designed for classroom use is that they provide writers with multiple
drafting opportunities, and upon receiving feedback writers can choose whether or not to use this
feedback to revise their texts. AWE programs vary in the kinds of feedback they provide writers. Some
provide feedback on both global writing skills and language use (e.g., Criterion, MY Access!), whereas
others focus on language use (e.g., QBL) and some claim to focus primarily on content knowledge (e.g.,
Write to Learn and Summary Street). Some programs incorporate other tools such as model essays,
scoring rubrics, graphic organizers, and dictionaries and thesauri.
Like many other forms of educational technology, the use of AWE in the classroom has been the
subject of controversy, with scholars taking divergent stances. On the one hand, AWE has been hailed
as a means of liberating instructors, freeing them up to devote valuable time to aspects of writing
instruction other than marking assignments (e.g., Burstein, Chodorow, & Leacock, 2004; Herrington
& Moran, 2001; Hyland & Hyland, 2006; Philips, 2007). It has been seen as impacting positively on
the quality of students’ writing, due to the immediacy of its ‘on-line’ feedback (Dikli, 2006), and the
multiple practice and revision opportunities it provides (Warschauer & Ware, 2006). It has also been
claimed to have positive effects on student autonomy (Chen & Cheng, 2008).
On the other hand, the notion that computers are capable of providing effective writing feedback has
aroused considerable suspicion, perhaps fueled by the fearful specter of a world in which humans are
replaced by machines. Criticisms have been made concerning the capacity of AWE to provide accurate
and meaningful scores (e.g., Anson, 2006; Freitag Ericsson, 2006). There is a common perception that
computers are not capable of scoring human texts, as they do not possess human inferencing skills and
background knowledge (Anson, 2006). Other criticisms relate to the effects that AWE has on students’
writing. AWE has been accused of reflecting and promoting a primarily formalist approach to writing,
in which writing is viewed as simply being “mastery of a set of subskills” (Hyland & Hyland, 2006,
p. 95). Comments generated by AWE have been said to place too much emphasis on surface features
of writing, such as grammatical correctness (Hyland & Hyland, 2006) and the effects of writing for a
non-human audience have been decried. There is also fear that using AWE feedback may be more of an
exercise in developing test-taking strategies than in developing writing skills, with students writing
to the test by consciously or unconsciously adjusting their writing to meet the criteria of the software
(Patterson, 2005).
Positive and negative claims regarding the effects of AWE on students’ writing are not always based
on empirical evidence, and at times appear to reflect authors’ own ‘techno-positivistic’ or ‘technopho-
bic’ stances toward technology in the writing classroom. Moreover, quite a lot of the research that has
been carried out is from authors who have been involved in developing a particular AWE program
or who are affiliated with organizations that have developed these programs, and so could contain a bias
toward showing AWE in a positive light. Consequently, there is a lack of clarity concerning the current
state of evidence for the effects on the quality of students’ writing of AWE programs designed for
teaching and learning purposes.

1. Other terms found in the literature are automated essay evaluation (AEE) (See Shermis & Burstein, 2013) and writing evaluation technology.
However, it is important to be aware that over the past decades there has also been controversy
about the effects of teacher feedback on writing. Perhaps the strongest opponent of classroom writing
feedback was Truscott (1996), who claimed that feedback on grammar should be abandoned, as it
ignored deeper learning processes, only led to pseudo-learning and had a negative effect on the quality
of students’ writing. While most scholars have taken less extreme positions, in a review of issues
relating to feedback in the classroom, Hyland and Hyland (2006) concluded that there was surprisingly
little consensus about the kinds of feedback that are effective and in particular about the long term
effects of feedback on writing development. However, some research synthesis evidence exists for the
effectiveness of teacher feedback. In a recent meta-analytic study, Biber, Nekrasova, and Horn (2011)
found that, when compared to no feedback, teacher feedback was associated with gains in writing
development for both first and second language writers. They found that a focus on content and
language use was more effective than a focus on form only, especially for second language
writers. They also found that comments were more effective than error correction, even for improving
grammatical accuracy. It is therefore timely to evaluate whether there is evidence that computer-
generated feedback is also associated with improvements in writing.
To date, the thrust of AWE research has been on validation through the examination of the psy-
chometric properties of AWE scores by, for example, calculating the degree of correlation between
computer-generated scores and scores given by human raters. Studies have frequently found high
correlations between AWE scores and human scores, and these results have been taken as providing
evidence that AWE scores provide a psychometrically valid measure of students’ writing. (See two
volumes edited by Shermis and Burstein (2003, 2013) for detailed results and in-depth discussion of
the reliability and validity of specific AWE systems). Such studies, however, do not inform us about
whether AWE is effective as a classroom tool to actually improve students’ writing. As Warschauer
and Ware (2006) pointed out, while evidence of psychometric reliability and validity is a necessary
pre-requisite, it is not sufficient for understanding whether AWE ‘works’ in the sense of contributing
to positive outcomes for student learning. Even the recently published ‘Handbook of Automated Essay
Evaluation’ (Shermis & Burstein, 2013), although it pays some attention to AWE as a teaching and
learning tool, still has a strong psychometric and assessment focus.
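For readers unfamiliar with this kind of validation work, the short sketch below (in Python, with invented scores) shows the sort of human-machine correlation that such studies typically report; as argued above, a high correlation of this kind does not by itself show that AWE improves student writing.

```python
from scipy.stats import pearsonr

# Hypothetical holistic scores (1-5 scale) for ten essays; these values are
# invented for illustration and are not taken from any study reviewed here.
human_scores   = [3, 4, 2, 5, 4, 3, 5, 2, 4, 3]
machine_scores = [3, 4, 3, 5, 4, 2, 5, 2, 4, 4]

# A high Pearson correlation is usually reported as evidence of scoring
# validity, but it says nothing about instructional effectiveness.
r, p = pearsonr(human_scores, machine_scores)
print(f"Human-machine agreement: r = {r:.2f}, p = {p:.3f}")
```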
Although a number of individual studies have examined the effects of AWE feedback in the
classroom, no comprehensive review of the literature exists that examines whether AWE feedback
improves the quality of students’ writing. Warschauer and Ware (2006) provided a thought-provoking
discussion of some existing research on AWE in the classroom and used this to make recommenda-
tions for future AWE research. However, they only provided a limited review that did not include all of
the then available research and did not provide an overview of the evidence for the effects of AWE on
students’ writing. Moreover, since their paper was written a number of studies have been published
in this area.
2. The current study
The current study provides an evaluation of the available evidence for the effects of AWE feedback
in the writing classroom in terms of written production. The study focuses on research involving AWE
systems specifically designed as tools for providing formative evaluation in the writing classroom,
rather than AWE systems designed to provide summative assessment in testing situations. The purpose
of formative evaluation is to provide writers with individual feedback that can form the basis for further
learning (Philips, 2007). In formative evaluation, there is a need to inform students not only about their
level of achievement, but also about their specific strengths and weaknesses. Formative evaluation can
be said to involve assessment for learning, rather than assessment of learning (Taylor, 2005). In this
study, feedback is viewed as encompassing both numeric feedback (i.e., scores and ratings) and written
feedback (i.e., global or specific comments on the quality of the text and/or identification of specific
problems in the actual text).
The study focuses on the effects of AWE on written production, because the capability to improve
the quality of students’ texts is central to claims made about the effectiveness of AWE feedback, and
because, likely as a consequence of this, the bulk of AWE pedagogical research focuses on written pro-
duction outcomes. The study includes AWE research on students from diverse backgrounds, in diverse
teaching contexts, and receiving diverse kinds of feedback from diverse AWE programs. The scope of
the research included is broad due to the relatively small number of existing studies and the hetero-
geneity of these studies. The study does not aim to make comparisons or draw conclusions about the
relative effects of AWE feedback on student writing for specific populations, contexts, feedback types
or programs. Instead, it aims to critically evaluate the effects of AWE feedback on written production
by identifying general patterns and trends, and identifying issues and factors that may impact on these
effects.
The study is divided into two stages: a research survey and a critical review. The objective of the
research survey is to determine the maturity of the research domain, and to provide a characterization
of the existing research that can be drawn on in the critical review. The objective of the critical review,
which is the central stage, is to identify overall patterns in the research findings and to evaluate and
interpret these findings, taking account of relevant issues and factors.
3. Method
3.1. The literature search
A comprehensive and systematic literature search was conducted to identify relevant primary
sources for inclusion in the research survey and critical review. Both published research (i.e., journal
articles, book chapters and reports) and unpublished research (i.e., theses and conference papers)
were identified.
The following means of identifying research were used:
a) Search engines: Google Scholar, Google.
b) Databases: ERIC, MLA, PsychInfo, SSCI, Ovid, PubPsych, Linguistics and Language Behavior
Abstracts (LLBA), Dissertation Abstracts International, Academic Search Elite, Expanded Academic,
ProQuest Dissertation and Theses Full-text, and Australian Education Index.
c) Search terms used: automated writing evaluation, automated writing feedback, computer-
generated feedback, computer feedback, automated essay scoring, automated evaluation,
electronic feedback, and program names (e.g., Criterion, Summary Street, Intelligent Essay Assessor,
Write to Learn, MY Access!).
d) Websites: ETS website (ets.org) (ETS Research Reports, TOEFL iBT Insight series, TOEFL iBT research
series, TOEFL Research Reports); AWE software websites.
e) Journals from 1990 to 2011: CAELL Journal; CALICO Journal; College English; English Journal; Com-
puter Assisted Language Learning; Computers and Composition; Educational Technology Research
and Development; English for Specific Purposes; IEEE Intelligent Systems; Journal of Basic Writ-
ing; Journal of Computer-Based Instruction; Journal of Educational Computing Research; Journal
of Research on Technology in Education; Journal of Second Language Writing; Journal of Technology,
Learning and Assessment; Language Learning and Technology; Language
Learning; Language Teaching Research; ReCALL; System; TESL-EJ.
f) Reference lists of already identified publications. In particular, the Ericsson and Haswell (2006)
bibliography.
To be included, a primary source had to focus on empirical research on the use of AWE feedback
generated by one or more commercially or non-commercially available programs for the formative
evaluation of texts in the writing classroom. The program reported on needed to provide text-
specific feedback. Studies were excluded that reported on programs that provided generic writing
guidelines (e.g., The Writing Partner: Zellermayer, Salomon, Globerson, & Givon, 1991; Essay Assist:
Chandrasegaran, Ellis, & Poedjosoedarmo, 2005). Studies that reported results already reported else-
where were also excluded. Where the same results were reported more than once, published studies
were chosen over unpublished ones or, if both were published, the first publication was chosen. This
led to the exclusion of Grimes (2005) and Kintsch et al. (2000).
Based on the above criteria, 33 primary sources were identified for inclusion in the research survey
(See Appendix A).
3.2. Coding of research survey
A coding scheme of study descriptors was developed for the research survey. The unit of coding
was the study. A study was defined as consisting of “a set of data collected under a single research plan
from a designated sample of respondents” (Lipsey & Wilson, 2001, p. 76). As one of the publications,
Elliot and Mikulas (2004), included four studies with different samples, this led to a total of 36 studies
being identified.
In order to obtain an overview of the scope of the research domain, the studies were first classified
in terms of constructs of effectiveness: Product, Process and Perceptions. Lai (2010) defined effec-
tiveness of AWE feedback in terms of three dimensions: (1) the effects on written production (e.g.,
quality scores, error frequencies and rates, lexical measures and text length); (2) the effects on writing
processes (e.g., rates and types of revisions, editing time, time on task, and rates of text production);
and (3) perceived usefulness. In our study, combinations of these constructs were possible, as some
studies included more than one construct.
Subsequently, as the focus of the study is writing outcomes, only studies that included Product
measurements were coded in terms of Substantive descriptors and Methodological descriptors (See
Lipsey & Wilson, 2001). Substantive descriptors relate to substantive aspects of the study, such as the
characteristics of the intervention and the research context. Methodological descriptors relate to the
methods and procedures used in the study. Table 1 lists the coding categories for both kinds of descrip-
tors and the coding options within each category. In the methodological descriptors, ‘Control group’
refers to whether the study included a control condition and whether this involved comparing AWE
feedback with a no feedback condition or with a teacher feedback condition. ‘Text’ refers to whether
outcomes were measured using texts for which AWE feedback had been received or other texts, such
as writing assessment tasks. ‘Outcome measure’ refers to the measure(s) of written production that
were included in the study.
The coding categories and options were developed inductively by reading through the sample
studies. Developing the coding scheme was a cyclical process, and each study was coded a number
of times, until the coding scheme was sufficiently refined. These coding cycles were carried out by
the first researcher.

Table 1
Research survey coding scheme.
Substantive descriptors:
Publication type: ISI-listed journal; non-ISI listed journal; book chapter; thesis; report; unpublished paper
AWE program: open coding
Country: open coding
Educational context: elementary; high school; elementary/high school; university & college
Language background: L1; L1 & ESL; EFL/ESL only; unspecified
Methodological descriptors:
Design: between group; within group; between & within group; single group
Reporting: statistical testing; descriptive statistics; no statistics
Control group: no feedback; teacher feedback; no feedback & teacher feedback; different AWE conditions; no control group
Text: AWE texts; other texts; AWE texts & other texts
Outcome measure: scores; scores & other product measures; errors; citations

Table 2
Research survey: constructs.
Product: 17
Product & process: 4
Product & perceptions: 5
Product, process, & perceptions: 4
Perceptions: 5
Perceptions & process: 1
Total: 36

The reliability of the coding was checked through the coding of 12 studies (one
third of the data) by the second researcher. Rater reliability was calculated using Cohen’s kappa. For
the substantive descriptors the kappa values were all 1.00, except for language background, which
was .75. For the methodological descriptors the kappa values were .85 for Design, 1.00 for Reporting,
.85 for Control group, 1.00 for Text and .85 for Outcome measure. Any disagreements were resolved
through discussion.
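As a concrete illustration of the agreement statistic used in this reliability check, the sketch below computes Cohen's kappa for two coders over a handful of hypothetical Design codings; the labels are invented and do not reproduce the actual codings of the 12 double-coded studies.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical Design-descriptor codings by two coders for six studies.
coder_1 = ["between", "within", "between", "single", "between", "within"]
coder_2 = ["between", "within", "between", "between", "between", "within"]

# Cohen's kappa corrects raw percentage agreement for the agreement
# expected by chance; 1.00 indicates perfect agreement.
kappa = cohen_kappa_score(coder_1, coder_2)
print(f"Cohen's kappa for the Design descriptor: {kappa:.2f}")
```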
For the research survey, the frequencies of the coding categories were collated and this information
was used to describe the characteristics of the studies in the research sample. For the critical literature
review, the findings of the sample studies were critically discussed in relation to the characteristics
of the studies identified in the research survey and also in relation to strengths or weaknesses of
particular studies.
3.3. Research survey
Table 2 shows that the primary focus of AWE research has so far been on the effects of AWE on
written production. Thirty of the thirty-six studies include Product measures: 17 focus solely on Product,
and another 13 studies involve Product in combination with one or more of the other constructs.
The secondary focus has been on Perceptions, with five studies focusing solely on Perceptions, and
another 10 including Perceptions. No studies have focused solely on Process. In the remainder of the survey,
the thirty studies involving Product measurements are characterized.
Table 3 shows that, in terms of types of publication, relatively few of the studies have appeared in
ISI-listed journals or in books. A number of the studies are from non-ISI-listed journals, and a number
are unpublished papers from conferences or websites. Table 3 also shows that 10 AWE programs are
involved in the sample and that the majority of these have been developed by organizations that are
major players in the field of educational technology: Criterion from ETS, MY Access! from Vantage
Learning, IEA and Summary Street from Pearson Knowledge Analysis Technologies. Criterion is the
program that has been examined most frequently.
Table 3
Publication, program and feedback.
Publication (k): ISI-listed 7; non-ISI-listed 7; book chapter 1; thesis 5; report 1; unpublished paper 9.
Program (k): Criterion 11; MY Access! 5; Writing Roadmap 1; ETIPS 2; IEA 1; LSA semantic space 1; Summary Street 3; ECS 1; SAIF 1; QBL 4.
Feedback (k): content & language 20; content 5; language 4; citations 1.

Criterion, MY Access! and Writing Roadmap provide scores and feedback on both content and
language. However, one of the studies that examined Criterion (i.e., Chodorow, Gamon, & Tetreault,
2010) limited itself to examining feedback on article errors. Summary Street, IEA, LSA and ECS are all
based on a technique known as latent semantic analysis that purports to focus primarily on content
feedback. ETIPS provides feedback for pre-service teachers on tasks carried out in an on-line case-based
learning environment. SAIF provides feedback on the citations in a text. QBL provides comments on
language errors only. The table shows that most of the studies have involved programs that provide
both content and language feedback.
Table 4 shows that the majority of studies were carried out in classrooms in the United States, with
the remaining studies being carried out in Asian countries, with the exception of a single study carried
out in Egypt. University and college contexts were the most common, followed by high school contexts,
and then elementary contexts. Almost half the studies do not specify the language background of the
participants. Among the studies that did report the language backgrounds of the participants, only two
(i.e., Chodorow et al., 2010; Choi, 2010) investigated language background as a variable affecting the
effects of AWE feedback. Chodorow et al. (2010) compared the effects of Criterion
feedback on the article errors of native and non-native speakers, and Choi (2010) compared the effects
of Criterion feedback on written production measures of EFL students in Korea and ESL students in
the U.S.

Table 4
Country, context, language background and sample size.
Country (k): USA 21; Taiwan 4; USA & Korea 1; Japan 1; China 1; Hong Kong 1; Egypt 1.
Educational context (k): university & college 17; high school 8; elementary 3; elementary & high school 2.
Language background (k): L1 1; mixed 6; EFL 8; EFL & ESL 1; unspecified 14.
Sample size (k): <10 1; 11–50 4; 51–100 9; 101–200 5; >200 10; unspecified 1.
3.4. Methodological features
Table 5 shows that most of the studies involved statistical testing, and that between-group designs,
in which one or more AWE conditions were compared with one or more control conditions, are the
most common. There were also a number of within-group comparisons in which the same
group of students was compared across drafts and/or texts. One study (i.e., Scharber, Dexter, & Riedel,
2008) used a single group design in which students’ ETIPS scores were correlated with the number of
drafts they submitted.
Table 5 also shows that the most common control group for the between group comparisons
involved a condition in which students received no feedback. In some cases, students in this con-
dition wrote the same texts as students in the experimental condition(s) but received no feedback on
them, and in other cases students in the control condition did not produce any experimental texts.
However, it is unclear in most of the studies whether students in the control condition did receive
some teacher feedback during their normal classroom instruction. Only three studies have explicitly
compared AWE feedback to teacher feedback.
In addition, the table shows that many of the studies have examined the effects of AWE feedback
on AWE texts. However, 11 of the studies focus partly or exclusively on the transfer effects of AWE to
the quality of texts that were not written using AWE.
Lastly, Table 5 shows that scores, followed by errors, are the most common written production
measures that have been examined in the studies. Other measures that have been examined include
text length, sentence length, lexical measures and number of citations.

Table 5
Methodological features.
Design (k): between groups 20; within groups 7; between & within 2; single group 1.
Reporting (k): statistical testing 23; descriptive statistics 3; no statistics 4.
Control (k): no feedback 17; teacher feedback 3; no feedback & teacher feedback 1; different AWE conditions 1; no control 8.
Text (k): AWE text 19; other text 9; both AWE and other text 2.
Outcome (k): scores 13; scores + other measures 11; errors 5; citations 1.
4. Critical review
The research survey has shown that the AWE pedagogical research domain is not a very mature
one. Even though written production has been the main focus of research to date, the total number
of studies carried out remains relatively small, and a number of these studies are either unpublished
papers or published in unranked journals, and perhaps as a consequence are lacking in rigor. Moreover,
these studies are highly heterogeneous, varying in terms of factors such as the AWE program that is
examined, the design of the study, and the educational context in which the studies were carried
out. Hence, not surprisingly, the research has produced mixed and sometimes contradictory results.
As a result, there is only modest evidence that AWE feedback has a positive effect on the quality of
students’ writing and, as the research survey showed, much of the available evidence relates to the
effectiveness of AWE in improving the quality of texts written using AWE feedback.
The evidence for the effects of AWE on writing quality from within group comparisons can be said
to be stronger than the evidence from between-group comparisons. In general, within-group studies
have shown that AWE scores increase and the number of errors decreases across AWE drafts and
texts produced by the same writers (e.g., Attali, 2004; Choi, 2010; El Ebyary & Windeatt, 2010; Foltz,
Laham, & Landauer, 1999; Shermis, Garvan, & Diao, 2008; Warden & Chen, 1995). This would appear
to indicate that writers are able to incorporate AWE feedback to improve the quality and accuracy of
AWE texts – at least according to the criteria that AWE programs use to evaluate texts. However, due
to methodological issues, some of the results of within-group studies need to be carefully interpreted.
To give an example, Attali (2004) excluded 71% of his data set from analysis because the writers did
not undertake any revising or redrafting. While the remaining students did on average increase their
score across drafts of the same texts, the lack of utilization of AWE by over two thirds of the cohort
at the very least places a question mark against the efficacy of AWE for stimulating students to revise
their texts. Moreover, an obvious limitation of within-group comparisons is that the lack of a control
group makes it difficult to conclude with certainty that improvements are actually attributable to the
use of AWE software. Improvements made by students to successive drafts of a particular text could be
attributable to their own revising skills rather than to their use of revisions suggested by AWE feedback.
Improvements made to successive texts could possibly be attributable to other instructional factors
or possibly even to developmental factors.
The findings from between-group comparisons, which compare one or more AWE conditions with
one or more control conditions, are more mixed, and those findings that provide positive evidence
frequently suffer from serious methodological drawbacks. More than half the studies using between
groups comparisons showed either mixed effects or no effects for AWE feedback on writing outcomes.
Mixed effects involve effects being found for some texts but not for others (e.g., Riedel, Dexter, Scharber,
& Doering, 2006), for some measures but not for others (e.g., Rock, 2007), or for some groups of writers
and not for others (e.g., Schroeder, Grohe, & Pogue, 2008). In a number of cases, in their discussions
these studies largely ignore any negative evidence and hence draw conclusions about the effectiveness
of AWE that are more optimistic than appear to be warranted. For example, in a study by Schroeder
et al. (2008) on the effectiveness of Criterion in improving writing in a criminal justice writing course,
one of the three groups of students utilizing AWE feedback did not achieve significantly higher final
course grades than the control group. However, possible reasons for the non-significance of the results
for this third group are not mentioned and a very strong positive conclusion is drawn: “Results from
this study overwhelmingly point toward the value of technology when teaching writing skills” (p.
444). However, we did also find an example in which the authors did not appear to do justice to
their findings. Chodorow et al. (2010) found that Criterion reduced the article error rate of non-native
speakers, but not of native speakers. However, the study did not report the article error rates for the
native speakers and did not raise the point that AWE may be less effective for native speakers simply
because native speakers do not tend to make many article errors. In this particular case, the lack of
a significant effect for native speakers should not be taken at face value as negative evidence for the
effectiveness of AWE.
A number of studies comparing AWE feedback to no feedback have found significant positive effects
for AWE on writing outcomes. For example, in a study by Franzke et al. (2005) on Summary Street
using a pre-test/posttest design with random assignment to an AWE or a no-feedback condition,
students in both conditions wrote four texts, the quality of which was scored by human raters. It was
found that the AWE condition had higher holistic and content scores on both the averaged score for
the four texts and for orthogonal comparisons of the scores for the first two texts with the last two
texts. However, many of the studies are not as well-designed, and do not include a pretest or other
information on the comparability of students in experimental and control groups. In particular, results
of studies that have compared writing outcomes of students who received AWE with those of students
in previous cohorts should be viewed with caution. For example, Grimes (2008) found that in three
out of four schools students who used My Access had higher external test scores than students from a
previous year who did not receive AWE feedback. However, the author acknowledges that it is difficult
to attribute this improvement to AWE as during the intervention period important improvements to
the quality of writing instruction provided by teachers were also instituted.
As shown by the research survey, only three studies have explicitly compared AWE feedback with
teacher feedback (i.e., Frost, 2008; Rock, 2007; Warden, 2000). As the evidence from these studies is
also mixed, it seems premature to draw any firm conclusions. However, it should be pointed out that
none of the studies shows that AWE feedback is less effective than teacher feedback, which as such
could be taken as a positive sign. Nonetheless, of concern is that these studies report little about the
nature of the teacher feedback given or whether this feedback was comparable to the AWE feedback.
For example, in Warden (2000), an AWE condition in which students received specific error feedback
is compared with a teacher feedback condition in which students received no specific feedback, but
only general comments on content, organization, and grammar. As students in the teacher feedback
condition received no specific feedback on the accuracy of their texts, it is hardly surprising that the
number of errors decreased more in the AWE condition.
In general, there appears to be more support for improvement of error rates than improvement of
holistic scores. For example, Kellogg, Whiteford, and Quinlan (2010) found that holistic scores did not
improve, but that errors were reduced. As the error types that were reduced largely related to linguistic
aspects of the text, they drew the conclusion that there was tentative support for learning about
mechanical aspects of writing from AWE. In contrast, Chen (1997) found that an AWE group and a no-
feedback control group decreased linguistic errors equally. However, the results of this study could
well be attributable to a methodological drawback, as both experimental and control groups were in
the same classes. In these classes, the teachers spent time reviewing the most common error types
found by the computer, in the presence of all the students. Hence, both groups of students may have
benefited from this instruction.
There appears to be no clear evidence as yet concerning whether AWE feedback is associated with
more generalized improvements in writing proficiency. Some of the studies that have examined trans-
fer of the effects of AWE to texts for which no AWE feedback has been provided found no significant
differences between scores for AWE and non-AWE conditions (i.e., Choi, 2010; Kellogg et al., 2010;
Shermis, Burstein, & Bliss, 2004). Moreover, although three studies did find evidence of transfer (Elliot
& Mikulas, 2004; Grimes, 2008; Wang & Wang, 2012), none of these studies is rigorously designed.
The Wang and Wang (2012) study had only one participant in each condition. The flaws in the Grimes
(2008) study have already been discussed. In Elliot and Mikulas (2004), in each of four sub-studies it
was claimed that AWE feedback was associated with better exam performance. However, there was
no random assignment to conditions and the reader is given no information concerning the char-
acteristics of the participants in the two conditions. In one of the sub-studies, students’ results are
compared with students from a year 2000 baseline. In addition, results for two of the four sub-studies
were not tested statistically, and those that were tested were tested non-parametrically. Also, some of
the claims seem to be rather remarkable, such as that a group who used MY Access! between February
and March of 2003 had a pass rate of 81% compared to only 46% for a group who did not receive AWE
feedback. It seems rather unlikely that such a short AWE intervention could lead to such a substantial
change in assessment outcomes, indicating that other factors may also be in operation.
However, it is important to be aware that one of the big unknowns of writing feedback received from
teachers is also whether it leads to any generalized improvements in students’ revising ability or in the
quality of their texts. Hyland and Hyland (2006) pointed out that research on human feedback rarely
looks beyond immediate correction in a subsequent draft, so AWE research is not alone
in neglecting this area. Closely connected to whether feedback can lead to generalized improvements
in writing is whether it assists students in developing their ability to revise independently. One of
the first steps in developing revising skills is that writers are able to notice aspects of their texts
that have not, up to that point, been salient (Schmidt, 1990; Truscott, 1998). Once a feature has been
noticed it becomes available for reflection and analysis. As Hyland and Hyland (2006) pointed out,
demonstrating that a student can utilize feedback to edit a draft tells us little about whether the
student has successfully acquired a feature. Similarly, it tells us little about whether the student has
developed the meta-cognitive skills to be able to notice, and then subsequently evaluate and correct
textual problems in other texts successfully.
Currently, we know little about whether AWE actually promotes independent revising. However,
there is some evidence that receiving AWE feedback may not actually encourage students to make
changes either between or within drafts. Attali (2004) reported that 71% of students did not redraft
their essays and 48% of those who did redraft did this only once. Grimes (2005) reported that a typical
revision pattern for students was to submit a first draft, correct a few mechanical errors and resubmit
as fast as possible to see if the score improved. Warden (2000) found that students who were offered
a redrafting opportunity after receiving AWE feedback from QBL actually spent significantly less time
revising their first drafts than students who received AWE feedback on a single draft with no redrafting
opportunity, or who received teacher feedback instead of AWE feedback. Students who received no
redrafting opportunity revised their texts before they received any feedback. They then submitted their
texts for marking, received a mark and AWE feedback, but were not given an opportunity to redraft
the text. In contrast, students who received AWE feedback and had an opportunity to redraft appeared
to carry out little independent editing, instead waiting for the program to tell them what was wrong
with their texts and then specifically correcting these errors. While these students were successful in
correcting errors detected by AWE, they made few other changes to their texts. Moreover, this trend
continued across successive assignments, suggesting that AWE feedback was not leading to much
development in revising skills. However, it is important to remember that these findings corroborate
findings from revision research that writers, particularly younger writers, revise little and revise
superficially (Faigley & Witte, 1981; Whalen & Ménard, 1995). It may be that some students simply
do not possess the revising skills needed to allow them to benefit from the revision opportunities
afforded by AWE.
5. Conclusions and recommendations
This critical review suggests that there is only modest evidence that AWE feedback has a positive
effect on the quality of the texts that students produce using AWE, and that as yet there is little
clarity about whether AWE is associated with more general improvements in writing proficiency.
Paucity of research, heterogeneity of existing research, the mixed nature of research findings, and
methodological issues in some of the existing research are factors that limit our ability to draw firm
conclusions concerning the effectiveness of AWE feedback.
Initially, we endeavored to meta-analyze effect sizes for the product studies in this sample. How-
ever, due to methodological issues, many of the studies had to be excluded, leaving us with a very
small but still highly heterogeneous sample. Heterogeneity necessitates the inclusion of moderator
analyses that examine the effects of variables such as AWE program, educational context and whether
AWE feedback was compared with no feedback or teacher feedback. However, with such a small sam-
ple, there was insufficient power to conduct moderator analyses. We felt that simply providing an
overall effect size that ignores possible effects of moderator variables was not a viable or meaningful option.
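To make concrete what such a meta-analysis would have pooled, the sketch below computes a standardized mean difference (Hedges' g) for a single hypothetical AWE-versus-control comparison; the means, standard deviations and group sizes are invented and are not drawn from any study in the sample.

```python
import math

def hedges_g(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Standardized mean difference with small-sample correction."""
    # Pooled standard deviation across the two groups.
    pooled_sd = math.sqrt(((n_1 - 1) * sd_1**2 + (n_2 - 1) * sd_2**2) / (n_1 + n_2 - 2))
    d = (mean_1 - mean_2) / pooled_sd
    # Correction factor that distinguishes Hedges' g from Cohen's d.
    return d * (1 - 3 / (4 * (n_1 + n_2) - 9))

# Invented example: mean holistic scores for an AWE group and a control group.
g = hedges_g(mean_1=4.1, sd_1=0.8, n_1=30, mean_2=3.7, sd_2=0.9, n_2=28)
print(f"Hedges' g = {g:.2f}")
```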
Instead, by carrying out a critical review we have been able to identify patterns in the existing
research as well as discuss gaps in the findings and issues in the methodologies. Below are rec-
ommendations that follow from this review that can serve as a guideline for further research in this
area.
Although this review has not allowed us to differentiate the effectiveness of specific AWE programs,
given differences in the objectives of the programs and the nature of the feedback provided, it is likely
that such differences do exist. So far, more research on the effects of AWE has been carried out for
Criterion than for other programs. Therefore, more studies examining other programs are called for,
and in particular studies comparing the effectiveness of more than one AWE program.
A number of the studies provided only sketchy descriptions of their participants in terms of factors
such as SES, language background, literacy levels, and computer literacy. Future research needs to
be more rigorous in reporting participant characteristics, in controlling for participant variables and,
where appropriate, including these as variables in the research design. In particular, further research
is needed that examines the effectiveness of AWE feedback in ESL and EFL settings, and compares
these to L1 settings. Given the tremendous diversity of student populations within the United States,
not to mention the diversity in potential markets for AWE programs in both English-speaking and
EFL contexts outside the United States, it is of particular importance that the effectiveness of AWE
feedback for second language learners be investigated. The commercial programs in use in the United
States were not originally designed for English as a second language populations, even though they
are being marketed with such populations in mind (Warschauer & Ware, 2006).
In addition, further research examining the relative effects of AWE feedback and teacher feedback
is needed, in which greater explanation of the nature and quality of feedback provided by teachers
is given and in which it is ensured that the kinds of feedback offered by teachers and AWE programs
are more comparable. As there are so many factors in play, it is likely to turn out to be too simplis-
tic to make overall pronouncements about whether human feedback or computer feedback is better.
What needs to be disentangled is whether it really is the source of the feedback that matters,
or whether it is other factors such as the way it is delivered, and the nature of the feedback pro-
vided that make the difference. It is also important to be aware that, as it is frequently reiterated by
developers and researchers alike that AWE feedback is intended to augment teacher feedback rather
than replace it (e.g., Chen & Cheng, 2008; Kellogg et al., 2010; Philips, 2007), research into the rel-
ative effects of different ways of integrating AWE feedback into classroom writing instruction may
have greater ecological validity. In a qualitative study involving the use of AWE feedback in three
classrooms, Chen and Cheng (2008) found indications that AWE feedback may indeed be more
effective when it is combined with human feedback. However, this study did not examine the effects
of different methods of integration on written production. There are a variety of possible ways of
combining AWE with teacher feedback, and of scaffolding AWE feedback. To name just a few, students
can use AWE to help them improve the quality of initial drafts and then submit them to the teacher
for feedback; teachers can use AWE as a diagnostic tool for identifying the problems that students
have with their writing; and/or teachers can provide initial training. Research that investigates differ-
ent possibilities for integrating AWE into classroom writing instruction would also be of pedagogical
value.
Some might argue that in terms of the effectiveness of AWE feedback, the bottom line is whether
the scores it generates correlate with external assessment outcomes and whether its repeated use in
the classroom improves students’ test results. However, while it is highly desirable that the transfer of
the effects of AWE feedback to non-AWE texts be established, it is questionable whether external exams
provide the most appropriate means of doing so. Firstly, as Warschauer and Ware (2006) remark,
exam writing is generally based on a single draft in timed circumstances, whereas the whole point of
AWE is that it encourages multiple drafting. Secondly, the scoring on exams may be too far removed
from the aspects for which AWE provides feedback. Thirdly, AWE feedback may not be robust enough
as an instructional intervention to impact noticeably on exam scores. Instead, we would recommend
examining transfer of the effects of AWE feedback in non-test situations using texts that are similar
in terms of genres and topics to the AWE texts students have been writing.
The question remains, of course, whether the kinds of writing that AWE feedback gives writers the
opportunity to engage in actually reflect the kinds of writing that students do in their classrooms.
AWE programs generally offer only a limited number of genres, such as persuasive, narrative and
informative genres, though some programs such as My Access! additionally enable teachers to use
their own prompts (See Grimes & Warschauer, 2010). Moreover, as mentioned, AWE has been accused
of promoting formulaic writing with an unimaginative five-paragraph structure. The way lies open for
AWE research to include a greater consideration of genre by controlling for genre as a variable, and by
systematically examining the influence of genre on the effectiveness of AWE feedback, for example,
by comparing the effects of AWE when standard prompts are used with the effects when teachers’
own prompts are used.
In conclusion, this study has carried out a critical review of research that examines the effects
of formative AWE feedback on the quality of texts that students produce. It has illuminated what is
known and what is not known about the effects of AWE feedback on writing. It could be argued that
a limitation of the study is that it takes a narrow view of effectiveness in terms of a single dimension:
written production measures. It does not focus on either of the other two dimensions of effectiveness
identified by Lai (2010): the effects on writing processes or perceived usefulness. However, we feel that
Lai’s first dimension is an appropriate and valuable focal point for a critical review, because improving
students’ writing is central to the objectives of AWE and to claims regarding its effectiveness, both
of which are reflected in the fact that, as this study has shown, the bulk of research conducted so
far focuses on written production. We certainly do applaud AWE research that takes a triangulated
approach to AWE by incorporating the effects of AWE on written production (product perspective), on
revision processes and learning and teaching processes (process perspective) and on writers’ and
teachers’ perceptions (perception perspective) (e.g., Choi, 2010; Grimes, 2008). We would also join
in the plea made by Liu et al. (2002) concerning research on computer-based technology: “rather
than focusing on the benefits and potentials of computer technology, research needs to move toward
explaining how computers can be used to support (second) language learning – i.e., what kind of tasks
or activities should be used and in what kinds of settings” (pp. 26–27). Consequently, as the next step,
in a follow-up study we will examine the use of AWE feedback in the classroom, including teaching
and learning processes and teacher and learner perceptions.
Appendix A. Research survey sample
References marked with an asterisk indicate studies that examine solely or partially the effects of AWE on writing outcomes, and which therefore have been included in the critical review.
*Attali, Y. (2004). Exploring feedback and revision features of Criterion. Paper presented at the National Council on Measurement in
Education San Diego, April 12–16, 2004.
*Chen, J. F. (1997). Computer generated error feedback and writing process: A link [Electronic Version]. TESL-EJ, 2. Retrieved from
http://tesl-ej.org/ej07/a1.html
Chen, C. E., & Cheng, W. (2008). Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning
effectiveness in EFL writing classes. Language Learning and Technology, 12(2), 94–112.
*Chodorow, M., Gamon, M., & Tetreault, J. (2010). The utility of article and preposition error correction systems for English
language learners: Feedback and assessment. Language Testing, 27(3), 419–436.
*Choi, J. (2010). The impact of automated essay scoring (AES) for improving English language learners essay writing. (Doctoral
dissertation. University of Virginia, 2010).
*El Ebyary, K., & Windeatt, S. (2010). The impact of computer-based feedback on students’ written work. International Journal
of English Studies, 10(2), 121–142.
*Elliot, S., & Mikulas, C. (2004). The impact of MY Access! use on student writing performance: A technology overview and four
studies. Paper presented at the Annual Meeting of the American Educational Research Association.
*Foltz, P. W., Laham, D., & Landauer, T. K. (1999). The intelligent essay assessor: Applications to educational technology. Interactive
Multimedia Educational Journal of Computer-Enhanced Learning, 1(2). Retrieved from www.knowledge-technologies.com
*Franzke, M., Kintsch, E., Caccamise, D., & Johnson, N. (2005). Summary Street: Computer support for comprehension and writing.
Journal of Educational Computing Research, 33(1), 53–80.
*Frost, K. L. (2008). The effects of automated essay scoring as a high school classroom Intervention, PhD thesis. Las Vegas: University
of Nevada.
*Grimes, D. C. (2008). Middle school use of automated writing evaluation: A multi-site case study, PhD thesis. Irvine: University of
California.
*Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of automated writing evaluation. Journal
of Technology, Learning, and Assessment, 8(6), 1–43.
*Kellogg, R., Whiteford, A., & Quinlan, T. (2010). Does automated feedback help students learn to write? Journal of Educational
Computing Research, 42, 173–196.
Lai, Y.-H. (2010). Which do students prefer to evaluate their essays: Peers or computer program. British Journal of Educational
Technology, 41(3), 432–454.
*Riedel, E., Dexter, S. L., Scharber, C., & Doering, A. (2006). Experimental evidence on the effectiveness of automated essay scoring
in teacher education cases. Journal of Educational Computing Research, 35(3), 267–287.
*Rock, J. (2007). The impact of short-term use of Criterion on writing skills in 9th grade (Research Report RR-07-07). Princeton, NJ:
Educational Testing Service.
Scharber, C., Dexter, S., & Riedel, E. (2008). Students’ experiences with an automated essay scorer. The Journal of Technology,
Learning and Assessment, 7(1), 1–44.
*Shermis, M. D., Burstein, J., & Bliss, L. (2004). The impact of automated essay scoring on high stakes writing assessments. Paper
Presented at the Annual Meeting of the National Council on Measurement in Education.
*Shermis, M., Garvan, C. W., & Diao, Y. (2008). The impact of automated essay scoring on writing outcomes. Paper presented at the
Annual Meetings of the National Council on Measurement in Education, March 25–27, 2008.
*Schroeder, J. A., Grohe, B., & Pogue, R. (2008). The impact of criterion writing evaluation technology on criminal justice student
writing skills. Journal of Criminal Justice Education, 19(3), 432–445.
*Wang, F., & Wang, S. (2012). A comparative study on the influence of automated evaluation system and teacher grading on
students’ English writing. Procedia Engineering, 29, 993–997.
*Warden, C. A. (2000). EFL business writing behavior in differing feedback environments. Language Learning, 50(4), 573–616.
Warden, C. A., & Chen, J. F. (1995). Improving feedback while decreasing teacher burden in ROC ESL business English classes.
In P. Porythiaux, T. Boswood, & B. Babcock (Eds.), Explorations in English for professional communications. Hong Kong: City
University of Hong Kong.
Other references
Anson, C. M. (2006). Can’t touch this: Reflections on the servitude of computers as readers. In P. Freitag Ericsson, & R. Haswell
(Eds.), Machine scoring of student essays (pp. 38–56). Logan, Utah: Utah State University
Press.
Biber, D., Nekrasova, T., & Horn, B. (2011). The effectiveness of feedback for L1-english and L2-writing development: A meta-analysis.
(ETS Research Report RR-11-05). Princeton, NJ: ETS.
Burstein, J., Chodorow, M., & Leacock, C. (2004). Automated essay evaluation: The Criterion online writing service. AI Magazine
(Fall), 27–36.
Chandrasegaran, A., Ellis, M., & Poedjosoedarmo, G. (2005). Essay assist: Developing software for writing skills improvement in
partnership with students. RELC Journal, 36(2), 137–155.
Dikli, S. (2006). An overview of automated scoring of essays. The Journal of Technology, Learning and Assessment, 5(1), 1–35.
Faigley, L., & Witte, S. (1981). Analyzing revision. College Composition and Communication, 32, 400–414.
Freitag Ericsson, P. (2006). The meaning of meaning. In P. Freitag Ericsson, & R. Haswell (Eds.), Machine scoring of student essays.
Logan Utah: Utah State University Press.
Grimes, D. (2005). Assessing automated assessment: Essay evaluation software in the classroom. Paper presented at the Computers
and Writing Conference Stanford, CA.
Herrington, A., & Moran, C. (2001). What happens when machines read our students’ writing? College English, 63(4), 480–499.
Hyland, K., & Hyland, F. (2006). Feedback on second language students’ writing. Language Teaching, 39, 83–101.
Patterson, N. (2005). Computerized writing assessment: Technology gone wrong. Voices From the Middle, 13(2), 56–57.
Philips, S. M. (2007). Automated essay scoring: A literature review (SAEE research series #30). Kelowna, BC: Society for the
Advancement of Excellence in Education.
Schmidt, R. W. (1990). The role of consciousness in second language learning. Applied Linguistics, 11(2), 129–158.
Shermis, M. D., & Burstein, J. (Eds.). (2003). Automated essay scoring: A cross-disciplinary perspective. Hillsdale, NJ: Lawrence
Erlbaum Associates.
Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. New
York and London: Routledge.
Taylor, A. R. (2005). A future in the process of arrival: Using computer technologies for the assessment of learning. TASA Institute,
Society for the Advancement of Excellence in Education.
Truscott, J. (1996). The case against grammar correction in L2 writing classes. Language Learning, 46(2), 327–369.
Truscott, J. (1998). Noticing in second language acquisition: A critical review. Second Language Research, 14(2), 103–135.
Warschauer, M., & Ware, J. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching
Research, 10(2), 1–24.
Whalen, K., & Ménard, N. (1995). L1 and L2 writers’ strategic and linguistic knowledge: A model of multiple-level discourse
processing. Language Learning, 44(3), 381–418.
Yang, Y., Buckendahl, C. W., Juszkiewicz, P. J., & Bhola, D. S. (2002). A review of strategies for validating computer-automated scoring. Applied Measurement in Education, 15(4), 391–412.
Zellermayer, M., Salomon, G., Globerson, T., & Givon, H. (1991). Enhancing writing-related metacognitions through a computerized writing partner. American Educational Research Journal, 28(2), 373–391.
Further reading
*Britt, A., Wiemer-Hastings, P., Larson, A., & Perfetti, C. (2004). Using intelligent feedback to improve sourcing and integration
in students’ essays. International Journal of Artificial Intelligence in Education, 14, 359–374.
Dikli, S. (2007). Automated essay scoring in an ESL setting. (Doctoral dissertation, Florida State University, 2007).
*Hoon, T. (2006). Online automated essay assessment: Potentials for writing development. Retrieved from http://ausweb.scu.edu.au/aw06/papers/refereed/tan3/paper.html
*Lee, C., Wong, K. C. K., Cheung, W. K., & Lee, F. S. L. (2009). Web-based essay critiquing system and EFL students’ writing: A
quantitative and qualitative investigation. Computer Assisted Language Learning, 22(1), 57–72.
*Matsumoto, K., & Akahori, K. (2008). Evaluation of the use of automated writing assessment software. In C. Bonk, et al. (Eds.), Proceedings of World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education 2008 (pp. 1827–1832). Chesapeake, VA: AACE.
*Schreiner, M. E. (2002). The role of automatic feedback in the summarization of narrative text (PhD thesis). University of Colorado.
*Steinhart, D. J. (2001). An intelligent tutoring system for improving student writing through the use of latent semantic analysis.
Boulder: University of Colorado.
Wade-Stein, D., & Kintsch, E. (2004). Summary Street: Interactive computer support for writing. Cognition and Instruction, 22(3),
333–362.
Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom. Pedagogies: An International Journal, 3,
22–36.
Yao, Y. C., & Warden, C. A. (1996). Process writing and computer correction: Happy wedding or shotgun marriage? [Electronic version]. CALL Electronic Journal. Available at http://www.lerc.ritsumei.ac.jp/callej/1-1/Warden1.html
  • 2. 52 M. Stevenson, A. Phakiti / Assessing Writing 19 (2014) 51–65 known as Automated Writing Evaluation (AWE).1 AWE software provides computer-generated feed- back on the quality of written texts. A central component of AWE software is a scoring engine that generates automated scores based on techniques such as artificial intelligence, natural language processing and latent semantic analysis (See Dikli, 2006; Philips, 2007; Shermis & Burstein, 2003; Yang, Buckendahl, Juszkiewicz, & Bhola, 2002). AWE software that is used for pedagogical purposes also provides written feedback in the form of general comments, specific comments and/or corrections. Originally, AWE was primarily used in high-stakes testing situations to generate summative scores to be used for assessment purposes. Widely used, commercially available scoring engines are Project Essay GraderTM (PEG), e-rater®, Intelligent Essay AssessorTM (IEA), and IntelliMetricTM. In recent years, the use of AWE for the provision of formative feedback in the writing classroom has steadily increased, particularly in classrooms in the United States. AWE programs are currently being used in many elementary, high school, college and university classrooms with a range of writers from diverse backgrounds. Examples of commercially available AWE programs designed for classroom use are: Criterion (Educational Testing Service: MY Access! (Vantage Learning): Write to Learn and Sum- mary Street (Pearson Knowledge Technologies); and Writing Roadmap (McGraw Hill). These programs sometimes incorporate the same scoring engine as used in summative programs. For example, Crite- rion incorporates the e-rater scoring engine and MY Access! incorporates the IntellimetricTM scoring engine. Common to all AWE programs designed for classroom use is that they provide writers with multiple drafting opportunities, and upon receiving feedback writers can choose whether or not to use this feedback to revise their texts. AWE programs vary in the kinds of feedback they provide writers. Some provide feedback on both global writing skills and language use (e.g., Criterion, MY Access!), whereas others focus on language use (e.g., QBL) and some claim to focus primarily on content knowledge (e.g., Write to Learn and Summary Street). Some programs incorporate other tools such as model essays, scoring rubrics, graphic organizers, and dictionaries and thesauri. Like many other forms of educational technology, the use of AWE in the classroom has been the subject of controversy, with scholars taking divergent stances. On the one hand, AWE has been hailed as a means of liberating instructors, freeing them up to devote valuable time to aspects of writing instruction other than marking assignments (e.g., Burstein, Chodorow, & Leacock, 2004; Herrington & Moran, 2001; Hyland & Hyland, 2006; Philips, 2007). It has been seen as impacting positively on the quality of students’ writing, due to the immediacy of its ‘on-line’ feedback (Dikli, 2006), and the multiple practice and revision opportunities it provides (Warschauer & Ware, 2006). It has also been claimed to have positive effects on student autonomy (Chen & Cheng, 2008). On the other hand, the notion that computers are capable of providing effective writing feedback has aroused considerable suspicion, perhaps fueled by the fearful specter of a world in which humans are replaced by machines. Criticisms have been made concerning the capacity of AWE to provide accurate and meaningful scores (e.g., Anson, 2006; Freitag Ericsson, 2006). 
There is a common perception that computers are not capable of scoring human texts, as they do not possess human inferencing skills and background knowledge (Anson, 2006). Other criticisms relate to the effects that AWE has on students’ writing. AWE has been accused of reflecting and promoting a primarily formalist approach to writing, in which writing is viewed as simply being “mastery of a set of subskills” (Hyland & Hyland, 2006, p. 95). Comments generated by AWE have been said to place too much emphasis on surface features of writing, such as grammatical correctness (Hyland & Hyland, 2006) and the effects of writing for a non-human audience have been decried. There is also fear that using AWE feedback may be more of an exercise in developing test-taking strategies than in developing writing skills, with students writing to the test by consciously or unconsciously adjusting their writing to meet the criteria of the software (Patterson, 2005). Positive and negative claims regarding the effects of AWE on students’ writing are not always based on empirical evidence, and at times appear to reflect authors’ own ‘techno-positivistic’ or ‘technopho- bic’ stances toward technology in the writing classroom. Moreover, quite a lot of the research that has 1 Other terms found in the literature are automated essay evaluation (AEE) (See Shermis & Burstein, 2013) and writing evaluation technology.
  • 3. M. Stevenson, A. Phakiti / Assessing Writing 19 (2014) 51–65 53 been carried out is or from authors who have been involved in developing a particular AWE program or who are affiliated with organizations that have developed these programs, so could contain a bias toward showing AWE in a positive light. Consequently, there is lack of clarity concerning the current state of evidence for the effects on the quality of students’ writing of AWE programs designed for teaching and learning purposes. However, it is important to be aware that over the past decades there has also been controversy about the effects of teacher feedback on writing. Perhaps the strongest opponent of classroom writing feedback was Truscott (1996), who claimed that feedback on grammar should be abandoned, as it ignored deeper learning processes, only led to pseudo-learning and had a negative effect on the quality of students’ writing. While most scholars have taken less extreme positions, in a review of issues relating to feedback in the classroom, Hyland and Hyland (2006) concluded that there was surprisingly little consensus about the kinds of feedback that are effective and in particular about the long term effects of feedback on writing development. However, some research synthetic evidence exists for the effectiveness of teacher feedback. In a recent met-analytic study, Biber, Nekrasova, and Horn (2011) found that, when compared to no feedback, teacher feedback was associated with gains in writing development for both first and second language writers. They found that a focus on content and language use was more effective than focus on a focus on form only, especially for second language writers. They also found that comments were more effective than error correction, even for improving grammatical accuracy. It is therefore timely to evaluate whether there is evidence that computer- generated feedback is also associated with improvements in writing. To date, the thrust of AWE research has been on validation through the examination of the psy- chometric properties of AWE scores by, for example, calculating the degree of correlation between computer-generated scores and scores given by human raters. Studies have frequently found high correlations between AWE scores and human scores. and these results have been taken as providing evidence that AWE scores provide a psychometrically valid measure of students’ writing. (See two volumes edited by Shermis and Burstein (2003, 2013) for detailed results and in-depth discussion of the reliability and validity of specific AWE systems). Such studies, however, do not inform us about whether AWE is effective as a classroom tool to actually improve students’ writing. As Warschauer and Ware (2006) pointed out, while evidence of psychometric reliability and validity is a necessary pre-requisite, it is not sufficient for understanding whether AWE ‘works’ in the sense of contributing to positive outcomes for student learning. Even the recently published ‘Handbook of Automated Essay Evaluation’ (Shermis & Burstein, 2013), although it pays some attention to AWE as a teaching and learning tool, still has a strong psychometric and assessment focus. Although a number of individual studies have examined the effects of AWE feedback in the classroom, no comprehensive review of the literature exists that examines whether AWE feedback improves the quality of students’ writing. 
Warschauer and Ware (2006) provided a thought-provoking discussion of some existing research on AWE in the classroom and used this to make recommenda- tions for future AWE research. However, they only provided a limited review that did not include all of the then available research and did not provide an overview of the evidence for the effects of AWE on students’ writing. Moreover, since their paper was written a number of studies have been published in this area. 2. The current study The current study provides an evaluation of the available evidence for the effects of AWE feedback in the writing classroom in terms of written production. The study focuses on research involving AWE systems specifically designed as tools for providing formative evaluation in the writing classroom, rather than AWE systems designed to provide summative assessment in testing situations. The purpose of formative evaluation is to provide writers with individual feedback that can form the basis for further learning (Philips, 2007). In formative evaluation, there is a need to inform students not only about their level of achievement, but also about their specific strengths and weaknesses. Formative evaluation can be said to involve assessment for learning, rather than assessment of learning (Taylor, 2005). In this study, feedback is viewed as encompassing both numeric feedback (i.e., scores and ratings) and written
  • 4. 54 M. Stevenson, A. Phakiti / Assessing Writing 19 (2014) 51–65 feedback (i.e., global or specific comments on the quality of the text and/or identification of specific problems in the actual text). The study focuses on the effects of AWE on written production, because the capability to improve the quality of students’ texts is central to claims made about the effectiveness of AWE feedback, and because, likely as a consequence of this, the bulk of AWE pedagogical research focuses on written pro- duction outcomes. The study includes AWE research on students from diverse backgrounds, in diverse teaching contexts, and receiving diverse kinds of feedback from diverse AWE programs. The scope of the research included is broad due to the relatively small number of existing studies and the hetero- geneity of these studies. The study does not aim to make comparisons or draw conclusions about the relative effects of AWE feedback on student writing for specific populations, contexts, feedback types or programs. Instead, it aims to critically evaluate the effects of AWE feedback on written production by identifying general patterns and trends, and identifying issues and factors that may impact on these effects. The study is divided into two stages: a research survey and a critical review. The objective of the research survey is to determine the maturity of the research domain, and to provide a characterization of the existing research that can be drawn on in the critical review. The objective of the critical review, which is the central stage, is to identify overall patterns in the research findings and to evaluate and interpret these findings, taking account of relevant issues and factors. 3. Method 3.1. The literature search A comprehensive and systematic literature search was conducted to identify relevant primary sources for inclusion in the research survey and critical review. Both published research (i.e., journal articles, book chapters and reports) and unpublished research (i.e., theses and conference papers) were identified. The following means of identifying research were used: a) Search engines: Google Scholar, Google. b) Databases: ERIC, MLA, PsychInfo, SSCI, MLA, Ovid, PubPsych, Linguistics and Language Behavior Abstracts (LLBA), Dissertation Abstracts International, Academic Search Elite, Expanded Academic, ProQuest Dissertation and Theses Full-text, and Australian Education Index. c) Search terms used: automated writing evaluation, automated writing feedback, computer- generated feedback, computer feedback, and automated essay scoring automated evaluation, electronic feedback, and program names (e.g., Criterion, Summary Street, Intelligent Essay Assessor, Write to Learn, MY Access!). d) Websites: ETS website (ets.org) (ETS Research Reports, TOEFL iBT Insight series, TOEFL iBT research series, TOEFL Research Reports); AWE software websites. 
e) Journals from 1990 to 2011: CAELL Journal; CALICO Journal; College English; English Journal; Com- puter Assisted Language Learning; Computers and Composition; Educational Technology Research and Development; English for Specific Purposes; IEEE Intelligent Systems; Journal of Basic Writ- ing; Journal of Computer-Based Instruction; Journal of Educational Computing Research; Journal of Research on Technology in Education; Journal of Second Language Writing; Journal of Technol- ogy; Journal of Technology, Learning and Assessment, Language Learning and Technology; Language Learning; Language Teaching Research; Learning, and Assessment; ReCALL; System; TESL-EJ. f) Reference lists of already identified publications. In particular, the Ericson and Haswell (2006) bibliography. To be included, a primary source had to focus on empirical research on the use AWE feedback generated by one or more commercially or non-commercially available programs for the formative evaluation of texts in the writing classroom. The program reported on needed to provide text- specific feedback. Studies were excluded that reported on programs that provided generic writing guidelines (e.g., The Writing Partner: Zellermayer, Salomon, Globerson, & Givon, 1991; Essay Assist:
  • 5. M. Stevenson, A. Phakiti / Assessing Writing 19 (2014) 51–65 55 Chandrasegaran, Ellis, & Poedjosoedarmo, 2005). Studies that reported results already reported else- where were also excluded. Where the same results were reported more than once, published studies were chosen above unpublished ones, or if both were published, the first publication was chosen. This led to the exclusion of Grimes (2005) and Kintsch et al. (2000). Based on the above criteria, 33 primary sources were identified for inclusion in the research survey (See Appendix A). 3.2. Coding of research survey A coding scheme of study descriptors was developed for the research survey. The unit of coding was the study. A study was defined as consisting of “a set of data collected under a single research plan from a designated sample of respondents” (Lipsey & Wilson, 2001, p. 76). As one of the publications, Elliot and Mikulas (2004), included four studies with different samples, this led to a total of 36 studies being identified. In order to obtain an overview of the scope of the research domain, the studies were first classified in terms of constructs of effectiveness: Product, Process and Perceptions. Lai (2010) defined effec- tiveness of AWE feedback in terms of three dimensions: (1) the effects on written production (e.g., quality scores, error frequencies and rates, lexical measures and text length); (2) the effects on writing processes (e.g., rates and types of revisions, editing time, time on task, and rates of text production); and (3) perceived usefulness. In our study, combinations of these constructs were possible, as some studies included more than one construct. Subsequently, as the focus of the study is writing outcomes, only studies that included Product measurements were coded in terms of Substantive descriptors and Methodological descriptors (See Lipsey & Wilson, 2001). Substantive descriptors relate to substantive aspects of the study, such as the characteristics of the intervention and the research context. Methodological descriptors relate to the methods and procedures used in the study. Table 1 lists the coding categories for both kinds of descrip- tors and the coding options within each category. In the methodological descriptors, ‘Control group’ refers to whether the study included a control condition and whether this involved comparing AWE feedback with a no feedback condition or with a teacher feedback condition. ‘Text’ refers to whether outcomes were measured using texts for which AWE feedback had been received or other texts, such as writing assessment tasks. ‘Outcome measure’ refers to the measure(s) of written production that were included in the study. The coding categories and options were developed inductively by reading through the sample studies. Developing the coding scheme was a cyclical process, and each study was coded a number of times, until the coding scheme was sufficiently refined. These coding cycles were carried out by the first researcher. The reliability of the coding was checked through the coding of 12 studies (one Table 1 Research survey coding scheme. 
Categories Descriptors Substantive descriptors Publication type ISI-listed journal; non-ISI listed journal; book chapter; thesis; report; unpublished paper AWE program Open coding Country Open coding Educational context Elementary; high school; elementary/high school; university & college Language background L1; L1 & ESL; EFL/ESL only; unspecified Methodological descriptors Design Between group; within-groups; between & within group; single group Reporting Statistical testing; descriptive statistics; no statistics Control group No feedback; teacher feedback; no feedback & teacher feedback; different AWE conditions; no control group Text AWE texts; other texts; AWE texts & other texts Outcome measure Scores; scores & other product measures; errors; citations
  • 6. 56 M. Stevenson, A. Phakiti / Assessing Writing 19 (2014) 51–65 Table 2 Research survey: constructs. Construct Frequency Product 17 Product & process 4 Product & perceptions 5 Product, process, & perceptions 4 Perceptions 5 Perceptions & process 1 Total 36 third of the data) by the second researcher. Rater reliability was calculated using Cohen’s kappa. For the substantive descriptors the kappa values were all 1.00, except for language background, which was .75. For the methodological descriptors the kappa values were .85 for Design, 1.00 for Reporting, .85 for Control group, 1.00 for Text and .85 for Outcome measure. Any disagreements were resolved through discussion. For the research survey, the frequencies of the coding categories were collated and this information was used to describe the characteristics of the studies in the research sample. For the critical literature review, the findings of the sample studies were critically discussed in relation to the characteristics of the studies identified in the research survey and also in relation to strengths or weaknesses of particular studies. 3.3. Research survey Table 2 shows that the primary focus of AWE research has so far been on the effects of AWE on written production. Thirty of the thirty six studies include Product measures: 17 focus solely on Prod- uct, and another 13 studies involve Product in combination with one or more of the other constructs. The secondary focus has been on Perceptions, with five studies focusing solely on Perceptions, and another 10 including Perceptions. No studies have focused solely on Process. In the remaining survey, the thirty studies involving product measurements are characterized. Table 3 shows that, in terms of types of publication, relatively few of the studies have appeared in ISI-listed journals or in books. A number of the studies are from non-ISI-listed journals, and a number are unpublished papers from conferences or websites. Table 3 also shows that 10 AWE programs are involved in the sample and that the majority of these have been developed by organizations that are major players in the field of educational technology: Criterion from ETS, MY Access! from Vantage Learning, IEA and Summary Street from Pearson Knowledge Analysis Technologies. Criterion is the program that has been examined most frequently. Criterion, MY Access! and Writing Roadmap provide scores and feedback on both content and language. However, one of the studies that examined Criterion (i.e., Chodorow, Gamon, & Tetreault, 2010) limited itself to examining feedback on article errors. Summary Street, IEA, LSA and ECS are all Table 3 Publication, program and feedback. Publication K Program K Feedback K ISI-listed 7 Criterion 11 Content & language 20 Non-ISI-listed 7 My access 5 Content 5 Book chapter 1 Writing roadmap 1 Language 4 Thesis 5 ETIPS 2 Citations 1 Report 1 IEA 1 Unpublished paper 9 LSA semantic space 1 Summary street 3 ECS 1 SAIF 1 QBL 4
  • 7. M. Stevenson, A. Phakiti / Assessing Writing 19 (2014) 51–65 57 based on a technique known as latent sematic analysis that purports to focus primarily on content feedback. ETIPS provides feedback for pre-service teachers on tasks carried out in an on-line case-based learning environment. SAIF provides feedback on the citations in a text. QBL provides comments on language errors only. The table shows that most of the studies have involved programs that provide both content and language feedback. Table 4 shows that the majority of studies were carried out in classrooms in the United States, with the remaining studies being carried out in Asian countries, with the exception of a single study carried out in Egypt. University and college contexts were the most common, followed by high school contexts, and then elementary contexts. Almost half the studies do not specify the language background of the participants. Among the studies that did report the language backgrounds of the participants, only two of the studies (i.e., Chodorow et al., 2010; Choi, 2010) investigated the effects of language background on the effects of AWE feedback as a variable. Chodorow et al. (2010) compared the effects of Criterion feedback on the article errors of native and non-native speakers, and Choi (2010) compared the effects of Criterion feedback on written production measures of EFL students in Korea and ESL students in the U.S. 3.4. Methodological features Table 5 shows that most of the studies involved statistical testing, and that between group designs, in which one or more AWE conditions were compared with one or more control conditions, are the most common design. There were also a number of within group comparisons in which the same group of students was compared across drafts and/or texts. One study (i.e., Scharber, Dexter, & Riedel, 2008) used a single group design in which students’ ETIPS scores were correlated with the number of drafts they submitted. Table 5 also shows that the most common control group for the between group comparisons involved a condition in which students received no feedback. In some cases, students in this con- dition wrote the same texts as students in the experimental condition(s) but received no feedback on them, and in other cases students in the control condition did not produce any experimental texts. However, it is unclear in most of the studies whether students in the control condition did receive some teacher feedback during their normal classroom instruction. Only three studies have explicitly compared AWE feedback to teacher feedback. In addition, the table shows that many of the studies have examined the effects of AWE feedback on AWE texts. However, 11 of the studies focus partly or exclusively on the transfer effects of AWE to the quality of texts that were not written using AWE. Lastly, Table 5 shows that scores followed by errors are the most common writing production measures that have been examined in the studies. Other measures that have been examined include text length, sentence length, lexical measures and number of citations. 4. Critical review The research survey has shown that the AWE pedagogical research domain is not a very mature one. Even though written production has been the main focus of research to date, the total number of studies carried out remains relatively small, and a number of these studies are either unpublished papers or published in unranked journals, and perhaps as a consequence are lacking in rigor. 
Moreover, these studies are highly heterogeneous, varying in terms of factors such as the AWE program that is examined, the design of the study, and the educational context in which the studies were carried out. Hence, not surprisingly, the research has produced mixed and sometimes contradictory results. As a result, there is only modest evidence that AWE feedback has a positive effect on the quality of students’ writing and, as the research survey showed, much of the available evidence relates to the effectiveness of AWE in improving the quality of texts written using AWE feedback. The evidence for the effects of AWE on writing quality from within group comparisons can be said to be stronger than the evidence from between-group comparisons. In general, within-group studies have shown that AWE scores increase and the number of errors decrease across AWE drafts and texts produced by the same writers (e.g., Attali, 2004; Choi, 2010; El Ebyary & Windeatt, 2010; Foltz,
Table 4
Country, context, language background and sample size.

Country       k    Educational context        k    Language background   k    Sample size   k
USA           21   University & College       17   L1                    1    <10           1
Taiwan        4    High school                8    Mixed                 6    11–50         4
USA & Korea   1    Elementary                 3    EFL                   8    51–100        9
Japan         1    Elementary & High school   2    EFL & ESL             1    101–200       5
China         1                                    Unspecified           14   >200          10
Hong Kong     1                                                               Unspecified   1
Egypt         1
Table 5
Methodological features.

Design             k    Reporting                k    Control                          k    Text                      k    Outcome                   k
Between groups     20   Statistical testing      23   No feedback                      17   AWE text                  19   Scores                    13
Within groups      7    Descriptive statistics   3    Teacher feedback                 3    Other text                9    Scores + other measures   11
Between & within   2    No statistics            4    No feedback & teacher feedback   1    Both AWE and other text   2    Errors                    5
Single group       1                                  Different AWE conditions         1                                   Citations                 1
                                                      No control                       8
This would appear to indicate that writers are able to incorporate AWE feedback to improve the quality and accuracy of AWE texts, at least according to the criteria that AWE programs use to evaluate texts. However, due to methodological issues, some of the results of within-group studies need to be interpreted carefully. To give an example, Attali (2004) excluded 71% of his data set from analysis because the writers did not undertake any revising or redrafting. While the remaining students did on average increase their scores across drafts of the same texts, the lack of utilization of AWE by over two thirds of the cohort at the very least places a question mark against the efficacy of AWE for stimulating students to revise their texts. Moreover, an obvious limitation of within-group comparisons is that the lack of a control group makes it difficult to conclude with certainty that improvements are actually attributable to the use of AWE software. Improvements made by students to successive drafts of a particular text could be attributable to their own revising skills rather than to their use of revisions suggested by AWE feedback. Improvements made to successive texts could be attributable to other instructional factors, or possibly even to developmental factors.

The findings from between-group comparisons, which compare one or more AWE conditions with one or more control conditions, are more mixed, and those findings that provide positive evidence frequently suffer from serious methodological drawbacks. More than half the studies using between-group comparisons showed either mixed effects or no effects for AWE feedback on writing outcomes. Mixed effects involve effects being found for some texts but not for others (e.g., Riedel, Dexter, Scharber, & Doering, 2006), for some measures but not for others (e.g., Rock, 2007), or for some groups of writers but not for others (e.g., Schroeder, Grohe, & Pogue, 2008). In a number of cases, the discussions in these studies largely ignore negative evidence and hence draw conclusions about the effectiveness of AWE that are more optimistic than appears warranted. For example, in a study by Schroeder et al. (2008) on the effectiveness of Criterion in improving writing in a criminal justice writing course, one of the three groups of students utilizing AWE feedback did not achieve significantly higher final course grades than the control group. However, possible reasons for the non-significance of the results for this third group are not mentioned, and a very strong positive conclusion is drawn: "Results from this study overwhelmingly point toward the value of technology when teaching writing skills" (p. 444). Conversely, we also found an example in which the authors did not appear to do full justice to their findings. Chodorow et al. (2010) found that Criterion reduced the article error rate of non-native speakers, but not of native speakers. However, the study did not report the article error rates for the native speakers and did not raise the point that AWE may be less effective for native speakers simply because native speakers do not tend to make many article errors. In this particular case, the lack of a significant effect for native speakers should not be taken at face value as negative evidence for the effectiveness of AWE.
A number of studies comparing AWE feedback to no feedback have found significant positive effects for AWE on writing outcomes. For example, in a study by Franzke et al. (2005) on Summary Street, which used a pretest/posttest design with random assignment to an AWE or a no-feedback condition, students in both conditions wrote four texts, which were scored for quality by human raters. It was found that the AWE condition had higher holistic and content scores, both on the averaged score for the four texts and in orthogonal comparisons of the scores for the first two texts with those for the last two texts. However, many of the studies are not as well designed, and do not include a pretest or other information on the comparability of students in experimental and control groups. In particular, results of studies that have compared writing outcomes of students who received AWE with those of students in previous cohorts should be viewed with caution. For example, Grimes (2008) found that in three out of four schools students who used MY Access! had higher external test scores than students from a previous year who did not receive AWE feedback. However, the author acknowledges that it is difficult to attribute this improvement to AWE, as important improvements to the quality of the writing instruction provided by teachers were also instituted during the intervention period.

As shown by the research survey, only three studies have explicitly compared AWE feedback with teacher feedback (i.e., Frost, 2008; Rock, 2007; Warden, 2000). As the evidence from these studies is also mixed, it seems premature to draw any firm conclusions. However, it should be pointed out that none of the studies shows that AWE feedback is less effective than teacher feedback, which could be taken as a positive sign. Nonetheless, of concern is that these studies report little about the
nature of the teacher feedback given or whether this feedback was comparable to the AWE feedback. For example, in Warden (2000), an AWE condition in which students received specific error feedback is compared with a teacher feedback condition in which students received no specific feedback, but only general comments on content, organization, and grammar. As students in the teacher feedback condition received no specific feedback on the accuracy of their texts, it is hardly surprising that the number of errors decreased more in the AWE condition.

In general, there appears to be more support for improvement of error rates than for improvement of holistic scores. For example, Kellogg, Whiteford, and Quinlan (2010) found that holistic scores did not improve, but that errors were reduced. As the error types that were reduced largely related to linguistic aspects of the text, they drew the conclusion that there was tentative support for learning about mechanical aspects of writing from AWE. In contrast, Chen (1997) found that an AWE group and a no-feedback control group decreased linguistic errors equally. However, the results of this study could well be attributable to a methodological drawback, as both experimental and control groups were in the same classes. In these classes, the teachers spent time reviewing the most common error types found by the computer in the presence of all the students. Hence, both groups of students may have benefited from this instruction.

There appears to be no clear evidence as yet concerning whether AWE feedback is associated with more generalized improvements in writing proficiency. Some of the studies that have examined transfer of the effects of AWE to texts for which no AWE feedback has been provided found no significant differences between scores for AWE and non-AWE conditions (i.e., Choi, 2010; Kellogg et al., 2010; Shermis, Burstein, & Bliss, 2004). Moreover, although three studies did find evidence of transfer (Elliot & Mikulas, 2004; Grimes, 2008; Wang & Wang, 2012), none of these studies is rigorously designed. The Wang and Wang (2012) study had only one participant in each condition. The flaws in the Grimes (2008) study have already been discussed. In Elliot and Mikulas (2004), in each of four sub-studies it was claimed that AWE feedback was associated with better exam performance. However, there was no random assignment to conditions, and the reader is given no information concerning the characteristics of the participants in the two conditions. In one of the sub-studies, students' results are compared with those of students from a year 2000 baseline. In addition, results for two of the four sub-studies were not tested statistically, and those that were tested were tested non-parametrically. Also, some of the claims seem rather remarkable, such as that a group who used MY Access! between February and March of 2003 had a pass rate of 81% compared to only 46% for a group who did not receive AWE feedback. It seems rather unlikely that such a short AWE intervention could lead to such a substantial change in assessment outcomes, indicating that other factors may also have been in operation. However, it is important to be aware that whether feedback leads to any generalized improvements in students' revising ability or in the quality of their texts is also one of the big unknowns of the writing feedback that students receive from teachers.
Hyland and Hyland (2006) pointed out that research on human feedback rarely looks beyond immediate correction in a subsequent draft, so AWE research is not alone in neglecting this area. Closely connected to whether feedback can lead to generalized improvements in writing is whether it assists students in developing their ability to revise independently. One of the first steps in developing revising skills is that writers are able to notice aspects of their texts that have not, up to that point, been salient (Schmidt, 1990; Truscott, 1998). Once a feature has been noticed, it becomes available for reflection and analysis. As Hyland and Hyland (2006) pointed out, demonstrating that a student can utilize feedback to edit a draft tells us little about whether the student has successfully acquired a feature. Similarly, it tells us little about whether the student has developed the meta-cognitive skills needed to notice, and then subsequently evaluate and correct, textual problems in other texts.

Currently, we know little about whether AWE actually promotes independent revising. However, there is some evidence that receiving AWE feedback may not actually encourage students to make changes either between or within drafts. Attali (2004) reported that 71% of students did not redraft their essays and that 48% of those who did redraft did so only once. Grimes (2005) reported that a typical revision pattern for students was to submit a first draft, correct a few mechanical errors and resubmit as fast as possible to see if the score improved. Warden (2000) found that students who were offered a redrafting opportunity after receiving AWE feedback from QBL actually spent significantly less time revising their first drafts than students who received AWE feedback on a single draft with no redrafting
opportunity, or who received teacher feedback instead of AWE feedback. Students who received no redrafting opportunity revised their texts before they received any feedback. They then submitted their texts for marking, received a mark and AWE feedback, but were not given an opportunity to redraft the text. In contrast, students who received AWE feedback and had an opportunity to redraft appeared to carry out little independent editing, instead waiting for the program to tell them what was wrong with their texts and then specifically correcting these errors. While these students were successful in correcting errors detected by AWE, they made few other changes to their texts. Moreover, this trend continued across successive assignments, suggesting that AWE feedback was not leading to much development in revising skills. However, it is important to remember that these findings corroborate findings from revision research that writers, particularly younger writers, revise little and revise superficially (Faigley & Witte, 1981; Whalen & Ménard, 1995). It may be that some students simply do not possess the revising skills needed to allow them to benefit from the revision opportunities afforded by AWE.

5. Conclusions and recommendations

This critical review suggests that there is only modest evidence that AWE feedback has a positive effect on the quality of the texts that students produce using AWE, and that as yet there is little clarity about whether AWE is associated with more general improvements in writing proficiency. Paucity of research, heterogeneity of existing research, the mixed nature of research findings, and methodological issues in some of the existing research are factors that limit our ability to draw firm conclusions concerning the effectiveness of AWE feedback.

Initially, we endeavored to meta-analyze effect sizes for the product studies in this sample. However, due to methodological issues, many of the studies had to be excluded, leaving us with a very small but still highly heterogeneous sample. Heterogeneity necessitates the inclusion of moderator analyses that examine the effects of variables such as AWE program, educational context, and whether AWE feedback was compared with no feedback or with teacher feedback. However, with such a small sample, there was insufficient power to conduct moderator analyses. We felt that simply providing an overall effect size that ignores possible effects of moderator variables was not a viable or meaningful option. Instead, by carrying out a critical review we have been able to identify patterns in the existing research, as well as discuss gaps in the findings and issues in the methodologies. Below are recommendations that follow from this review and that can serve as a guideline for further research in this area.
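To make the procedure we attempted more concrete, the sketch below illustrates how standardized mean differences from individual studies would be pooled under a random-effects model before any moderator analysis could be run. It is purely illustrative: the effect sizes and variances are hypothetical values, not data from the reviewed studies, and the code is a minimal sketch of one common estimator (DerSimonian-Laird) rather than a description of any analysis reported here.

    # Illustrative sketch only: hypothetical effect sizes, not data from the reviewed studies.
    # Pools standardized mean differences (Hedges' g) with a DerSimonian-Laird
    # random-effects model, the step that would precede any moderator analysis.

    import math

    # (g, variance of g) for a handful of hypothetical AWE-vs-control comparisons
    studies = [(0.45, 0.04), (0.10, 0.06), (0.62, 0.09), (-0.05, 0.05), (0.30, 0.07)]

    def random_effects_pool(effects):
        """Return the DerSimonian-Laird pooled estimate, its standard error, and tau^2."""
        w = [1.0 / v for _, v in effects]                      # fixed-effect weights
        fe_mean = sum(wi * g for wi, (g, _) in zip(w, effects)) / sum(w)
        q = sum(wi * (g - fe_mean) ** 2 for wi, (g, _) in zip(w, effects))
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - (len(effects) - 1)) / c)          # between-study variance
        w_star = [1.0 / (v + tau2) for _, v in effects]        # random-effects weights
        pooled = sum(wi * g for wi, (g, _) in zip(w_star, effects)) / sum(w_star)
        se = math.sqrt(1.0 / sum(w_star))
        return pooled, se, tau2

    pooled, se, tau2 = random_effects_pool(studies)
    print(f"pooled g = {pooled:.2f}, "
          f"95% CI = [{pooled - 1.96 * se:.2f}, {pooled + 1.96 * se:.2f}], tau^2 = {tau2:.3f}")
    # Splitting five studies into moderator subgroups (e.g., no-feedback vs. teacher-feedback
    # controls) leaves only two or three studies per subgroup, so the subgroup estimates are
    # too imprecise to support meaningful moderator comparisons.

As the closing comment indicates, with so few usable studies the subgroup estimates needed for moderator comparisons become too imprecise to interpret, which is why a single overall effect size was not a meaningful option for this sample.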
Although this review has not allowed us to differentiate the effectiveness of specific AWE programs, given differences in the objectives of the programs and the nature of the feedback provided, it is likely that such differences do exist. So far, more research on the effects of AWE has been carried out for Criterion than for other programs. Therefore, more studies examining other programs are called for, and in particular studies comparing the effectiveness of more than one AWE program.

A number of the studies provided only sketchy descriptions of their participants in terms of factors such as socioeconomic status, language background, literacy levels, and computer literacy. Future research needs to be more rigorous in reporting participant characteristics, in controlling for participant variables and, where appropriate, in including these as variables in the research design. In particular, further research is needed that examines the effectiveness of AWE feedback in ESL and EFL settings, and compares these to L1 settings. Given the tremendous diversity of student populations within the United States, not to mention the diversity in potential markets for AWE programs in both English-speaking and EFL contexts outside the United States, it is of particular importance that the effectiveness of AWE feedback for second language learners be investigated. The commercial programs in use in the United States were not originally designed for English as a second language populations, even though they are being marketed with such populations in mind (Warschauer & Ware, 2006).

In addition, further research examining the relative effects of AWE feedback and teacher feedback is needed, in which greater explanation of the nature and quality of the feedback provided by teachers is given and in which it is ensured that the kinds of feedback offered by teachers and by AWE programs are more comparable. As there are so many factors in play, it is likely to turn out to be too simplistic to make overall pronouncements about whether human feedback or computer feedback is better.
What needs to be disentangled is whether it really is the source of the feedback that matters, or whether it is other factors, such as the way the feedback is delivered and the nature of the feedback provided, that make the difference. It is also important to be aware that, as developers and researchers alike frequently reiterate, AWE feedback is intended to augment teacher feedback rather than replace it (e.g., Chen & Cheng, 2008; Kellogg et al., 2010; Philips, 2007); research into the relative effects of different ways of integrating AWE feedback into classroom writing instruction may therefore have greater ecological validity. In a qualitative study involving the use of AWE feedback in three classrooms, Chen and Cheng (2008) found indications that AWE feedback may indeed be more effective when it is combined with human feedback. However, this study did not examine the effects of different methods of integration on written production. There are a variety of possible ways of combining AWE with teacher feedback, and of scaffolding AWE feedback. To name just a few, students can use AWE to help them improve the quality of initial drafts and then submit these to the teacher for feedback, teachers can use AWE as a diagnostic tool for identifying the problems that students have with their writing, and/or teachers can provide initial training. Research that investigates different possibilities for integrating AWE into classroom writing instruction would also be of pedagogical value.

Some might argue that, in terms of the effectiveness of AWE feedback, the bottom line is whether the scores it generates correlate with external assessment outcomes and whether its repeated use in the classroom improves students' test results. However, while it is highly desirable that the transfer of the effects of AWE feedback to non-AWE texts be established, it is questionable whether external exams provide the most appropriate means of doing so. Firstly, as Warschauer and Ware (2006) remark, exam writing is generally based on a single draft written under timed circumstances, whereas the whole point of AWE is that it encourages multiple drafting. Secondly, the scoring on exams may be too far removed from the aspects for which AWE provides feedback. Thirdly, AWE feedback may not be robust enough as an instructional intervention to impact noticeably on exam scores. Instead, we would recommend examining transfer of the effects of AWE feedback in non-test situations, using texts that are similar in terms of genre and topic to the AWE texts students have been writing.

The question remains, of course, whether the kinds of writing that AWE feedback gives writers the opportunity to engage in actually reflect the kinds of writing that students do in their classrooms. AWE programs generally offer only a limited number of genres, such as persuasive, narrative and informative genres, though some programs, such as MY Access!, additionally enable teachers to use their own prompts (see Grimes & Warschauer, 2010). Moreover, as mentioned, AWE has been accused of promoting formulaic writing with an unimaginative five-paragraph structure. The way lies open for AWE research to include a greater consideration of genre by controlling for genre as a variable, and by systematically examining the influence of genre on the effectiveness of AWE feedback, for example, by comparing the effects of AWE when standard prompts are used with the effects when teachers' own prompts are used.
In conclusion, this study has carried out a critical review of research that examines the effects of formative AWE feedback on the quality of the texts that students produce. It has illuminated what is known and what is not known about the effects of AWE feedback on writing. It could be argued that a limitation of the study is that it takes a narrow view of effectiveness in terms of a single dimension: written production measures. It does not focus on either of the other two dimensions of effectiveness identified by Lai (2010): the effects on writing processes or perceived usefulness. However, we feel that Lai's first dimension is an appropriate and valuable focal point for a critical review, because improving students' writing is central to the objectives of AWE and to claims regarding its effectiveness, both of which are reflected in the fact that, as this study has shown, the bulk of research conducted so far focuses on written production. We certainly applaud AWE research that takes a triangulated approach by incorporating the effects of AWE on written production (product perspective), on revision processes and learning and teaching processes (process perspective), and on writers' and teachers' perceptions (perception perspective) (e.g., Choi, 2010; Grimes, 2008). We would also join in the plea made by Liu et al. (2002) concerning research on computer-based technology: "rather than focusing on the benefits and potentials of computer technology, research needs to move toward explaining how computers can be used to support (second) language learning – i.e., what kind of tasks or activities should be used and in what kinds of settings" (pp. 26–27). Consequently, as the next step,
in a follow-up study we will examine the use of AWE feedback in the classroom, including teaching and learning processes and teacher and learner perceptions.

Research survey sample

(References marked with an asterisk indicate studies that examine solely or partially the effects of AWE on writing outcomes, and which have therefore been included in the critical review.)

*Attali, Y. (2004). Exploring feedback and revision features of Criterion. Paper presented at the National Council on Measurement in Education, San Diego, April 12–16, 2004.
*Chen, J. F. (1997). Computer generated error feedback and writing process: A link [Electronic version]. TESL-EJ, 2. Retrieved from http://tesl-ej.org/ej07/a1.html
Chen, C. E., & Cheng, W. (2008). Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Language Learning and Technology, 12(2), 94–112.
*Chodorow, M., Gamon, M., & Tetreault, J. (2010). The utility of article and preposition error correction systems for English language learners: Feedback and assessment. Language Testing, 27(3), 419–436.
*Choi, J. (2010). The impact of automated essay scoring (AES) for improving English language learners' essay writing. (Doctoral dissertation, University of Virginia, 2010).
*El Ebyary, K., & Windeatt, S. (2010). The impact of computer-based feedback on students' written work. International Journal of English Studies, 10(2), 121–142.
*Elliot, S., & Mikulas, C. (2004). The impact of MY Access! use on student writing performance: A technology overview and four studies. Paper presented at the Annual Meeting of the American Educational Research Association.
*Foltz, P. W., Laham, D., & Landauer, T. K. (1999). The intelligent essay assessor: Applications to educational technology. Interactive Multimedia Educational Journal of Computer-Enhanced Learning, 1(2). Retrieved from www.knowledge-technologies.com
*Franzke, M., Kintsch, E., Caccamise, D., & Johnson, N. (2005). Summary Street: Computer support for comprehension and writing. Journal of Educational Computing Research, 33(1), 53–80.
*Frost, K. L. (2008). The effects of automated essay scoring as a high school classroom intervention, PhD thesis. Las Vegas: University of Nevada.
*Grimes, D. C. (2008). Middle school use of automated writing evaluation: A multi-site case study, PhD thesis. Irvine: University of California.
*Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of automated writing evaluation. The Journal of Technology, Learning and Assessment, 8(6), 1–43.
*Kellogg, R., Whiteford, A., & Quinlan, T. (2010). Does automated feedback help students learn to write? Journal of Educational Computing Research, 42, 173–196.
Lai, Y.-H. (2010). Which do students prefer to evaluate their essays: Peers or computer program. British Journal of Educational Technology, 41(3), 432–454.
*Riedel, E., Dexter, S. L., Scharber, C., & Doering, A. (2006). Experimental evidence on the effectiveness of automated essay scoring in teacher education cases. Journal of Educational Computing Research, 35(3), 267–287.
*Rock, J. (2007). The impact of short-term use of Criterion on writing skills in 9th grade (Research Report RR-07-07). Princeton, NJ: Educational Testing Service.
Scharber, C., Dexter, S., & Riedel, E. (2008). Students' experiences with an automated essay scorer. The Journal of Technology, Learning and Assessment, 7(1), 1–44.
*Shermis, M. D., Burstein, J., & Bliss, L. (2004). The impact of automated essay scoring on high stakes writing assessments. Paper presented at the Annual Meeting of the National Council on Measurement in Education.
*Shermis, M., Garvan, C. W., & Diao, Y. (2008). The impact of automated essay scoring on writing outcomes. Paper presented at the Annual Meeting of the National Council on Measurement in Education, March 25–27, 2008.
*Schroeder, J. A., Grohe, B., & Pogue, R. (2008). The impact of Criterion writing evaluation technology on criminal justice student writing skills. Journal of Criminal Justice Education, 19(3), 432–445.
*Wang, F., & Wang, S. (2012). A comparative study on the influence of automated evaluation system and teacher grading on students' English writing. Procedia Engineering, 29, 993–997.
*Warden, C. A. (2000). EFL business writing behavior in differing feedback environments. Language Learning, 50(4), 573–616.
Warden, C. A., & Chen, J. F. (1995). Improving feedback while decreasing teacher burden in ROC ESL business English classes. In P. Porythiaux, T. Boswood, & B. Babcock (Eds.), Explorations in English for professional communications. Hong Kong: City University of Hong Kong.

Other references

Anson, C. M. (2006). Can't touch this: Reflections on the servitude of computers as readers. In P. Freitag Ericsson, & R. Haswell (Eds.), Machine scoring of student essays (pp. 38–56). Logan, Utah: Utah State University Press.
Biber, D., Nekrasova, T., & Horn, B. (2011). The effectiveness of feedback for L1-English and L2-writing development: A meta-analysis (ETS Research Report RR-11-05). Princeton, NJ: ETS.
Burstein, J., Chodorow, M., & Leacock, C. (2004). Automated essay evaluation: The Criterion online writing service. AI Magazine (Fall), 27–36.
Chandrasegaran, A., Ellis, M., & Poedjosoedarmo, G. (2005). Essay assist: Developing software for writing skills improvement in partnership with students. RELC Journal, 36(2), 137–155.
Dikli, S. (2006). An overview of automated scoring of essays. The Journal of Technology, Learning and Assessment, 5(1), 1–35.
Faigley, L., & Witte, S. (1981). Analyzing revision. College Composition and Communication, 32, 400–414.
Freitag Ericsson, P. (2006). The meaning of meaning. In P. Freitag Ericsson, & R. Haswell (Eds.), Machine scoring of student essays. Logan, Utah: Utah State University Press.
Grimes, D. (2005). Assessing automated assessment: Essay evaluation software in the classroom. Paper presented at the Computers and Writing Conference, Stanford, CA.
Herrington, A., & Moran, C. (2001). What happens when machines read our students' writing? College English, 63(4), 480–499.
Hyland, K., & Hyland, F. (2006). Feedback on second language students' writing. Language Teaching, 39, 83–101.
Patterson, N. (2005). Computerized writing assessment: Technology gone wrong. Voices From the Middle, 13(2), 56–57.
Philips, S. M. (2007). Automated essay scoring: A literature review (SAEE Research Series #30). Kelowna, BC: Society for the Advancement of Excellence in Education.
Schmidt, R. W. (1990). The role of consciousness in second language learning. Applied Linguistics, 11(2), 129–158.
Shermis, M. D., & Burstein, J. (Eds.). (2003). Automated essay scoring: A cross-disciplinary perspective. Hillsdale, NJ: Lawrence Erlbaum Associates.
Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. New York and London: Routledge.
Taylor, A. R. (2005). A future in the process of arrival: Using computer technologies for the assessment of learning. TASA Institute, Society for the Advancement of Excellence in Education.
Truscott, J. (1996). The case against grammar correction in L2 writing classes. Language Learning, 46(2), 327–369.
Truscott, J. (1998). Noticing in second language acquisition: A critical review. Second Language Research, 14(2), 103–135.
Warschauer, M., & Ware, J. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2), 1–24.
Whalen, K., & Ménard, N. (1995). L1 and L2 writers' strategic and linguistic knowledge: A model of multiple-level discourse processing. Language Learning, 44(3), 381–418.
Yang, Y., Buckendahl, C. W., Juszkiewicz, P. J., & Bhola, D. S. (2002). A review of strategies for validating computer-automated scoring. Applied Measurement in Education, 15(4), 391–412.
Zellermayer, M., Salomon, G., Globerson, T., & Givon, H. (1991). Enhancing writing-related metacognitions through a computerized writing partner. American Educational Research Journal, 28(2), 373–391.

Further reading

*Britt, A., Wiemer-Hastings, P., Larson, A., & Perfetti, C. (2004). Using intelligent feedback to improve sourcing and integration in students' essays. International Journal of Artificial Intelligence in Education, 14, 359–374.
Dikli, S. (2007). Automated essay scoring in an ESL setting. (Doctoral dissertation, Florida State University, 2007).
*Hoon, T. (2006). Online automated essay assessment: Potentials for writing development. Retrieved from http://ausweb.scu.edu.au/aw06/papers/refereed/tan3/paper.html
*Lee, C., Wong, K. C. K., Cheung, W. K., & Lee, F. S. L. (2009). Web-based essay critiquing system and EFL students' writing: A quantitative and qualitative investigation. Computer Assisted Language Learning, 22(1), 57–72.
*Matsumoto, K., & Akahori, K. (2008). Evaluation of the use of automated writing assessment software. In C. Bonk et al. (Eds.), Proceedings of World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education 2008 (pp. 1827–1832). Chesapeake, VA: AACE.
*Schreiner, M. E. (2002). The role of automatic feedback in the summarization of narrative text, PhD thesis. University of Colorado.
*Steinhart, D. J. (2001). An intelligent tutoring system for improving student writing through the use of latent semantic analysis. Boulder: University of Colorado.
Wade-Stein, D., & Kintsch, E. (2004). Summary Street: Interactive computer support for writing. Cognition and Instruction, 22(3), 333–362.
Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom. Pedagogies: An International Journal, 3, 22–36.
Yao, Y. C., & Warden, C. A. (1996). Process writing and computer correction: Happy wedding or shotgun marriage? [Electronic version]. CALL Electronic Journal. Available at http://www.lerc.ritsumei.ac.jp/callej/1-1/Warden1.html.