2. Introduction
Overview of automated scoring approaches
Writing examples: E-rater, Criterion (ETS)
Oral examples: Versant, Versant Junior (Pearson)
Impact on teaching: Benefits
Impact on teaching: Dangers
Thoughts for the future
3. What is automated scoring?
Computer software that automatically assigns scores to writing or speaking samples
Essays can be assigned scores instantly by computer
Test takers can call a testing center and take an oral test without speaking to a human
Scores reported instantly
Some level of feedback given to test takers
Variety of software approaches
4. How does a computer grade a test?
Approach #1: Natural Language Processing (NLP)
Software identifies and counts linguistic features
Software does not attempt to gauge content in any way
Used for testing writing
Approach #2: Speech Recognition
Software compares the speech sample to a large database of samples of the same test question(s)
Faster responses are “more fluent”, etc.
Used for testing speaking
5. Example 1: E-rater (ETS)
Automated scoring of timed essays
Uses NLP
Currently used in a limited way to rate:
TOEFL
GRE
Used for formative assessment (Criterion, ScoreItNow!, TOEFL Practice Online)
Individual assessment
Students turn in essays, receive scores, revise, repeat
6. What does E-rater do with an essay?
Global measures:
Count total words, total sentences, sentence length, # of paragraphs
Vocabulary measures:
# of unique words used ÷ total words (lexical diversity)
# of low-frequency words (lexical depth)
# of prompt-specific words (topic appropriateness)
7. What does E-rater do with an essay? #2
Grammatical measures:
Dependent and independent clauses
Passive voice
Subject-verb agreement, etc.
Other measures:
Sequencing words (then, next, etc.)
Logical relations (as a result, however, etc.)
Mechanics (punctuation, etc.)
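The surface measures on slides 6–7 can be sketched as a simple feature counter. This is an illustrative approximation only, not ETS's actual implementation: the word lists, the treatment of low-frequency words, and the feature names are invented for the example, and real connectives like "as a result" would need phrase matching rather than single-word lookup.

```python
import re

# Invented marker lists for illustration (not E-rater's real feature set)
SEQUENCING = {"first", "then", "next", "finally"}
LOGICAL = {"however", "therefore", "moreover"}

def essay_features(text, prompt_words, common_words):
    """Count surface features in the spirit of E-rater's measures."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = len(words)
    return {
        "total_words": total,
        "total_sentences": len(sentences),
        "avg_sentence_length": total / max(len(sentences), 1),
        # lexical diversity: unique words ÷ total words
        "lexical_diversity": len(set(words)) / max(total, 1),
        # lexical depth: words outside a common-word list
        "lexical_depth": sum(w not in common_words for w in words),
        # topic appropriateness: overlap with prompt-specific words
        "topic_words": sum(w in prompt_words for w in words),
        "sequencing": sum(w in SEQUENCING for w in words),
        "logical": sum(w in LOGICAL for w in words),
    }

feats = essay_features("First we plan. Then we write.",
                       prompt_words={"write"}, common_words={"we"})
```

Even this toy counter shows why "longer is always better": every global measure grows with essay length, while nothing in it reads the content.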
8. What is a “good” essay according to E-rater?
Long (longer is always better for E-rater)
Standard structure
Longer sentences, many dependent clauses
Many explicit organizational words
Obscure vocabulary
Indubitably > Surely
Obfuscate profusely > Lie a lot
Wide range of vocabulary
9. What does E-rater not notice?
“Teaching assistants are paid an excessive amount of money. The average teaching assistant makes six times as much money as college presidents. In addition, they often receive a plethora of extra benefits such as private jets, vacations in the south seas, a staring roles in motion pictures. Moreover, in the Dickens novel Great Expectation, Pip makes his fortune by being a teaching assistant.” (Perelman 2012)
10. Criterion: E-rater application
Designed for in-class use
Students’ essays are instantly scored using E-rater software
Essays get individualized feedback on errors and style
Students directed to materials for self-study and revision of essay
Process repeated
Used in many schools worldwide
12. Example 2: Versant (Pearson)
First fully automated oral language test used commercially
Developed by Ordinate Corp., later bought by Pearson
Test is taken in a computer lab using a microphone and headset, or over the telephone
Computer automatically rates the speech and produces scores
Used widely in business, increasingly in schools
Many versions, multiple uses and languages
13. What is a Versant test like?
About 15 minutes long
Several simple task types:
Repeating sentences
Scrambled sentences
“Oral multiple choice”
All responses totally scripted
Optional “Free response” final question
Not scored, but saved for reference
16. What does Versant do with speech?
Test takers’ speech is captured by a microphone and processed on a computer server
This speech is “compared to” a large database of human-scored responses:
Native speakers from different countries
English learners from different countries, at all proficiency levels
Scores given in the range of the “most similar” responses to the test taker’s
Scores available immediately
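The idea of scoring by similarity to a human-scored database can be sketched as a toy nearest-neighbor lookup. Everything here is invented for illustration: the feature vector (response latency, fraction of expected words matched), the sample database, and the averaging rule. Pearson's actual models are proprietary and far more sophisticated.

```python
def similarity_score(response, scored_database, k=3):
    """Average the human scores of the k most similar stored responses.

    Each response is a feature vector, e.g.
    (latency in seconds, fraction of expected words matched).
    """
    def distance(a, b):
        # Euclidean distance between feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    ranked = sorted(scored_database,
                    key=lambda item: distance(item[0], response))
    nearest = ranked[:k]
    return sum(score for _, score in nearest) / len(nearest)

# Hypothetical human-scored database: (features, score on an 80-point scale)
db = [
    ((0.4, 0.95), 78),  # fast, accurate response -> high score
    ((0.6, 0.90), 72),
    ((1.5, 0.60), 45),
    ((2.5, 0.30), 25),  # slow, inaccurate response -> low score
]
```

A new fast, accurate response lands near the high-scoring neighbors and inherits a score in their range, which is why response speed and exact repetition weigh so heavily in the slides that follow.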
17. What is a “good” Versant response?
Fast response (fluency score)
Clear
Accurate (the sentence is repeated exactly, etc.) (sentence mastery + vocabulary)
Native-like pronunciation (pronunciation score)
We talk about Global English nowadays!
“Comprehensibility” is more important than native-like speech (Celce-Murcia, Brinton, & Goodwin 2010)
18. What does Versant NOT measure?
Range of vocabulary
Extended speaking
Pragmatics, cultural awareness
Ability to interact with others
19. What are some advantages of these systems?
Reliability
Computers do not get tired
Computers are not biased for or against individuals
Scores are more consistent than with human raters (Bernstein, Van Moere, & Cheng 2010)
Practicality
Automated scoring is much less expensive than human rating
Scores and feedback obtained instantly
20. What does research show?
When test takers are acting “in good faith”, scores are roughly equivalent to those of human raters
Bridgeman et al. (2005): E-rater scores are similar to humans’ for most nationalities
Bernstein, Van Moere, & Cheng (2010), Van Moere (2012): Versant scores correspond closely to scores from interview assessments
*Even though the final scores are very similar, the tests do not actually measure the same things (Chun 2006)
21. Problems with automated scores
Automated tests can be “gamed” or tricked
Farnsworth (2013): Versant scores can be quickly raised by coaching, but similar results found with an interview assessment
Monaghan & Bridgeman (2005): E-rater scores cannot be used without human raters for “real” testing (TOEFL, etc.)
22. How does this positively affect teaching?
Writing feedback: Students may get more (and faster) feedback on:
Grammar errors in writing
Lexical errors in writing
Oral feedback: Teachers may be able to assess students’ speaking skills more often
23. How might this negatively affect teaching?
Washback: the effect of testing on instructional practice (Wall 1999, Bachman & Palmer 1996)
Teachers tend to focus on what is tested (Bailey 1999)
What is tested differs under automated scoring
Mismatch between current ideas in Communicative Language Teaching and automated scoring
24. Effects on writing instruction
Increased focus on grammatical accuracy and low-frequency vocabulary
Heavy focus on traditional essay structure and devices
Decreased focus on quality of content, selection of examples, style, etc.
“Use a lot of high-level vocabulary, make sentences longer, mimic conventional thinking on the topic”
25. Effects on oral instruction
Increased focus on oral repetition and word-level pronunciation
Increased focus on speed of response
Decreased focus on pragmatic / cultural components of language
Decreased focus on critical thinking
26. Maybe this is a good thing?
Some argue that we should return to a greater focus on structure, vocabulary, speed, and pronunciation (Van Moere 2012a, 2012b)
A focus on grammatical forms and linguistic structures is certainly beneficial
Students consistently express a desire for direct instruction on fundamentals
27. Conclusion
Computer-scored testing is in all our futures
Provides compelling practical benefits
Students benefit from frequent feedback on grammar and vocabulary
Does not (cannot) measure the same things as human raters measure
Great danger of limiting instruction and curriculum to grammar, vocabulary, speed, and pronunciation