2. Introduction
Overview of automated scoring approaches
Writing examples: E-rater, Criterion (ETS)
Oral examples: Versant, Versant Junior (Pearson)
Impact on teaching: Benefits
Impact on teaching: Dangers
Thoughts for the future
3. What is automated scoring?
Computer software that automatically assigns scores to writing or speaking samples
Essays can be assigned scores instantly by computer
Test takers can call a testing center and take an oral test without speaking to a human
Scores reported instantly
Some level of feedback given to test takers
Variety of software approaches
4. How does a computer grade a test?
Approach #1: Natural Language Processing (NLP)
Software identifies and counts linguistic features
Software does not attempt to gauge content in any way
Used for testing writing
Approach #2: Speech Recognition
Software compares the speech sample to a large database of samples of the same test question(s)
Faster responses are “more fluent”, etc.
Used for testing speaking
5. Example 1: E-rater (ETS)
Automated scoring of timed essays
Uses NLP
Currently used in a limited way to rate:
TOEFL
GRE
Used for formative assessment (Criterion, ScoreItNow!, TOEFL Practice Online)
Individual assessment
Students turn in essays, receive scores, revise, repeat
6. What does E-rater do with an essay?
Global measures:
Count total words, total sentences, sentence length, # of paragraphs
Vocabulary measures:
# of unique words used ÷ total words (lexical diversity)
# of low-frequency words (lexical depth)
# of prompt-specific words (topic appropriateness)
7. What does E-rater do with an essay? #2
Grammatical measures:
Dependent and independent clauses
Passive voice
Subject-verb agreement, etc.
Other measures:
Sequencing words (then, next, etc.)
Logical relations (as a result, however, etc.)
Mechanics (punctuation, etc.)
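The surface measures on slides 6–7 can be sketched as a simple feature counter. This is an illustrative approximation only, not ETS's actual implementation: the word lists, the treatment of low-frequency words, and the feature names are invented for the example, and real connectives like "as a result" would need phrase matching rather than single-word lookup.

```python
import re

# Invented marker lists for illustration (not E-rater's real feature set)
SEQUENCING = {"first", "then", "next", "finally"}
LOGICAL = {"however", "therefore", "moreover"}

def essay_features(text, prompt_words, common_words):
    """Count surface features in the spirit of E-rater's measures."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = len(words)
    return {
        "total_words": total,
        "total_sentences": len(sentences),
        "avg_sentence_length": total / max(len(sentences), 1),
        # lexical diversity: unique words ÷ total words
        "lexical_diversity": len(set(words)) / max(total, 1),
        # lexical depth: words outside a common-word list
        "lexical_depth": sum(w not in common_words for w in words),
        # topic appropriateness: overlap with prompt-specific words
        "topic_words": sum(w in prompt_words for w in words),
        "sequencing": sum(w in SEQUENCING for w in words),
        "logical": sum(w in LOGICAL for w in words),
    }

feats = essay_features("First we plan. Then we write.",
                       prompt_words={"write"}, common_words={"we"})
```

Even this toy counter shows why "longer is always better": every global measure grows with essay length, while nothing in it reads the content.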
8. What is a “good” essay according to E-rater?
Long (longer is always better for E-rater)
Standard structure
Longer sentences, many dependent clauses
Many explicit organizational words
Obscure vocabulary
Indubitably > Surely
Obfuscate profusely > Lie a lot
Wide range of vocabulary
9. What does E-rater not notice?
“Teaching assistants are paid an excessive amount of money. The average teaching assistant makes six times as much money as college presidents. In addition, they often receive a plethora of extra benefits such as private jets, vacations in the south seas, a staring roles in motion pictures. Moreover, in the Dickens novel Great Expectation, Pip makes his fortune by being a teaching assistant.” (Perelman 2012)
10. Criterion: E-rater application
Designed for in-class use
Students’ essays are instantly scored using E-rater software
Essays get individualized feedback on errors and style
Students directed to materials for self-study and revision of essay
Process repeated
Used in many schools worldwide
12. Example 2: Versant (Pearson)
First fully automated oral language test used commercially
Developed by Ordinate Corp., later bought by Pearson
Test is taken in a computer lab using a microphone and headset, or over the telephone
Computer automatically rates the speech and produces scores
Used widely in business, increasingly in schools
Many versions, multiple uses and languages
13. What is a Versant test like?
About 15 minutes long
Several simple task types:
Repeating sentences
Scrambled sentences
“Oral multiple choice”
All responses totally scripted
Optional “Free response” final question
Not scored, but saved for reference
16. What does Versant do with speech?
Test takers’ speech is captured by a microphone and processed on a computer server
This speech is “compared to” a large database of human-scored responses:
Native speakers from different countries
English learners from different countries, at all proficiency levels
Scores given in the range of the “most similar” responses to the test taker’s
Scores available immediately
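The idea of scoring by similarity to a human-scored database can be sketched as a toy nearest-neighbor lookup. Everything here is invented for illustration: the feature vector (response latency, fraction of expected words matched), the sample database, and the averaging rule. Pearson's actual models are proprietary and far more sophisticated.

```python
def similarity_score(response, scored_database, k=3):
    """Average the human scores of the k most similar stored responses.

    Each response is a feature vector, e.g.
    (latency in seconds, fraction of expected words matched).
    """
    def distance(a, b):
        # Euclidean distance between feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    ranked = sorted(scored_database,
                    key=lambda item: distance(item[0], response))
    nearest = ranked[:k]
    return sum(score for _, score in nearest) / len(nearest)

# Hypothetical human-scored database: (features, score on an 80-point scale)
db = [
    ((0.4, 0.95), 78),  # fast, accurate response -> high score
    ((0.6, 0.90), 72),
    ((1.5, 0.60), 45),
    ((2.5, 0.30), 25),  # slow, inaccurate response -> low score
]
```

A new fast, accurate response lands near the high-scoring neighbors and inherits a score in their range, which is why response speed and exact repetition weigh so heavily in the slides that follow.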
17. What is a “good” Versant response?
Fast response (fluency score)
Clear
Accurate (the sentence is repeated exactly, etc.) (sentence mastery + vocabulary)
Native-like pronunciation (pronunciation score)
We talk about Global English nowadays!
“Comprehensibility” is more important than native-like speech (Celce-Murcia, Brinton, & Goodwin 2010)
18. What does Versant NOT measure?
Range of vocabulary
Extended speaking
Pragmatics, cultural awareness
Ability to interact with others
19. What are some advantages of these systems?
Reliability
Computers do not get tired
Computers are not biased for or against individuals
Scores are more consistent than with human raters (Bernstein, Van Moere, & Cheng 2010)
Practicality
Automated scoring is much less expensive than human rating
Scores and feedback obtained instantly
20. What does research show?
When test takers are acting “in good faith”, scores are roughly equivalent to those of human raters
Bridgeman et al. (2005): E-rater scores are similar to humans’ for most nationalities
Bernstein, Van Moere, & Cheng (2010), Van Moere (2012): Versant scores correspond closely to scores from interview assessments
*Even though the final scores are very similar, the tests do not actually measure the same things (Chun 2006)
21. Problems with automated scores
Automated tests can be “gamed” or tricked
Farnsworth (2013): Versant scores can be quickly raised by coaching, but similar results found with an interview assessment
Monaghan & Bridgeman (2005): E-rater scores cannot be used without human raters for “real” testing (TOEFL, etc.)
22. How does this positively affect teaching?
Writing feedback: Students may get more (and faster) feedback on:
Grammar errors in writing
Lexical errors in writing
Oral feedback: Teachers may be able to assess students’ speaking skills more often
23. How might this negatively affect teaching?
Washback: the effect of testing on instructional practice (Wall 1999, Bachman & Palmer 1996)
Teachers tend to focus on what is tested (Bailey 1999)
What is tested differs under automated scoring
Mismatch between current ideas in Communicative Language Teaching and automated scoring
24. Effects on writing instruction
Increased focus on grammatical accuracy and low-frequency vocabulary
Heavy focus on traditional essay structure and devices
Decreased focus on quality of content, selection of examples, style, etc.
“Use a lot of high-level vocabulary, make sentences longer, mimic conventional thinking on the topic”
25. Effects on oral instruction
Increased focus on oral repetition and word-level pronunciation
Increased focus on speed of response
Decreased focus on pragmatic / cultural components of language
Decreased focus on critical thinking
26. Maybe this is a good thing?
Some argue that we should return to a greater focus on structure, vocabulary, speed, and pronunciation (Van Moere 2012a, 2012b)
A focus on grammatical forms and linguistic structures is certainly beneficial
Students consistently express a desire for direct instruction on fundamentals
27. Conclusion
Computer-scored testing is in all our futures
Provides compelling practical benefits
Students benefit from frequent feedback on grammar and vocabulary
Does not (cannot) measure the same things as human raters measure
Great danger of limiting instruction and curriculum to grammar, vocabulary, speed, and pronunciation