Evaluation of the reliability for L2 speech rating in discourse completion test (Methoken in Seoul)

Evaluation of the reliability for L2 speech rating in discourse completion test
Yusuke Kondo and Yutaka Ishii

Prediction method used in automated scoring system for L2
[Figure: prediction of scores (0 or 1) for "Item x"]
Predictors
• Speech rate
• Pitch range
• Mean length of utterance

Predictor examination
[Figure: two plots of scores 0 and 1. "Good predictors": Index A and Index B. "Bad predictors": Index C and Index D.]
When we try to predict scores using two indices …,

Unreliable rating
[Figure: two plots on Index A and Index B, "The first rating" and "The second rating", with points marked = 0 and = 1]

Ishii and Kondo (2015)
[Figure: .27 for our own ratings; .57 for the ratings in Narita (2013)]

Agreement of automated scoring with raters
Group                     Correlation  % Exact Agreement  % Adjacent Agreement  Kappa  Weighted Kappa
Naïve                     .77          41                 89                    .27    .75
Untrained                 .61          31                 73                    .16    .59
Certificated (Average)    .92          70                 99                    .62    .91
Certificated (Exemplary)  .95          80                 100                   .76    .94
Source: Powers, Escoffery, and Duchnowski (2015), Applied Measurement in Education
Untrained < Naïve < Certificated (Average) < Certificated (Exemplary)
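As a rough illustration of the indices in the table above, the sketch below computes exact agreement, exact-or-adjacent agreement, Cohen's kappa, and a weighted kappa for two sets of scores. The function name `agreement_stats`, the quadratic weights, the reading of "adjacent" as "within one point", and the example scores are all assumptions made for illustration; none of them is taken from Powers, Escoffery, and Duchnowski (2015).

```python
from collections import Counter


def agreement_stats(a, b, categories):
    """Return (exact, adjacent, kappa, weighted_kappa) for two score lists.

    `categories` must list the ordered score points, e.g. [0, 1, 2, 3].
    """
    n = len(a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Joint and marginal counts over category positions.
    joint = Counter((index[x], index[y]) for x, y in zip(a, b))
    pa = Counter(index[x] for x in a)
    pb = Counter(index[y] for y in b)

    # Proportion of identical scores, and of scores at most one point apart.
    exact = sum(v for (i, j), v in joint.items() if i == j) / n
    adjacent = sum(v for (i, j), v in joint.items() if abs(i - j) <= 1) / n

    # Cohen's kappa: exact agreement corrected for chance agreement.
    expected = sum(pa[i] * pb[i] for i in range(k)) / n ** 2
    kappa = (exact - expected) / (1 - expected)

    # Quadratic weights (an assumption): cost grows with squared distance.
    def weight(i, j):
        return (i - j) ** 2 / (k - 1) ** 2

    observed_cost = sum(weight(i, j) * v for (i, j), v in joint.items()) / n
    expected_cost = sum(
        weight(i, j) * pa[i] * pb[j] for i in range(k) for j in range(k)
    ) / n ** 2
    weighted_kappa = 1 - observed_cost / expected_cost
    return exact, adjacent, kappa, weighted_kappa


# Hypothetical human and machine scores on a 0-3 scale.
human = [0, 1, 2, 3, 2, 1, 3, 2]
machine = [0, 1, 2, 2, 2, 2, 3, 1]
print(agreement_stats(human, machine, [0, 1, 2, 3]))
```

In this made-up example every disagreement is only one point wide, so the weighted kappa comes out well above the unweighted one, the same direction of difference as between the Kappa and Weighted Kappa columns in the table.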

Comes as no surprise
• Reliable rating is absolutely essential for constructing an automated scoring system.

Then,
• How do we evaluate reliability in L2 performance?
• What index should be used?

Outline
• Reliability indices in L2 performance assessment
• Reliability indices in psychometrics
• Observation of reliability indices
• Some comments and suggestions

Language Testing, volumes 30-32
• Reliability indices used
1. Cronbach’s alpha
2. Percentage of agreements
3. Cohen’s kappa
4. Spearman rank correlation coefficient
5. Pearson correlation coefficient
6. Infit and Outfit measures (IRT)
7. Root-mean-square deviation

Alpha in rating data
• Bachman (2004): “coefficient alpha should be used.”
• Bachman’s recommendation is repeated in Carr (2011) and Sawaki (2013).

Journals on psychometrics
• Reliability indices discussed
1. Polychoric correlation coefficient
2. McDonald’s omega
3. Intraclass correlation coefficient
4. Standard deviation of correlation coefficients
5. Means of correlation coefficients

Next,
• We will look at how the reliability indices behave in our rating data.

Data
• 30 different discourse completion tasks, each completed by 44 to 60 university students.
• Each utterance was rated by three different raters.

Example
When you (A) want to ask your friend about their weekend, what would you say in the conversation below?
A: ( )
B: We went shopping.

Rating criteria
Score  Description
3      Can understand the speaker’s intention. Natural pronunciation and intonation. Almost no foreign accent.
2      Can understand the speaker’s intention, but some foreign accent is noticeable.
1      Cannot understand the speaker’s intention because of a strong foreign accent.
0      Cannot catch the utterance because of a low voice or noise.

Target indices
• Cronbach’s alpha
• McDonald’s omega
• Mean of correlation coefficients
– Each of the three above is computed from Kendall, Spearman, Pearson, and polychoric correlation coefficients
• Fleiss’ kappa
• Percentage of exact and adjacent agreement
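To make the correlation-based indices concrete, here is a minimal sketch that computes the mean of the inter-rater correlation coefficients and the standardized form of Cronbach's alpha from Pearson and from Spearman correlations. It assumes alpha is taken from the correlation matrix (standardized alpha); the function names and the ratings are hypothetical. Kendall and polychoric correlations, and McDonald's omega, are left out because they need more machinery than fits a short example.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_r_and_alpha(raters, corr):
    """raters: one list of scores per rater, aligned by utterance."""
    rs = [corr(a, b) for a, b in combinations(raters, 2)]
    k, r_bar = len(raters), mean(rs)
    alpha = k * r_bar / (1 + (k - 1) * r_bar)  # standardized alpha
    return r_bar, alpha


# Hypothetical scores (0-3) from three raters on eight utterances.
rater1 = [3, 2, 2, 1, 3, 2, 1, 0]
rater2 = [3, 2, 1, 1, 3, 3, 1, 0]
rater3 = [2, 2, 2, 1, 3, 2, 0, 1]
for name, corr in (("Pearson", pearson), ("Spearman", spearman)):
    r_bar, alpha = mean_r_and_alpha([rater1, rater2, rater3], corr)
    print(name, round(r_bar, 2), round(alpha, 2))
```

Passing the correlation function as an argument keeps the alpha computation identical across correlation types, so any difference in the resulting alphas comes only from the choice of coefficient.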

Data frame
         α_k  α_spe  α_pea  α_pol  . . .  κ    %
Item 1   .47  .53    .48    .74    . . .  .22  .75
Item 2   .56  .55    .55    .67    . . .  .25  .80
Item 3   .62  .67    .64    .59    . . .  .30  .90
. . .
Item 30  .66  .86    .67    .92    . . .  .47  .66

Much the same.
[Figure: three panels: mean of correlation coefficients, Cronbach’s alpha, and McDonald’s omega]

Correlations among coefficients
[Scatterplot matrices; the pairwise correlations shown in the panels are listed below. ken = Kendall, spe = Spearman, pea = Pearson, pol = polychoric.]
Cronbach’s alpha
           alpha_ken  alpha_spe  alpha_pea  alpha_pol
alpha_ken  1          0.99       0.91       0.79
alpha_spe             1          0.93       0.81
alpha_pea                        1          0.81
alpha_pol                                   1
Mean of correlation coefficients
       m_ken  m_spe  m_pea  m_pol
m_ken  1      1.00   0.92   0.74
m_spe         1      0.94   0.76
m_pea                1      0.78
m_pol                       1

Correlations among coefficients
[Scatterplot matrix; the pairwise correlations shown in the panels are listed below. ken = Kendall, spe = Spearman, pea = Pearson, pol = polychoric.]
McDonald’s omega
            omegah_ken  omegah_spe  omegah_pea  omegah_pol
omegah_ken  1           0.97        0.86        0.69
omegah_spe              1           0.91        0.73
omegah_pea                          1           0.67
omegah_pol                                      1

Comment
• On a 4-point scale, Spearman’s and Pearson’s correlation coefficients give much the same results.

Suggestion
• Use polychoric correlation coefficients if you want to avoid violating statistical assumptions and/or underestimating the reliability of your data.

Reason
• Pearson’s should not be used for rating data.
• Use Spearman’s instead.
• But the correlation between the two is extremely high.
• They might measure the same construct.

Correlation among indices
[Scatterplot matrices; the pairwise correlations shown in the panels are listed below.]
Kendall’s-based indices
            m_ken  alpha_ken  omegah_ken
m_ken       1      0.99       0.97
alpha_ken          1          0.97
omegah_ken                    1
Spearman’s-based indices
            m_spe  alpha_spe  omegah_spe
m_spe       1      0.99       0.96
alpha_spe          1          0.97
omegah_spe                    1

Correlation among indices
[Scatterplot matrices; the pairwise correlations shown in the panels are listed below.]
Pearson’s-based indices
            m_pea  alpha_pea  omegah_pea
m_pea       1      0.99       0.95
alpha_pea          1          0.95
omegah_pea                    1
Polychoric-based indices
            alpha_pol  omegah_pol  m_pol
alpha_pol   1          0.94        0.98
omegah_pol             1           0.88
m_pol                              1

Suggestion
• You can use any of the three: the mean of correlation coefficients, Cronbach’s alpha, or McDonald’s omega.

ICC, Kappa, and %
        α    M of r  ω    ICC  κ    %
α       1    .98     .94  .75  .54  .53
M of r  .98  1       .88  .72  .54  .44
ω       .94  .88     1    .74  .48  .58
ICC     .75  .72     .74  1    .81  .72
κ       .54  .54     .48  .81  1    .61
%       .53  .44     .58  .72  .61  1
α: α using polychoric correlation coefficients
M of r: Mean of polychoric correlation coefficients
ω: ω using polychoric correlation coefficients
ICC: Intraclass correlation coefficients
κ: Fleiss’ kappa
%: Percentage of exact and adjacent agreements
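For readers less familiar with the agreement-type indices in the table above, the following sketch shows one way to compute Fleiss' kappa and a percentage of exact-or-adjacent agreement for several raters. The function names and the data are hypothetical, and treating an utterance as "in agreement" when all of its scores lie within one point of each other is an assumption; the slides do not define the percentage for three raters.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa.

    ratings: one list per utterance holding the score given by each rater.
    Every utterance must have the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j]: number of raters who gave category j to utterance i.
    counts = [[row.count(c) for c in categories] for row in ratings]

    # Per-utterance agreement: share of rater pairs that gave the same score.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items

    # Chance agreement from the overall proportion of each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


def exact_or_adjacent(ratings):
    """Share of utterances whose scores all lie within one point (assumption)."""
    return sum(max(r) - min(r) <= 1 for r in ratings) / len(ratings)


# Hypothetical scores (0-3) from three raters on four utterances.
scores = [[3, 3, 2], [2, 2, 2], [1, 2, 1], [3, 1, 2]]
print(fleiss_kappa(scores, [0, 1, 2, 3]))
print(exact_or_adjacent(scores))
```

Both functions look only at whether scores coincide, not at whether raters order the utterances in the same way, which is one way to see why these indices can diverge from the correlation-based ones.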

Comment
• “Agreement” may be a construct different from “reliability.”
[Diagram: Rater A and Rater B, with arrows labelled “True score” and “Agreement”]

• One more thing we have found:

A feature of alpha
Table 1: Item A (α = .92)
   A   B   C   D   E
A  1
B  .7  1
C  .7  .7  1
D  .7  .7  .7  1
E  .7  .7  .7  .7  1
Table 2: Item B (α = .92)
   F   G   H   I   J
F  1
G  .9  1
H  .9  .9  1
I  .5  .5  .5  1
J  .6  .6  .6  .9  1
The tables were created based on Schmitt (1996), Psychological Assessment.
To show the difference, we recommend reporting the SD of the correlation coefficients.
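The two tables can be checked with a short sketch. It assumes the standardized form of alpha, computed from the mean inter-rater correlation, which reproduces the .92 reported for both tables; the function name `alpha_and_sd` and the use of the sample SD (n - 1 in the denominator) are assumptions for illustration.

```python
from statistics import mean, stdev


def alpha_and_sd(correlations, k):
    """correlations: the k * (k - 1) / 2 pairwise correlations among k raters."""
    r_bar = mean(correlations)
    alpha = k * r_bar / (1 + (k - 1) * r_bar)  # standardized alpha
    return alpha, stdev(correlations)          # sample SD (assumption)


# Lower triangles of Table 1 and Table 2, read row by row.
table1 = [.7] * 10
table2 = [.9, .9, .9, .5, .5, .5, .6, .6, .6, .9]

a1, sd1 = alpha_and_sd(table1, 5)
a2, sd2 = alpha_and_sd(table2, 5)
print(round(a1, 2), round(sd1, 2))
print(round(a2, 2), round(sd2, 2))
```

Both alphas round to .92, but the SD of the correlations is 0 for Table 1 and about .19 for Table 2, which is the difference that alpha alone hides.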

In our data
   K    L    M
K  1
L  .80  1
M  .45  .90  1
[Figure: SD of correlation coefficients (0.05 to 0.20) plotted against alpha (0.4 to 0.8)]
   N    O    P
N  1
O  .95  1
P  .92  .76  1

Comments
• Even when two items yield much the same alpha, the correlations among raters can differ between the two items.

Another feature of alpha
α = .87
   Q   R   S
Q  1
R  .7  1
S  .7  .7  1
α = .93
   T   U   V   X   Y   Z
T  1
U  .7  1
V  .7  .7  1
X  .7  .7  .7  1
Y  .7  .7  .7  .7  1
Z  .7  .7  .7  .7  .7  1
α = .86
   a   b   c   d   e   f
a  1
b  .5  1
c  .5  .5  1
d  .5  .5  .5  1
e  .5  .5  .5  .5  1
f  .5  .5  .5  .5  .5  1
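The three values on this slide are consistent with the standardized form of alpha, written here for k raters with mean inter-rater correlation r̄. Treating the slide's values as standardized alphas is an inference from the numbers, not something the slide states.

```latex
\begin{align*}
\alpha &= \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}} \\
k = 3,\ \bar{r} = .7 &: \quad \alpha = \frac{2.1}{2.4} = .875 \\
k = 6,\ \bar{r} = .7 &: \quad \alpha = \frac{4.2}{4.5} \approx .933 \\
k = 6,\ \bar{r} = .5 &: \quad \alpha = \frac{3.0}{3.5} \approx .857
\end{align*}
```

With the correlations held at .7, doubling the number of raters raises alpha from about .87 to .93, and six raters who correlate at only .5 reach nearly the same alpha as three raters who correlate at .7.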

Final suggestions
• When you report reliability for rating data with more than two raters:
– Use polychoric correlation coefficients.
– Report the SD of the correlation coefficients among raters.
– The mean of the correlation coefficients might be used instead of alpha (it might be more comprehensible than alpha).

Outline
• Reliability indices in L2 performance assessment
• Reliability indices in psychometrics
• Observation of reliability indices
• Some comments and suggestions
