Difficulty Index, Discrimination Index, Reliability and Rasch Measurement Analysis

©drtamil@gmail.com - 2016
Difficulty Index, Discrimination
Index, Reliability & Rasch
Measurement Analysis
Azmi Mohd Tamil
Universiti Kebangsaan Malaysia

Steps to assess questionnaire
• Flesch reading ease – assess readability.
• Index of difficulty – proportion of persons answering
correctly.
• Item discrimination – how well the item discriminates
between those with a high & low knowledge score.
• Reliability/Ferguson’s Sigma Discriminatory Power
• Inter-item correlation matrix
• Item-total correlations
• Cronbach’s alpha – inter-item consistency.
• Factor analysis

Topics covered in this lecture
• Topics to be covered;
– Index of difficulty
– Item discrimination
– Reliability
– Item & Person Matching
(Rasch Measurement Analysis).
https://ppukm.org/2015/04/02/calculating-omr-indexes/
https://ppukm.org/2015/04/16/matching-the-right-questions-to-the-right-students-rasch-model-for-measurement/

Convert text file to Excel
• It is difficult to enter the answers for each
question into SPSS when there are a lot of
questions and a large number of students.
• So instead we make use of OMR machines and
scan their answers into a text file.
• Then convert the text file into Excel.
• http://www.palmx.org/mambo/content/view/173/45/

Sample of A Raw Text File
(Screenshot of the raw text file; the labelled columns are Matric. No. and Answers.)

Convert txt into Excel

Import into SPSS

Convert into Correct=1,Wrong=0
Converted Excel file available from http://drtamil.me/2016/03/02/fk6193-practical-diff-disc-index/
for use in the coming exercises.
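For readers who prefer a script to Excel, the Correct=1, Wrong=0 conversion can be sketched in Python. The answer key, matric numbers and answer strings below are made-up illustrations rather than the lecture's data, and `score_answers` is a hypothetical helper name.

```python
def score_answers(answers: str, key: str) -> list[int]:
    """Return 1 where the student's answer matches the key, else 0."""
    return [1 if a == k else 0 for a, k in zip(answers, key)]


# Hypothetical answer key and OMR answer strings (one character per question).
key = "ABCDA"
raw = {"A001": "ABCDA", "A002": "ABDDB", "A003": "CBCDA"}

# One row of 0/1 scores per student, keyed by matric number.
scored = {matric: score_answers(ans, key) for matric, ans in raw.items()}

for matric, row in scored.items():
    print(matric, row, "total =", sum(row))
```

Each row of `scored` corresponds to one row of the converted spreadsheet.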

Index of Difficulty
• $D = \frac{\text{students with correct answer}}{\text{total students}} \times 100$

PPUKM uses <30 (Difficult), 30 to 70 (Okay), >70 (Easy)
cut-off points.
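As a rough illustration, the Difficulty Index and the PPUKM cut-off points quoted above could be computed as follows. The function names and the four-student example are assumptions made for this sketch.

```python
def difficulty_index(item_scores: list[int]) -> float:
    """Percentage of students who answered the item correctly (scores are 0/1)."""
    return 100 * sum(item_scores) / len(item_scores)


def classify_difficulty(d: float) -> str:
    """Apply the PPUKM cut-offs: <30 Difficult, 30 to 70 Okay, >70 Easy."""
    if d < 30:
        return "Difficult"
    if d <= 70:
        return "Okay"
    return "Easy"


# Hypothetical item answered correctly by 3 of 4 students.
d = difficulty_index([1, 1, 0, 1])
print(d, classify_difficulty(d))
```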

Discrimination Index
• $R = \frac{H - L}{27\%\ \text{of total students}}$
• H = number of correct answers from top 27% of
students
• L = number of correct answers from bottom 27%
of students
• 27% out of 22 = 5.94, i.e. 6 students.

Interpretation of Discrimination Index
• PPUKM uses: negative & 0.00 (non-discriminating), 0.01
to 0.15 (poor), 0.16 to 0.25 (marginal), 0.26 to
0.35 (good), >0.35 (excellent).
Using the earlier Excel file,
we can calculate the D & R
• Example for D
– =SUM(B2:B23)/COUNT(B2:B23)*100 for the first
question
• Example for R
– Sort the total marks from largest to smallest
– =((SUM(B2:B7))-(SUM(B18:B23)))/6 for the first
question

Select, Copy, New Sheet,
Paste Special, Transpose
Click on Transpose

Import into SPSS & Analyse
Item 27 is difficult and indiscriminate;
it needs to be reviewed.

Reliability - Kuder and Richardson
Formula 20
The test checks the internal consistency of
measurements with dichotomous choices. A
correct answer scores 1 and an incorrect
answer scores 0. The test statistic is

$\rho_{KR20} = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} p_j q_j}{\sigma^2}\right)$

• k = number of questions
• pj = proportion of people in the sample who answered
question j correctly
• qj = proportion of people in the sample who didn’t answer
question j correctly (qj = 1 – pj)
• σ2 = variance of the total scores of all the people taking
the test = VARP(R1) where R1 = array containing the
total scores of all the people taking the test.
• Values range from 0 to 1. A high value indicates
reliability, while too high a value (in excess of .90)
indicates a homogeneous test.
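A small Python sketch of the KR-20 calculation, assuming the data are held as one list of 0/1 scores per student. `statistics.pvariance` is the population variance, which matches Excel's VARP. The four-student matrix is a toy example, not the lecture's data.

```python
from statistics import pvariance


def kr20(matrix):
    """Kuder-Richardson Formula 20 for a list of per-student 0/1 score lists."""
    n = len(matrix)        # number of students
    k = len(matrix[0])     # number of questions
    totals = [sum(row) for row in matrix]
    var_total = pvariance(totals)  # population variance, like VARP
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in matrix) / n  # proportion correct on item j
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Toy data: 4 students x 3 questions.
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(toy))
```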

From Our Table
k = 30, Sum of pj × qj = 4.7521, σ2 = 31.827
ρKR20 = (30/29)*(1-(4.7521/31.827))
= 0.88
High reliability, almost homogeneous.

From Our Table
σ2 = 31.827, ρKR20 = 0.88
S.E.M. = Standard deviation * SQRT (1 – Reliability)
= SQRT(31.827) * SQRT (1 – 0.88)
= 1.95.
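The same arithmetic can be checked in a couple of lines of Python. The function name is an assumption; the inputs are the figures quoted on the slide.

```python
import math


def standard_error_of_measurement(var_total, reliability):
    """S.E.M. = standard deviation x sqrt(1 - reliability)."""
    return math.sqrt(var_total) * math.sqrt(1 - reliability)


sem = standard_error_of_measurement(31.827, 0.88)
print(round(sem, 2))
```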

Conclusion
• Item 27 needs to be reviewed because it is
both indiscriminate (R=0.00) and difficult
(D=32).
• Average DI is Easy at 76.5.
• Average RI is Good at 0.36.
• ρKR20 is very reliable at 0.88.
• SEM is very small at 1.95.
• Overall a good set of examination questions.

Were the questions too easy?
Brief Introduction to
Rasch Measurement Analysis

Rasch Measurement Analysis
• Earlier we learnt about Difficulty and Discrimination
Index.
• In Rasch, the Item Measure of Difficulty (Di) plays the role of the
Difficulty Index (D). Instead of a Discrimination Index
(R), Rasch has the Persons’ Measure of Ability (Bi).
• A question with a high Discrimination Index (R) is able to
discriminate the good from the poor students.
• With Rasch, the higher a person’s ability (Bi), the
more likely he is to answer a difficult question correctly.
Therefore an indiscriminate item is detected when a
high-ability person cannot answer an easy item or a
low-ability person can answer a difficult item.
• Rasch uses logits (the natural log of the odds) instead of rates or %.

Difficulty & Ability Measures
• Item Measure of Difficulty (Di)
– Logit (number of wrong answers/number of
correct answers)
– Measured according to the items/questions.
• Persons’ Measure of Ability (Bi)
– Logit (number of correct answers/number of
wrong answers)
– Measured according to persons.

e.g. If you have 100 students, with a difficult question maybe 99 will get it
wrong and only 1 get it right. Odds of 99/1 is a logit of 4.5951.
A moderate difficulty question, maybe 50 will get it wrong and 50 will get
it right. Odds of 50/50 is a logit of 0.
A slightly easier question, maybe 25 will get it wrong and 75 will get it
right. Odds of 25/75 is a logit = -1.0986.
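The logit values in these examples are natural logarithms of the odds, which can be reproduced with a short Python sketch. The function names are assumptions. Note that an item or person with no wrong answers (or no correct answers) has no finite logit under these formulas.

```python
import math


def item_difficulty_logit(n_wrong, n_correct):
    """Di = ln(wrong / correct), counted over the persons answering the item."""
    return math.log(n_wrong / n_correct)


def person_ability_logit(n_correct, n_wrong):
    """Bi = ln(correct / wrong), counted over the items answered by the person."""
    return math.log(n_correct / n_wrong)


for wrong, correct in [(99, 1), (50, 50), (25, 75)]:
    print(wrong, correct, round(item_difficulty_logit(wrong, correct), 4))
```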
“Measurement is defined as the assignment of numerals to
objects or events according to rules.”
(“On the Theory of Scales of Measurement”; S.S. Stevens, 1946)
Rasch Model ‘logit’ scale for Di
Odds:  1/99  12/88  25/75  37/63  50/50  63/37  75/25  88/12  99/1
Logit: -4.6  -2.0   -1.1   -0.5   0      0.5    1.1    2.0    4.6
(exp converts a logit back to odds, e.g. e^-4.6, e^-1.1, e^0, e^1.1, e^4.6.)
Now, we already have a SCALE with a unit termed ‘logit’ for Di.

e.g. If you have 100 questions, with a good student maybe she will get 99
questions right and only get 1 wrong. Odds of 99/1 is a logit of 4.5951.
A moderately able student, maybe she will get 50 questions right and the
other 50 wrong. Odds of 50/50 is a logit of 0.
A weak student, maybe he will get 25 questions right and the other 75
wrong. Odds of 25/75 is a logit = -1.0986.
“The fact that numerals can be assigned under different rules leads to
different kinds of scales and different kinds of measurement.”
(“On the Theory of Scales of Measurement”; S.S. Stevens, 1946)
Rasch Model ‘logit’ scale for Bi
Odds:  1/99  12/88  25/75  37/63  50/50  63/37  75/25  88/12  99/1
Logit: -4.6  -2.0   -1.1   -0.5   0      0.5    1.1    2.0    4.6
(exp converts a logit back to odds, e.g. e^-4.6, e^-1.1, e^0, e^1.1, e^4.6.)
Now, we already have a SCALE with a unit termed ‘logit’ for Bi.

Measures of Item Difficulty (Di)
& Person’s Ability (Bi)
Please look closely at the Difficulty Index (D) and Measures of Item Difficulty (Di).
They are inversely related.

Import SPSS data into Winsteps
1. Open Winsteps
2. Click on Excel/RSSST
3. Click on SPSS button
4. Click on Select SPSS file
5. Your SPSS file will be imported into the “SPSS Processing for Winsteps” window.
6. Now copy and paste the identifying data and item data.

1. Cut the matric. no. and paste under “Person Label Variables”.
2. Then cut all question items and paste under “Item Response Variables”.
3. Click the “Construct Winsteps file” button.

The end product is a Winsteps file, which is really a text file.

Winsteps file after conversion

Open the file in Winsteps & press Enter twice.
The output shows the squared sum of residuals of the entire matrix.
Iteration is done until this value is as close as possible to 0.

Fit Statistics – How well was it measured?

The Manual & Winsteps Measures Differ!
(Table comparing the manual and Winsteps measures for each item.)
It is as though JMLE sets the item 7 measure to 0, as an anchor for the other item/person measures.

Manual Measures and Winsteps (Rasch
Software) Measures are not the same. Why?
• Due to Joint Maximum Likelihood Estimation (JMLE)
• Winsteps implements two methods of estimating Rasch
parameters from ordered qualitative observations: JMLE
and PROX. Estimates of the Rasch measures are obtained
by iterating through the data. Initially all unanchored
parameter estimates (measures) are set to zero. Then the
PROX method is employed to obtain rough estimates. Each
iteration through the data improves the PROX estimates
until they are usefully good. Then those PROX estimates are
the initial estimates for JMLE which fine-tunes them, again
by iterating through the data, in order to obtain the final
JMLE estimates. The iterative process ceases when the
convergence criteria are met.
• Confused?

Items Difficulty Measure – similar but not
the same due to different reference point.

Persons’ Ability Measures (Bi) - similar but not
the same due to different reference point.

JMLE adjustment
• So it is as though the JMLE sets one item (item 7)
as the anchor reference point, then all other
items/persons measures are adjusted
accordingly.
• So the differences between all the measures are
still the same, as shown in the scatter diagram.
• So the manual measures and Winsteps measures
are similar. r = 1. Just the reference point is
changed or adjusted.
• Of course the real calculation is much more
complicated.

JMLE adjustment
• First get the average of all the Item Measures of
Difficulty (Di).
• Then subtract that average from every Item
Measure of Difficulty (Di);
i.e. (-1.2) - (-1.41) = 0.21 ≈ 0.2
• The average of all these new Di values is then equal to 0.
Recalculate the Probability of
Answering Correctly
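• Recalculate using the old Person Measure of Ability
(Bi) and the new Item Measure of Difficulty (Di).
$P(\theta) = \frac{e^{(\beta_n - \delta_i)}}{1 + e^{(\beta_n - \delta_i)}}$
where βn is the person’s ability measure (Bi) and δi is the item difficulty measure (Di).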
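As a quick check of the formula, a Python sketch is given below. The function name is an assumption; the second call uses the person mean (1.75) and item mean (-0.11) reported later in the lecture.

```python
import math


def rasch_probability(ability, difficulty):
    """P = e^(B - D) / (1 + e^(B - D)) for one person-item pair."""
    odds = math.exp(ability - difficulty)
    return odds / (1 + odds)


print(rasch_probability(0.0, 0.0))               # ability equals difficulty
print(round(rasch_probability(1.75, -0.11), 3))  # mean person vs mean item
```

When ability equals difficulty the probability is 0.5; the more the ability exceeds the difficulty, the closer the probability gets to 1.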

SUM UP VARIANCE OF EXPECTED VALUES
• Calculate the variance of expected values for each cell;
P*(1-P)
• Sum up the variances according to rows and columns.
• Negative sums in blue column act as denominator to tweak
persons ability logit to maximise fit.
• Negative sums in yellow row act as denominator to tweak
items difficulty logit to maximise fit.

New Bi & new adjusted Di
• Tweak the Persons Measure of Ability (Bi) based on the sum of residuals (Observed – Expected) and the negative sum
of variances of expected values;
New Bi = Old Bi – (Sum of Residuals/Negative Sum of Variances)
• Tweak the Item Measure of Difficulty (Di) based on the negative sum of residuals (Observed – Expected) and
the negative sum of variances of expected values;
New Di = Old adj. Di – (Negative Sum of Residuals/Negative Sum of Variances)
• Get the average of all the new Item Measures of Difficulty (Di). Then get the new adj. Di by deducting the
value of the average from all the new Item Measures of Difficulty (Di);
i.e. (-0.046) - (-0.211) = 0.165 ~ 0.17
• Keep iterating until the squared sum of residuals of the entire matrix is as close as possible to 0.
• For this dataset, 4 iterations were required before that was achieved.
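The whole loop (initial logits, centring, expected values, residuals, variances, tweak, re-centre) can be sketched in Python as below. This is a simplified illustration and not Winsteps' implementation: the 5 x 4 response matrix is invented, the function names are assumptions, and the loop stops when the largest row or column residual sum is negligible rather than using the squared sum of residuals. It also assumes no person or item has an all-correct or all-wrong score, since those have no finite logit.

```python
import math


def rasch_p(b, d):
    """Probability of a correct answer for ability b and difficulty d."""
    return 1.0 / (1.0 + math.exp(d - b))


def centre(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def jmle(matrix, max_iter=200, tol=1e-6):
    n_persons, n_items = len(matrix), len(matrix[0])
    row_tot = [sum(r) for r in matrix]
    col_tot = [sum(r[i] for r in matrix) for i in range(n_items)]
    # Initial estimates: log-odds of the raw scores, items centred on 0.
    b = [math.log(t / (n_items - t)) for t in row_tot]
    d = centre([math.log((n_persons - t) / t) for t in col_tot])
    for iteration in range(1, max_iter + 1):
        p = [[rasch_p(bn, di) for di in d] for bn in b]
        # Residual (observed - expected) and variance sums by row and column.
        res_row = [row_tot[n] - sum(p[n]) for n in range(n_persons)]
        var_row = [sum(e * (1 - e) for e in p[n]) for n in range(n_persons)]
        res_col = [col_tot[i] - sum(p[n][i] for n in range(n_persons))
                   for i in range(n_items)]
        var_col = [sum(p[n][i] * (1 - p[n][i]) for n in range(n_persons))
                   for i in range(n_items)]
        if max(abs(r) for r in res_row + res_col) < tol:
            break
        # Persons move up when they score more than expected;
        # items move down (easier) when they are answered more than expected.
        b = [bn + r / v for bn, r, v in zip(b, res_row, var_row)]
        d = centre([di - r / v for di, r, v in zip(d, res_col, var_col)])
    return b, d, iteration


# Invented responses: 5 persons x 4 items, no extreme scores.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
abilities, difficulties, n_iter = jmle(responses)
print([round(x, 2) for x in abilities])
print([round(x, 2) for x in difficulties], "iterations:", n_iter)
```

The updates `bn + r / v` and `di - r / v` are the same tweaks as on the slide once the negative signs are cancelled.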

Need More Analysis?

Checking the Fit
Bubble Plot

Plots – Bubble Chart

Bubble Plot
(Figure: bubble plot with regions labelled Too good to be true, Okay and Erratic.)

Bubble Plot
- Item Measure is the Y-axis.
- Model S.E. is the diameter of the circle (reduced to 60%).
- InFit Z-STD is the X-axis.
- We expect that difficult items can be answered by the more able persons and that
easy items can be answered by all.
- Item 27 is considered erratic; although it is difficult, neither able nor weak
persons can answer it.
In the previous exercise, item 27 had a Difficulty Index of D=32 and a
Discrimination Index of R=0.00: difficult and yet unable to
discriminate. So it was already detected as a
problem question in the earlier analysis.

Bubble Plot
(Figure: how the bubble plot is drawn from the Infit Zstd and Item Measures, with regions labelled Too good to be true, Okay and Erratic. Bubble size is 60% of the Model S.E.)
Labels are Item No.; Infit Zstd, Item Measure, S.E.
27; 2.7, 2.89, 0.56
14; -0.8, 1.21, 0.53
23; -1.3, 2.82, 0.57
8; 0.0, -3.40, 1.85 (lowest item; largest error of 1.85, so largest bubble)
20; 1.3, 1.21, 0.53
11; 1.4, 0.34, 0.56
17; 0.2, -2.12, 1.06
24; 0.2, -2.12, 1.06
26; 1.7, 0.74, 0.58
28; 0.5, -0.35, 0.62
5; -1.0, -0.35, 0.62
12; -1.1, -0.35, 0.62
19; -1.0, 0.24, 0.59
Zstd for item 27 is larger than 2.0, therefore it is an erratic item and should be checked.

Bubble Plot
(Figure: the same bubble plot, showing the relationship of Difficulty Index with Item Measures and Discrimination Index with Infit Zstd.)
Labels are Item No.; Difficulty Index, Discrimination Index
27; D32, R0.00
14; D59, R0.83
23; D32, R0.83
8; D100, R0.00
20; D59, R0.33
11; D73, R0.33
17; D96, R0.17
24; D96, R0.17
26; D64, R0.33
28; D82, R0.33
5; D82, R0.67
12; D82, R0.67
19; D73, R0.83
Item 8 has high Difficulty Index & yet poor Discrimination Index.
Item 27 has low Difficulty Index and yet poor Discrimination Index.
Wright Map

Person Item Map
(Figure: Wright map marking the mean for persons and the mean for items.)
- Need tougher questions to test persons A, B, C, D, E, F & G.
- Questions too easy; not testing anybody.
- 1. Poor students; n=4 (18%)
- 2. Good students; n=18 (82%)
- On target, between the mean + 1sd: 13/30 = 43%

Item Measures

Scan the InFit Zstd for values larger than 2.0. For item 27 the InFit Zstd is 2.7,
therefore larger than 2.0. As stated earlier, such items are considered erratic and
should be removed or changed.
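Scanning the InFit Zstd column can also be done with a few lines of Python. The Zstd values are the ones shown on the bubble plot labels; the acceptable range of -2 to +2 is the one quoted in this lecture, and treating values below -2 as "Too good to be true" is an assumption based on the region labels of the bubble plot.

```python
def classify_fit(zstd):
    """Label an item by its InFit Zstd, using -2 < Z < +2 as the acceptable range."""
    if zstd > 2.0:
        return "Erratic"
    if zstd < -2.0:
        return "Too good to be true"  # assumed label for over-fitting items
    return "Okay"


# Item No. -> InFit Zstd, taken from the bubble plot labels.
infit_zstd = {27: 2.7, 14: -0.8, 23: -1.3, 8: 0.0, 20: 1.3,
              11: 1.4, 17: 0.2, 24: 0.2, 26: 1.7}

flagged = {item: classify_fit(z) for item, z in infit_zstd.items()
           if classify_fit(z) != "Okay"}
print(flagged)
```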

Item 27: VERY DIFFICULT = +2.89 logit. N=21, score=7, avg.=0.33. Many got it wrong.
BOTH y, z BREACHED. ITEM NEEDS REVIEW.
Large +Z due to inconsistency in response, e.g. a poorly able person can answer a difficult question.
Acceptable ranges: -2 < Z < +2; 0.5 < y < 1.5; 0.32 < x < 0.8
LOW PT. MEASURE CORRELATION. SOME POOR STUDENTS CAN ANSWER THE ITEM CORRECTLY WHILE GOOD STUDENTS GOT IT WRONG.

LOW PT. MEASURE CORRELATION (acceptable range 0.32 < x < 0.8). SOME POOR STUDENTS CAN ANSWER THE ITEM CORRECTLY WHILST GOOD STUDENTS GOT IT WRONG.
Item 8: EXTREMELY EASY = -3.40 logit. N=22, score=22, ave.=1, all correct.

Person Measures

Scan the InFit Zstd for values larger than 2.0. For Persons T & U the InFit Zstd is
larger than 2.0. Such erratic performance could be due to them getting some very easy
questions wrong.

Expected Score ICC
The InFit Zstd of item 27 is 2.7, therefore larger than 2.0.
So we will check the ICC of item 27 to see why it is erratic.
For comparison we will also look at the ICC of item 23, with the Infit Zstd of -1.3.

ICC Item 27, Di = 2.9, R = 0.0
For a difficult question, a less able person shouldn’t be able to answer.
But a more able person should be able to answer.
$P(\theta) = \frac{e^{(\beta_n - \delta_i)}}{1 + e^{(\beta_n - \delta_i)}}$
where;
e = Euler’s Number, 2.7183
βn = Person’s ability measure
δi = item difficulty measure
But the blue line is not following the red line. So the ability is not consistent.

ICC Item 23, Di = 2.8, R = 0.8
For a difficult question, a less able person shouldn’t be able to answer.
But a more able person should be able to answer.
$P(\theta) = \frac{e^{(\beta_n - \delta_i)}}{1 + e^{(\beta_n - \delta_i)}}$
Here the blue line is following the red line. So the ability is consistent with
the difficulty.

Summary Statistics (All)
- Item reliability 0.66: ‘Poor’ item instrument reliability in measuring student learning ability. Poor targeting.
- +ve Person mean, μ = 1.75 logit
P[Ɵ]LOi = e^(1.75+0.11) / (1 + e^(1.75+0.11)) = 6.4237/7.4237 = 0.865 (easy to pass)
- G = 1.90; Separation index = (4G+1)/3 = 2.9. ‘Good’ Person separation into 3 groups. So can have 3 grades!
- 0.78: ‘Fair’ person reliability
- Cronbach-α: 0.87. Good reliability assessment of student learning. Close to the ρKR20 of 0.88 calculated manually.
- Presence of one extreme person, “Mrs A”, and one extreme item, “Item 27”, causes the InFit & OutFit MNSQ & Zstd not to be calculated.

How ρKR20 was calculated earlier.
k = 30, Sum of pj × qj = 4.7521, σ2 = 31.827
ρKR20 = (30/29)*(1-(4.7521/31.827))
= 0.88
High reliability, almost homogeneous.

Summary Statistics (minus 1)
- Item reliability 0.65: ‘Poor’ item measurement reliability in measuring student learning ability. Poor targeting.
- +ve Person mean, μ = 1.58 logit
P[Ɵ]LOi = e^(1.58-0) / (1 + e^(1.58-0)) = 4.855/5.855 = 0.83 (easy to pass)
- InFit Zstd close to 0 & MNSQ close to 1, so data fit the model.
- G = 1.97; Separation index = (4G+1)/3 = 2.96. ‘Good’ Person separation into 3 groups. So can have 3 grades!
- 0.80: ‘Fair’ person reliability
- Cronbach-α not available after exclusion of extreme person data.
Interpretation of Person & item measurement reliability;
- <0.67 is Poor
- 0.67 – 0.80 Fair
- 0.81 – 0.90 Good
- 0.91 – 0.94 Very Good
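The separation formula and the reliability bands above can be written as two small Python helpers. The function names are assumptions, and because the lecture gives no label for values above 0.94, the sketch returns a neutral placeholder for them.

```python
def separation_index(g):
    """(4G + 1) / 3, as used in the summary statistics above."""
    return (4 * g + 1) / 3


def reliability_band(r):
    """Bands quoted in the lecture for person & item measurement reliability."""
    if r < 0.67:
        return "Poor"
    if r <= 0.80:
        return "Fair"
    if r <= 0.90:
        return "Good"
    if r <= 0.94:
        return "Very Good"
    return "above the listed bands"  # no label given in the lecture


print(round(separation_index(1.90), 1), round(separation_index(1.97), 2))
print(reliability_band(0.66), reliability_band(0.78), reliability_band(0.87))
```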

Guttman Scalogram

Guttman Scalogram
(Figure: responses sorted from easy to tough. Rows run from SMART Student-01 to POOR Student-22; columns run from EASY ITEMS to DIFFICULT ITEMS.)

Guttman Scalogram
Theorem 1. Persons who are more able / more developed
have a greater likelihood of correctly answering all the items /
being able to complete a given task.
Theorem 2. Easier items / tasks are more likely to be
answered correctly by all persons.
(Figure labels: CARELESS, where the predicted response is 1; GUESS, where the predicted response is 0.)

The Guttman Scalogram is arranged similarly to descending Difficulty Index.
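A scalogram-style arrangement can be sketched in Python by sorting persons from highest to lowest total score and items from most to least often answered correctly (that is, by descending Difficulty Index). Winsteps orders by Rasch measures; sorting by raw totals is a simplifying assumption of this sketch, and the 3 x 3 data and labels are invented.

```python
def guttman_scalogram(matrix, person_labels, item_labels):
    """Return item labels (easy to tough) and (person, response string) rows."""
    # Persons: highest total score first.
    person_order = sorted(range(len(matrix)),
                          key=lambda n: sum(matrix[n]), reverse=True)
    # Items: most often answered correctly (easiest) first.
    item_order = sorted(range(len(item_labels)),
                        key=lambda i: sum(row[i] for row in matrix),
                        reverse=True)
    rows = [(person_labels[n],
             "".join(str(matrix[n][i]) for i in item_order))
            for n in person_order]
    return [item_labels[i] for i in item_order], rows


# Invented data: 3 persons x 3 items.
data = [[0, 1, 0], [1, 1, 1], [0, 1, 1]]
items, rows = guttman_scalogram(data, ["P1", "P2", "P3"], ["Q1", "Q2", "Q3"])
print(items)
for label, pattern in rows:
    print(label, pattern)
```

In a well-behaved scalogram each row is a run of 1s followed by 0s; a 0 inside the run of 1s suggests a careless slip, and a 1 among the 0s suggests a guess.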

Item Misfit Response String

Item Misfit Response String
Most misfit item: exceeds the MNSQ limit, 0.5 < y < 1.5.
High Rating Response Zone 1. Items in red circles for the respective Persons were under rated.
Low Rating Response 0. Items in blue circles for the respective Persons were over rated.

Conclusion
• Item 27 needs to be relooked at. It needs better
phrasing of the question or a better choice of
answers.
• Rasch came out with almost the same
conclusion as our earlier analysis.
• Rasch also identifies gaps in question selection,
showing us that there were too many easy
questions and an absence of difficult questions that
could test those with higher ability. This is something
the conventional analysis couldn’t show clearly.

Conclusion (continued)
• Among the figures that were similarly generated
by both methods were;
– Difficulty Index D versus Item Measure of Difficulty Di
(inverse relationship)
– ρKR20 of 0.88 (manual) and 0.87 (Rasch).
– Both methods detected item 27 as an erratic item. Of
course Rasch proved it with multitudes of figures and
data.
– With the conventional method, it requires experience
to recognise items with low difficulty index and
non-discriminatory discrimination index as problem items.

Conclusion (continued)
• Most importantly, Rasch detected the following;
– ‘Poor’ item instrument reliability of 0.66 in
measuring student learning ability. Poor targeting.
– μ Bi = 1.75 logit; μ Di = -0.11
P[Ɵ]LOi = e^(1.75+0.11) / (1 + e^(1.75+0.11)) = 0.865
– Based on the above, questions were too easy.
– Separation index of 2.9 indicating ‘Good’ Person
separation into 3 groups. So can have 3 grades!
Excellent, Pass, Fail? Instead of the usual
A, A-, B+, B, B-, C+, C, C-, D & E?
