A. RELIABILITY
CHARACTERISTICS OF A GOOD TEST
Reliability
• Reliability is synonymous with consistency. It is
the degree to which test scores for an individual
test taker or group of test takers are consistent
over repeated applications.
• No psychological test is completely consistent;
however, a measurement that is unreliable is
worthless.
Would you keep using these
measurement tools?
The consistency of test scores is critically
important in determining whether a test
can provide good measurement.
When someone says you are a
‘reliable’ person, what do they really
mean?
Are you a reliable person?
Reliability (cont.)
* Because no unit of measurement is exact, any time you
measure something (observed score), you are really
measuring two things.
1. True Score - the amount of observed score that truly
represents what you are intending to measure.
2. Error Component - the amount of other variables that
can impact the observed score
Observed Test Score = True Score + Errors of
Measurement
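A minimal Python sketch (not from the slides; the numbers are invented) of the idea that every observed score is a true score plus a random error component:

```python
# Classical true-score model: Observed = True + Error (illustrative only)
import random

true_score = 75                   # the quantity we intend to measure
observed_scores = []
for administration in range(5):
    error = random.gauss(0, 3)    # random measurement error, SD = 3 points
    observed_scores.append(round(true_score + error, 1))

print(observed_scores)            # scores scatter around the true score of 75
```

The smaller the error component, the more consistent (reliable) the observed scores are across administrations.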
Measurement Error
• Any fluctuation in test scores that results from
factors related to the measurement process that
are irrelevant to what is being measured.
• The difference between the observed score and
the true score is called the error score:
S error = S observed – S true (equivalently, S true = S observed – S error)
Measurement Error is Reduced By:
- Writing items clearly
- Making instructions easily understood
- Adhering to proper test administration
- Providing consistent scoring
Determining Reliability
• There are several ways that reliability can be
determined, depending on the type of measurement and the
supporting data required. They include:
- Internal Consistency
- Test-retest Reliability
- Interrater Reliability
- Split-half Methods
- Odd-even Reliability
- Alternate Forms Methods
Internal Consistency
• Measures the reliability of a test based solely on the number of
items on the test and the intercorrelations among the
items. Therefore, it compares each item to every other
item.
Cronbach’s Alpha: .80 to .95 (Excellent)
.70 to .80 (Very Good)
.60 to .70 (Satisfactory)
<.60 (Suspect)
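A hedged sketch of how Cronbach's alpha can be computed from an item-score matrix (rows are test takers, columns are items); the data here are invented for illustration:

```python
import numpy as np

# Invented scores: 5 test takers x 4 items
scores = np.array([
    [4, 3, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [4, 5, 4, 4],
])

k = scores.shape[1]                               # number of items
item_variances = scores.var(axis=0, ddof=1)       # variance of each item
total_variance = scores.sum(axis=1).var(ddof=1)   # variance of the total scores

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(round(alpha, 2))   # compare with the .60 / .70 / .80 benchmarks above
```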
Split Half & Odd-Even Reliability
Split Half - refers to determining a correlation between the first
half of the measurement and the second half of the measurement
(i.e., we would expect answers to the first half to be similar to the
second half).
Odd-Even - refers to the correlation between even items and odd
items of a measurement tool.
• In this sense, we are using a single test to create two tests,
eliminating the need for additional items and multiple
administrations.
• Since in both of these types only 1 administration is needed and
the groups are determined by the internal components of the test,
it is referred to as an internal consistency measure.
Split-half reliability
[error due to differences in item content between the halves of the test]
• Typically, responses on odd versus even items are employed
• Correlate total scores on odd items with the scores obtained
on even items
Person   Odd   Even
1        36    43
2        44    40
3        42    37
4        33    40
[scatterplot of odd-half versus even-half scores omitted]
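A minimal sketch (invented item responses) of the odd-even split described above. The final Spearman-Brown step-up is not shown on the slide but is commonly used to estimate full-length reliability from the half-test correlation:

```python
import numpy as np

# Invented data: each row is one person's item scores (1 = correct, 0 = incorrect)
items = np.array([
    [1, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 0],
])

odd_totals  = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_totals = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8

r_half = np.corrcoef(odd_totals, even_totals)[0, 1]   # half-test correlation
r_full = (2 * r_half) / (1 + r_half)                  # Spearman-Brown step-up
print(round(r_half, 2), round(r_full, 2))
```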
Test-retest Reliability
• Test-retest reliability is usually measured by computing
the correlation coefficient between scores of two
administrations.
Test-retest Reliability (cont.)
• The amount of time allowed between measures is critical.
• The shorter the time gap, the higher the correlation; the longer
the time gap, the lower the correlation. This is because the two
observations are related over time.
• Optimum time between administrations is 2 to 4 weeks.
• The rationale behind this method is that any difference
between the scores on the test and the retest should be due
solely to measurement error.
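A minimal sketch (made-up scores) of computing the test-retest coefficient; the same correlation approach applies to the interrater and parallel-forms methods below:

```python
import numpy as np

test   = np.array([36, 44, 42, 33, 40, 38])   # first administration
retest = np.array([38, 45, 40, 35, 41, 37])   # second administration, 2-4 weeks later

r = np.corrcoef(test, retest)[0, 1]
print(round(r, 2))   # the test-retest reliability coefficient
```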
Interrater Reliability
• Whenever you use humans as a part of your measurement
procedure, you have to worry about whether the results you get
are reliable or consistent. People are notorious for their
inconsistency. We are easily distractible. We get tired of doing
repetitive tasks. We daydream. We misinterpret.
Interrater Reliability (cont.)
• For some scales it is important to assess interrater
reliability.
• Interrater reliability means that if two different raters
scored the scale using the scoring rules, they should
attain the same result.
• Interrater reliability is usually measured by computing
the correlation coefficient between the scores of two
raters for the set of respondents.
• Here the criterion of acceptability is pretty high (e.g., a
correlation of at least .9), but what is considered
acceptable will vary from situation to situation.
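A short sketch (hypothetical ratings) of interrater reliability as described above: correlate two raters' scores for the same respondents and compare against the roughly .90 criterion:

```python
import numpy as np

rater_a = np.array([12, 18, 15, 20, 9, 14])   # rater A's scores for six respondents
rater_b = np.array([13, 17, 15, 19, 10, 15])  # rater B's scores for the same respondents

r = np.corrcoef(rater_a, rater_b)[0, 1]
print(round(r, 2), "acceptable" if r >= 0.9 else "review the scoring rules")
```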
Parallel/Alternate Forms Method
Parallel/Alternate Forms Method - refers to the
administration of two alternate forms of the
same measurement device and then comparing
the scores.
• Both forms are administered to the same person and
the scores are correlated. If the two produce the
same results, then the instrument is considered
reliable.
Parallel/Alternate Forms Method (cont.)
• A correlation between these two forms is computed just
as in the test-retest method.
Advantages
• Eliminates the problem of memory effect.
• Reactivity effects (i.e., experience of taking the test) are
also partially controlled.
Factors Affecting Reliability
• Administrator Factors
• Number of Items on the instrument
• The Instrument Taker
• Heterogeneity of the Items
• Heterogeneity of the Group Members
• Length of Time between Test and Retest
How High Should Reliability Be?
• A highly reliable test is always preferable to a test with
lower reliability.
.80 or greater (Excellent)
.70 to .80 (Very Good)
.60 to .70 (Satisfactory)
<.60 (Suspect)
• A reliability coefficient of .80 indicates that 20% of the
variability in test scores is due to measurement error.
Reliability deals with consistency.
Reliability is the quality that guarantees us that
we will get similar results when conducting the
same test on the same population every time.
Consider this ruler…
Now compare this ruler…
With this one…
Each ruler will give the same answer each time…
But this one will be wrong each time…
Each ruler is reliable…
But reliability doesn't mean much when it is
wrong…
So, not only do we require reliability…
We also need…
VALIDITY
[ruler illustration: Good Ruler vs. Bad Ruler]
VALIDITY
Validity deals with the
accuracy of the
measurement
Validity
 Depends on the PURPOSE
 E.g. a ruler may be a valid measuring device
for length, but isn’t very valid for measuring
volume
 Measuring what ‘it’ is supposed to
 Matter of degree (how valid?)
 Specific to a particular purpose!
 Learning outcomes
1. Content coverage (relevance?)
2. Level & type of student engagement
(cognitive, affective, psychomotor) –
appropriate?
Types of validity measures
 Face validity
 Construct validity
 Content validity
 Criterion validity
Face Validity
Does it appear to measure what it is supposed to measure?
Example: Let’s say you are interested in measuring,
‘Propensity towards violence and aggression’. By simply
looking at the following items, state which ones qualify to
measure the variable of interest:
 Have you been arrested?
 Have you been involved in physical fighting?
 Do you get angry easily?
 Do you sleep with your socks on?
 Is it hard to control your anger?
 Do you enjoy playing sports?
Construct Validity
 Does the test measure the ‘human’ theoretical construct
or trait?
 Examples
 Mathematical reasoning
 Verbal reasoning or fluency
 Musical ability
 Spatial ability
 Motivation
 Applicable to authentic assessment
 Each construct is broken down into its component parts
 E.g. ‘motivation’ can be broken down to:
 Interest
 Attention span
 Hours spent
 Assignments undertaken and submitted, etc.
All of these sub-constructs put together – measure ‘motivation’
Content Validity
How well do the elements of the test relate to the content
domain?
How closely does the content of the questions in the test relate to
the content of the curriculum?
Directly relates to instructional objectives and the
fulfillment of the same!
Major concern for achievement tests (where content is
emphasized)
Can you test students on things they have not been
taught?
How to establish Content Validity?
 Instructional objectives (looking at your list)
 Table of Specification
 E.g.
 At the end of the chapter, the student will be able to
do the following:
1. Explain what ‘stars’ are
2. Discuss the type of stars and galaxies in our universe
3. Categorize different constellations by looking at the stars
4. Differentiate between our stars, the sun, and all other stars
Table of Specification (An Example)

                         Categories of Performance (Mental Skills)
Content areas            Knowledge   Comprehension   Analysis   Total
1. What are ‘stars’?
2. Our star, the Sun
3. Constellations
4. Galaxies
Total                                                            Grand Total
Criterion Validity
The degree to which content on a test (predictor)
correlates with performance on relevant criterion
measures (concrete criterion in the "real" world?)
If they do correlate highly, it means that the test
(predictor) is a valid one!
E.g. if you taught skills relating to ‘public speaking’ and
had students do a test on it, the test can be validated by
looking at how it relates to actual performance (public
speaking) of students inside or outside of the
classroom
Factors that can lower Validity
 Unclear directions
 Difficult reading vocabulary and sentence structure
 Ambiguity in statements
 Inadequate time limits
 Inappropriate level of difficulty
 Poorly constructed test items
 Test items inappropriate for the outcomes being measured
 Tests that are too short
 Improper arrangement of items (complex to easy?)
 Identifiable patterns of answers
 Teaching
 Administration and scoring
 Students
 Nature of criterion
Validity and Reliability
Neither Valid
nor Reliable
Reliable but not
Valid
Valid & Reliable
Fairly Valid but
not very Reliable
Think in terms of ‘the
purpose of tests’ and the
‘consistency’ with which
the purpose is
fulfilled/met
Objectivity
the state of being fair, without bias or external
influence.
if the test is marked by different people, the
score will be the same. In other words, the marking
process should not be affected by the marker's
personality.
Not influenced by emotion or personal
prejudice. Based on observable phenomena;
presented factually: an objective appraisal.
The questions and answers should be clear
 measures an individual's characteristics in a
way that is independent of rater’s bias or the
examiner's own beliefs
gauges the test taker's conscious thoughts and
feelings without regard to the test administrator's
beliefs or biases.
helps greatly in determining the test taker's
personality.
Understanding Norms
a list of scores and corresponding percentile ranks,
standard scores, or other transformed scores of a
group of examinees on whom a test was
standardized.
In a psychometric context, “norms are the test
performance data of a particular group of test takers
that are designed for use as a reference for evaluating
or interpreting individual test scores” (Cohen &
Swerdlik, 2002, p. 100).
TYPES OF NORMS
•Percentiles
- refer to a distribution divided into 100
equal parts.
- refer to the score at or below which a
specific percentage of scores fall.
Ex. A student got a percentile rank of 90 on the NAT exam.
What does this mean?
It means that 90% of his classmates scored lower than he did,
or 10% of his classmates scored above him.
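A minimal sketch (illustrative scores) of a percentile rank as defined above: the percentage of scores in the reference group at or below a given score:

```python
norm_group = [55, 60, 62, 65, 70, 72, 75, 78, 80, 90]   # invented reference scores

def percentile_rank(score, scores):
    # percentage of scores at or below the given score
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

print(percentile_rank(80, norm_group))   # 90.0 -> a percentile rank of 90
```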
Age Norms (age-equivalent scores)
–“indicate the average performance of
different samples of test takers who were at
various ages at the time the test was
administered” (Cohen & Swerdlik, 2002, p.
105).
Grade Norms
–Used to indicate the average test
performance of testtakers in a specific grade.
–Based on a ten-month scale, referring to grade
and month (e.g., 7.3 is equivalent to seventh
grade, third month).
•National Norms
–Derived from a standardization sample nationally
representative of the population of interest.
Subgroup Norms
–Are created when narrowly defined groups are
sampled.
Ex. •Socioeconomic status
•Handedness
•Education level
Local Norms
–Are derived from the local population’s performance
on a measure.
- Typically created locally (i.e., by guidance counselor,
personnel director, etc.)
Fixed Reference Group Scoring Systems
•Calculation of test scores is based on a fixed
reference group that was tested in the past.
•Norm referenced tests consider the
individual’s score relative to the scores of
testtakers in the normative sample.
•Criterion Referenced tests consider the
individual’s score relative to a specified
standard or criterion (cut score).
–Licensure exams
–Proficiency tests
Item Analysis
A name given to a variety of statistical techniques
designed to analyze individual items on a test
It involves examining class-wide performance on
individual test items.
It sometimes suggests why an item has not
functioned effectively and how it might be
improved
A test composed of items revised and selected on
the basis of item-analysis is almost certain to be
more reliable than the one composed of an equal
number of untested items.
Difficulty index
The proportion of students in the class who got
an item correct. The larger the proportion,
the more students have learned the
content measured by the item.
Discrimination index
A basic measure of the validity of an item.
A measure of an item’s ability to
discriminate between those who scored high
on the total test and those who scored low.
It can be interpreted as an indication of the
extent to which overall knowledge of the
content area or mastery of the skill is related
to the response on an item
Analysis of response options/distracter
analysis
In addition to examining the performance of a test
item, teachers are often interested in examining
the performance of individual distracters
( incorrect answer options) on multiple-choice
items
By calculating the proportion of students who
chose each answer option, teachers can identify
which distracters are working and appear to be
attractive to students who do not know the correct
answer, and which distracters are simply taking up
space and not being chosen by many students
To eliminate blind guessing which
results in a correct answer purely by
chance (which hurts the validity of a
test item), teachers want as many
plausible distracters as is feasible.
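A minimal sketch (hypothetical response counts) of a distracter analysis: compute the proportion of the class choosing each option and flag options that almost no one selects:

```python
# Option counts for one multiple-choice item; D is assumed to be the key
responses = {"A": 4, "B": 2, "C": 4, "D": 13, "E": 1}
total = sum(responses.values())

for option, count in responses.items():
    proportion = count / total
    print(option, round(proportion, 2))
# Distracters chosen by almost nobody (here E) are simply taking up space;
# options that attract students who missed the item (A, C) are working.
```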
The process of item analysis
1. Arrange the test scores from highest to lowest
2. Select the criterion groups
Identify a High group and a Low group. The High
group is the highest-scoring 27% of the group and the Low
group is the lowest scoring 27%
27% of the examinees is called the criterion group. It
provides the best compromise between two desirable but
inconsistent aims:
to make the extreme groups as large as possible
and as different as possible
then we can say with confidence that those in the High
group are superior, in the ability measured by the test, to
those in the Low group.
3. For each item, count the number of
examinees in the High group who have correct
responses. Do a separate, similar procedure for the
low group
4. Solve for the difficulty index of each item
 The larger the value of the index, the easier the item.
 The smaller the value, the more difficult is the item.
 Scale for interpreting the difficulty index of an item
Below 0.25 item is very difficult
0.25 – 0.75 item is of average difficulty
or item is rightly difficult
Above 0.75 item is very easy
Example: Item analysis
1. Count and arrange the scores from highest to
lowest.
 Ex. n=43 scores
2. Calculate the criterion group (N) which is 27% of
the total number of scores.
 Ex. N = 27% of 43 = (0.27)(43) = 11.61, rounded to 12
3. Take 12 scores from the highest down and take 12
scores from the lowest up, call these High group and
Low group respectively.
4. Tabulate the number of responses for each options
from the high and low groups for that particular item
under analysis.
5. Solve for the difficulty index of each item
 The larger the value of the index, the easier the
item. The smaller, the more difficult.
 Scale for interpreting the difficulty index of an
item
Below 0.25 item is very difficult
0.25 – 0.75 item is of average difficulty or
item is rightly difficult
Above 0.75 item is very easy
Ex: Item #5 of the Multiple Choice test; D is the correct option.

              A   B   C   D*  E   Total
Upper Group   1   1   0   9   1   12
Lower Group   3   1   4   4   0   12
The following can be used to interpret the
index of discrimination.

Idis Index     Description   Interpretation
0.40 – 1.0     High          The item is very good
0.30 – 0.39    Moderate      Reasonably good, can be improved
0.20 – 0.29    Moderate      In need of improvement
< 0.20         Low           Poor, to be discarded

Idis            Idif             Item category
High            Easy             Good
High            Easy/difficult   Fair
Moderate        Easy/difficult   Fair
High/moderate   Easy/difficult   Fair
Low             At any level     Poor (discard the item)
•Interpreting the results by giving value judgment
Index of difficulty = (Hc + Lc) / 2N = (9 + 4) / 2(12) = .54
---- the item is rightly difficult
Index of discrimination = (Hc – Lc) / N = (9 – 4) / 12 = .42
---- high index of discrimination
---- the item has the power to discriminate
Hence, item number 5 has to be
retained.
Distracter analysis: A and C are good
distracters
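A short sketch reproducing the calculations in the worked example above (Hc = 9 correct in the High group, Lc = 4 in the Low group, N = 12 per group):

```python
Hc, Lc, N = 9, 4, 12

difficulty     = (Hc + Lc) / (2 * N)   # 13/24 = 0.54 -> rightly (average) difficult
discrimination = (Hc - Lc) / N         # 5/12  = 0.42 -> high; retain the item

print(round(difficulty, 2), round(discrimination, 2))
```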
Thank you and God bless us
all!