2. VALIDATION TASK
To establish whether the interpretation and uses
of the VSTEP test scores were valid for measuring the
English language competence of test-takers
from level 3 to level 5 on the Vietnamese English
language competence scale
25/12/2015
3. VALIDITY & VALIDATION
Validity is an integrated evaluative judgment of the degree to
which empirical evidence and theoretical rationales support the
adequacy and appropriateness of inferences and actions based
on test scores or other models of assessment.
(Messick, 1989)
Validation is the process of marshalling evidence and arguments in support of,
or counter to, proposed interpretations and uses of test scores.
(Messick, 1989)
6. MESSICK (1989)’S ASPECTS OF VALIDITY
The content aspect
Content relevance
Representativeness
Technical quality
The substantive aspect
Theoretical rationales for observed consistencies in responses
Process of performance
Empirical evidence of process
7. MESSICK (1989)’S ASPECTS OF VALIDITY
The structural aspect
The fidelity of the scoring structure to the construct structure.
The generalizability aspect
The extent to which score properties and interpretations
generalize to and across groups, settings and tasks
Reliability
Content representativeness
8. MESSICK (1989)’S ASPECTS OF VALIDITY
The external aspect
Convergent and discriminant evidence
Criterion relevance
Applied utility
The consequential aspect
Value implications as a basis for action/consequences
Bias
Fairness
9. MESSICK (1989)’S VALIDITY FRAMEWORK
Value
The most influential framework of validity
Criticisms
Abstract
Difficult for a single researcher to apply in full
No specific guidance for particular validation contexts
10. VALIDITY THEORIES
Kane (1992)’s article and (2006)’s Validation chapter
Argument-based Approach to Validation
Interpretive Argument (the development stage)
The network of inferences and assumptions
Validity Argument (the appraisal stage)
Logical evidence
Empirical evidence
11. KANE (1992)’S VALIDITY FRAMEWORK
Values
The most practical, objective framework of validity
Unique interpretive argument, consistent validity argument
steps (Bachman, 2004)
Criticisms
No attention to the structural aspect (Messick, 1995)
Inadequate attention/method to policy context and
consequences of tests (McNamara, 2006).
12. LANGUAGE TEST VALIDATION
Bachman (1990)’s framework, after Messick (1989)’s
Bachman (2004)’s framework, after Kane (1992)’s
14. VALIDATION QUESTIONS
1. To what extent was the test content relevant to and
representative of the domain of English language ability?
2. To what extent was each sub-test successful in measuring
students’ English language ability?
3. How well did the test-takers’ test scores on the VSTEP
correlate with their test scores on the IELTS?
4. What were the consequences of the VSTEP test scores’
interpretation and use?
16. CONTENT: RELEVANCE
• Topical content
• Typical behavior
• Underlying process
• Test specifications
17. CONTENT: TECHNICAL QUALITY
Empirical Evidence
• difficulty level
• discriminating power
Expert Judgment
• readability level
• freedom from ambiguity/irrelevancy
• appropriateness of keyed answers & distractors
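The two empirical indices above can be illustrated with a classical item analysis sketch. This is a minimal example on simulated 0/1-scored data (the response matrix, sample size, and item count are all invented for illustration), not the study's actual analysis: item difficulty as the proportion correct, and discrimination as the point-biserial correlation between an item and the rest-of-test score.

```python
# Classical item analysis on a 0/1-scored response matrix.
# Difficulty (facility) = proportion of test-takers answering correctly;
# discrimination = point-biserial correlation of the item with the
# total score on the remaining items. Data here are simulated.
import numpy as np

rng = np.random.default_rng(0)
ability = rng.normal(size=200)                       # hypothetical test-takers
item_b = np.linspace(-2, 2, 10)                      # hypothetical items
prob = 1 / (1 + np.exp(-(ability[:, None] - item_b[None, :])))
responses = (rng.random((200, 10)) < prob).astype(int)

p_values = responses.mean(axis=0)                    # item difficulty

def point_biserial(item, matrix):
    """Correlate one item with the total score of the remaining items."""
    rest = matrix.sum(axis=1) - matrix[:, item]
    return np.corrcoef(matrix[:, item], rest)[0, 1]

disc = np.array([point_biserial(i, responses) for i in range(10)])
for i, (p, d) in enumerate(zip(p_values, disc)):
    print(f"item {i}: difficulty={p:.2f}, discrimination={d:.2f}")
```

Items with very high or very low facility, or with low discrimination, would be flagged for the expert-judgment review described above.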
18. CONTENT: REPRESENTATIVENESS
“The breadth of the content specifications for a test should
reflect the breadth of the construct invoked in score
interpretation” (Messick, 1989, p. 35).
All essential components of the construct domain are
covered (Messick, 1994, p. 12).
19. CONTENT: CONTENT ANALYSIS BY EXPERTS
• What knowledge and skills are needed to do each
item correctly?
• How relevant are the items to their assigned
objectives and domain?
Domain
• English secondary school curricula
• English program at the college
21. CONTENT: ITEM FIT STATISTICS
Smith (2004) suggested using item fit statistics to evaluate the
extent to which items tap into the same construct and place
test-takers in the same order.
- the extent to which the use of each item is consistent with the
way people have responded to the other items
- does the item rank order the individuals in a manner similar to
other items? (p. 106)
Smith (2004) argued that test-takers should be ranked
consistently by items measuring the same construct. If not, the
items that misfit the Rasch model, i.e. the items that measure
a different construct, should be revised or eliminated (p. 107).
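The fit statistics Smith (2004) describes can be sketched as follows, assuming person abilities and item difficulties have already been estimated (operational analyses would use dedicated software such as ConQuest; the data here are simulated for illustration). Infit and outfit are mean-square statistics on standardized residuals; values near 1 indicate fit, and conventional flags are roughly below 0.7 or above 1.3.

```python
# Rasch item-fit sketch: infit (information-weighted) and outfit
# (unweighted) mean squares from standardized residuals, given
# already-estimated person abilities (theta) and item difficulties (b).
import numpy as np

def rasch_fit(responses, theta, b):
    """responses: (n_persons, n_items) 0/1 matrix; returns (infit, outfit)."""
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))   # expected scores
    var = p * (1 - p)                                      # model variance
    z2 = (responses - p) ** 2 / var                        # squared std. residuals
    outfit = z2.mean(axis=0)                               # unweighted MSQ
    infit = ((responses - p) ** 2).sum(axis=0) / var.sum(axis=0)
    return infit, outfit

rng = np.random.default_rng(1)
theta = rng.normal(size=300)                               # hypothetical persons
b = np.linspace(-1.5, 1.5, 8)                              # hypothetical items
p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
x = (rng.random((300, 8)) < p).astype(int)                 # data fit the model
infit, outfit = rasch_fit(x, theta, b)
print(np.round(infit, 2), np.round(outfit, 2))             # both near 1
```

Because the simulated responses are generated from the model itself, all items should show mean squares close to 1; an item measuring a different construct would drift away from 1.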
22. SUBSTANTIVE & STRUCTURAL
To what extent were the VSTEP sub-tests successful in
measuring students’ English language competence?
ITEM RESPONSE THEORY (RASCH MODEL)
• item fit
• item discrimination
• item cluster
DESCRIPTIVE STATISTICS
• choice response analysis
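The choice response (distractor) analysis listed above can be sketched as a comparison of option choices between high- and low-scoring groups. Everything in this example is hypothetical (the option labels A-D, the keyed answer "B", the group cut-offs at the upper and lower 27%): a functioning key should attract the high group, while distractors should not.

```python
# Distractor analysis sketch: for each option of one multiple-choice item,
# compare the proportion of high- vs low-scoring test-takers choosing it.
# All data are simulated; "B" is an assumed keyed answer.
import numpy as np

rng = np.random.default_rng(2)
n = 240
total = rng.integers(10, 41, size=n)                  # hypothetical total scores
# Simulate one item: stronger students pick the key "B" more often.
p_key = (total - 10) / 30 * 0.6 + 0.25
choices = np.where(rng.random(n) < p_key, "B",
                   rng.choice(np.array(["A", "C", "D"]), size=n))

cut_hi, cut_lo = np.quantile(total, [0.73, 0.27])     # upper/lower 27% groups
hi, lo = choices[total >= cut_hi], choices[total <= cut_lo]
for opt in "ABCD":
    print(f"{opt}: high {np.mean(hi == opt):.2f}  low {np.mean(lo == opt):.2f}")
```

A distractor chosen as often (or more often) by the high group as by the low group would be a candidate for revision.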
23. CRITERION-RELATED
How well did the test-takers’ VSTEP overall and
sub-test scores correlate with the test-takers’
overall and sub-test IELTS scores?
24. CONSEQUENCES
• The value implications of score interpretation
• The actual and potential consequences of score uses
(Messick, 1989)
FOCUS: the validity of test score interpretation and use,
i.e. construct under-representation or construct-irrelevant
variance
25. CONSEQUENCES
Sources of evidence
• Content relevance and representativeness
• Item bias
• Technical quality of the test
• Expert judgment
26. References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards
for Educational and Psychological Testing. Washington, DC: Authors.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards
for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Andrich, D., & Mercer, A. (1997). International perspectives on selection methods of entry into higher education. Canberra: National Board of
Employment, Education and Training [and] Higher Education Council.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press.
Berk, R. A. (1980). Item Analysis. In R. A. Berk (Ed.), Criterion-referenced measurement: the state of the art. Baltimore and London: The Johns Hopkins
University Press.
Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (pp. 621-694). Washington, D.C.: American Council on Education.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, California: Sage Publications.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527-535.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Westport, CT: American Council on
Education/Praeger.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(4), 635-694.
McNamara, T., & Roever, C. (2006). Language testing: the social dimension. Malden, MA: Blackwell Publishing.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education/Macmillan.
MOET. (2006). Secondary Education Curriculum: English. Hanoi: Education Publisher.
Moss, P. A. (2007). Reconstructing Validity. Educational Researcher, 36(8), 470-476.
Popham, W. J. (1997). Consequential Validity: Right Concern--Wrong Concept. Educational Measurement: Issues and Practice, 16(2), 9-13.
Purpura, J. E. (1999). Learner strategy use and performance on language tests: a structural equation modeling approach. Cambridge: Cambridge
University Press.
Smith, E. V. (2004). Evidence for Reliability of Measures and Validity of Measure Interpretation: A Rasch Measurement Perspective. In E. V. Smith & R.
M. Smith (Eds.), Introduction to Rasch Measurement: Theory, Models and Applications. Maple Grove: JAM Press.
Wu, M. L., Adams, R. J., & Haldane, S. (2008). ConQuest: Generalised Item Response Modelling Software [computer program]. Camberwell: Australian
Council for Educational Research.