1. Larry D. Gruppen, Ph.D.
University of Michigan
From Concepts to Data:
Conceptualization, Operationalization, and
Measurement in Educational Research
2. Objectives
• Identify key research
design issues
• Wrestle with the
complexities of
educational measurement
• Explain the concepts of
reliability and validity in
educational measurement
• Apply criteria for
measurement quality
when conducting
educational research
3. Agenda
• A brief nod to design
• From theory to measurement
• Criteria for measurement quality
– Reliability
– Validity
• Application: analyze an article
4. Guiding Principles for
Scientific Research in Education
1. Question: pose a significant question that can be
investigated empirically
2. Theory: link research to relevant theory
3. Methods: use methods that permit direct investigation of
the question
4. Reasoning: provide coherent, explicit chain of reasoning
5. Replicate and generalize across studies
6. Disclose research to encourage professional scrutiny and
critique
5. Study design
• Study design consists of:
– Your measurement method(s)
– The participants and how they are assigned
– The intervention
– The sequence and timing of measurements
and interventions
6. Comparison Group
• Pre-post design - compare intervention group to
itself
• Non-equivalent control group design - compare
intervention group to an existing group
• Randomized control group design - compare to
equivalent controls
7. Overview of Study Designs
• Symbols
– Each line represents a group.
– x = Intervention (e.g. treatment)
– O1, O2, O3…= Observation (measurement) at
Time 1, Time 2, Time 3, etc.
– R = Random assignment
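Using these symbols, the three comparison-group designs named on the previous slide can be sketched as follows (a plain-text rendering; the original design diagrams on slides 8-18 are not reproduced in this transcript):

    Pre-post design:                 O1  x  O2
    Non-equivalent control group:    O1  x  O2
                                     O1      O2
    Randomized control group:     R  O1  x  O2
                                  R  O1      O2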
19. The Challenge of Educational
Measurement
• Almost all of the constructs we are interested in
are buried inside the individual
• Measurement depends on transforming these
internal states, events, capabilities, etc. into
something observable
• Making them observable may alter the thing we
are measuring
20. Examples of Measurement Methods
• Tests (knowledge, performance): defined
response, constructed response, simulations
• Questionnaires (attitudes, beliefs, preferences):
rating scales, checklists, open-ended responses
• Observations (performance, skills): tasks
(varying degrees of authenticity), problems, real-
world behaviors, records (documents)
22. Types of Reliability
• Stability (produces the same results with repeated measurements
over time):
– Test-retest
– Correlation between scores at 2 times
• Equivalence/Internal Consistency (produces same results with
parallel items on alternate forms):
– Alternate forms; split-half; Kuder-Richardson; Cronbach's alpha
– Correlation between scores on different forms; calculate
coefficient alpha (α)
• Consistency (produces the same results with different observers or
raters):
– Inter-rater agreement
– Correlation between scores from different raters; kappa
coefficient (a worked sketch of these three coefficients follows below)
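To make these coefficients concrete, here is a minimal Python sketch. The score arrays are invented for illustration, and Cronbach's alpha and Cohen's kappa are computed directly from their standard formulas with numpy rather than from any measurement library:

    import numpy as np

    # Stability: test-retest reliability as the correlation between two occasions
    time1 = np.array([70, 82, 65, 90, 75])
    time2 = np.array([72, 80, 68, 88, 77])
    r_test_retest = np.corrcoef(time1, time2)[0, 1]

    # Internal consistency: Cronbach's alpha over an examinee-by-item score matrix
    items = np.array([[1, 0, 1, 1],
                      [1, 1, 1, 1],
                      [0, 0, 0, 0],
                      [1, 1, 1, 1],
                      [1, 0, 1, 0]])
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores
    alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Consistency: Cohen's kappa for two raters assigning categorical ratings
    rater1 = np.array([1, 2, 2, 3, 1])
    rater2 = np.array([1, 2, 3, 3, 1])
    p_observed = np.mean(rater1 == rater2)          # observed agreement
    categories = np.union1d(rater1, rater2)
    p_chance = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)
    kappa = (p_observed - p_chance) / (1 - p_chance)

    print(round(r_test_retest, 2), round(alpha, 2), round(kappa, 2))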
23. Validity
• Refers to the accuracy of inferences based on
data obtained from measurement
• Technically, measures aren’t valid, inferences
are
• No such thing as validity in the abstract: the key
issue is ‘valid’ for what inference
• Want to reduce systematic, non-random error
• Unreliability lowers correlations, reducing validity
claims
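One way to see the last point is Spearman's correction for attenuation (standard psychometric notation, not from the slides): if $r_{xy}$ is the observed correlation between two measures and $r_{xx}$ and $r_{yy}$ are their reliabilities, the correlation corrected for unreliability is

$$ r_{x_t y_t} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}} $$

For example, an observed validity coefficient of 0.40 between two measures that each have reliability 0.64 implies a disattenuated correlation of 0.40 / 0.64 ≈ 0.63; unreliability alone can make a strong relationship look weak.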
24. Conventional View of Validity
• Face validity: logical link between items and purpose—
makes sense on the surface
• Content validity: items cover the range of meaning
included in the construct or domain. Expert judgment
• Criterion validity: relationship between performance on
one measurement and performance on another (or
actual behavior). Two forms: concurrent and predictive.
Assessed with correlation coefficients
• Construct validity: directly connect measurement with
theory. Allows interpretation of empirical evidence in
terms of theoretical relationships. Based on weight of
evidence. Convergent and discriminant evidence.
Multitrait-Multimethod Analysis (MTMM; see the sketch below)
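To illustrate what convergent and discriminant evidence look like in an MTMM-style analysis, here is a small Python sketch; the constructs, methods, and all numbers are hypothetical:

    import numpy as np

    # Hypothetical scores for 6 students: two constructs, each measured two ways
    ps_exam  = np.array([55, 78, 62, 90, 70, 66])        # problem solving via exam
    ps_obs   = np.array([58, 75, 60, 88, 73, 64])        # problem solving via observation
    mot_surv = np.array([4.1, 2.5, 3.8, 3.0, 2.2, 4.4])  # motivation via survey
    mot_int  = np.array([4.0, 2.7, 3.6, 3.2, 2.0, 4.5])  # motivation via interview

    def r(a, b):
        return np.corrcoef(a, b)[0, 1]

    # Convergent evidence: same trait, different methods -> should be high
    print("problem solving, exam vs. observation:", round(r(ps_exam, ps_obs), 2))
    print("motivation, survey vs. interview:", round(r(mot_surv, mot_int), 2))

    # Discriminant evidence: different traits -> should be noticeably lower
    print("problem solving vs. motivation:", round(r(ps_exam, mot_surv), 2))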
25. Unified View of Construct Validity
(Messick S, Amer Psych, 1995)
• Validity is not a property of an instrument but rather of
the meaning of the scores. Must be considered
holistically.
• 6 Aspects of Construct Validity Evidence
– Content—content relevance & representativeness
– Substantive—theoretical rationale for observed consistencies in
test responses
– Structural—fidelity of scoring structure to structure of construct
domain
– Generalizability—generalization to the population and across
populations
– External—convergent and discriminant evidence
– Consequential—intended and unintended consequences of
score interpretation; social consequence of assessment
(fairness, justice)
26. Finding Measurement Instruments
• Scan the engineering education literature (obviously)
• Email engineering ed researchers (use the network)
• Examine literature for instruments used in prior studies
• General education/social science instrument databases
– Buros Institute of Mental Measurements (Mental
Measurements Yearbook, Tests in Print)
http://buros.unl.edu/buros/jsp/search.jsp
– ERIC databases http://www.eric.ed.gov/
– Educational Testing Service Test Collection
http://www.ets.org/testcoll/index.html
• Construct your own (last resort!)
– Get some expert consultation (test writing, survey
design, questionnaire construction, etc.)
27. Example
• In your groups, analyze the Steif & Dantzler
statics concept inventory article. Look for:
– Theoretical framework
– Constructs used in the study
– How constructs were operationalized
– Measurement process
• Attention to reliability and validity
28. References
• Campbell DT, Stanley JC. Experimental and quasi-
experimental designs for research. Chicago: Rand
McNally; 1969.
• Cook TD, Campbell DT. Quasi-experimentation: design and
analysis issues for field settings. Chicago: Rand McNally; 1979.
• Messick S. Validity of psychological assessment:
validation of inferences from persons' responses and
performances as scientific inquiry into score meaning.
American Psychologist. 1995;50:741-749.
• Messick S. Validity. In: Linn RL, ed. Educational
measurement. 3rd ed. New York: American Council on
Education & Macmillan; 1989:13-103.
Editor's notes
90 minute session
Steif analysis = 40 min?
Learning (cognitive theory, constructivist theory, social cognitive, plus some current interesting things derived from each, like expert-novice differences, transfer issues, ?)
Motivation (probably goal theory, self-efficacy, expectancy value, self-determination, maybe something on negative motivation like anxiety)
Developmental (probably cognitive development a la Perry, epistemological development, Baxter-Magolda, etc.) Individual differences (prior knowledge, development, motivation, strategy repertoires and self-regulation, styles, etc.)
Highlight 3. Methods as the item for this session - how we get the data to permit ‘direct investigation’
Also relevant to 1. “Empirically”
A study design consists of decisions about several issues and the arrangement or timing of events in the study.
What you are measuring stems quite directly from the hypothesis or research question, which identifies the outcome or phenomenon of interest (learning or time use or cost, etc.). We'll address this in more detail in the next topic.
The selection and assignment of participants also should follow from the hypothesis, but frequently, we do research on ‘convenience samples’ of whatever students we can get access to, whether they are appropriate or not.
The intervention has to be defined quite clearly, both in terms of activities and timing. This is particularly true for complex educational interventions. Going back to our videotaped lecture example, we need to define whether the intervention is defined as access to videotapes of all lectures, access to those of a specific course, or to that of a specific lecture.
The sequence and timing of measurements and intervention(s) is another critical decision. Measuring outcomes immediately after the intervention is most likely to show an impact, but a delayed measurement will more accurately assess how lasting the impact might be. You can, of course, do multiple measurements at various times, but all these need to be defined as part of the study design.
The whole issue of randomization is the other problem that plagues most medical education studies.
The most common, and often unrecognized, manifestation of this is in the selection of students for the study. Not only are medical students a highly (and nonrandomly) selected population to begin with, but our studies often take students who self-select for specific educational activities or elect to participate or not participate on a non-random basis.
The other problem with randomization is the one I just mentioned in the previous slide - that of non-random assignment of students. In our videotape example, we have the problem of students self-selecting to view the videotapes or not. It is feasible to imagine random assignment of students to view the tapes or not, but that creates ethical as well as pragmatic problems.
Too many education researchers content themselves with a simple description of a program or an intervention or an observation, supported by some data collected from one group of students at one point in time. While this kind of research provides some useful information, the absence of a comparison group prevents us from being able to fully interpret the value of the intervention. We need to compare these results to SOMETHING and the better the quality of that ‘something,’ the better the study design.
One fairly simple comparison group is the same students prior to the intervention. Although this isn‘t the strongest design, it is better than nothing and often feasible to do.
Another design would be to find a comparison group that, while not entirely equivalent to the intervention group, serves as a useful point of reference. An example of this would be to compare the intervention students to students at the same point in the curriculum from previous years. We don't know all the ways in which the two cohorts might differ, besides the intervention, so it isn't problem-free, but again, it provides a useful comparison.
The best design would be to randomly assign students to the control and intervention conditions. While scientifically strong, it is seldom pragmatically feasible.
Strengths
Useful in exploring new problems
Developing ideas or devices
Weaknesses
No control and no internal validity
No ability to make comparisons (conclusions can only be impressionistic or imprecise; using historical or standardized populations as comparisons is unwise)
Strengths
No effect of pretesting
Useful when pretests are unavailable, inconvenient or too expensive
Also, useful when participant anonymity must be maintained
Weaknesses
No ability to measure the effect of the intervention (treatment)
Controls for, but cannot estimate, the effects of maturation and history
Possible selection differences (groups could be different in some fundamental way)
Reactive effects of experimental procedures?
Strengths
Compares the performance of the same group
Controls for selection (if same participants)
Controls for mortality (if same participants)
Weaknesses
No assurance that the intervention is the only factor in the difference between O1 and O2
Threats to validity
History
Maturation
Testing effects
Statistical regression (for extreme groups)
Reactive effects of experimental procedures?
Strengths
Good internal validity
Control groups allow us to estimate the effects of
History
Maturation
Testing effects
Controls mortality effects (by checking pre and post measures)
Weaknesses
Possible selection differences (groups could be different in some fundamental way)
Reactive effects of experimental procedures?
Theory—Conceptualization by specifying precisely what we mean by a term (e.g., learning, expertise, socialization, motivation, etc.)
Constructs: theoretical creations based on observations but which cannot be observed directly or indirectly. Hypothetical; abstract, defined concepts. Created by scientists. Come from theory. E.g. learning, problem solving, critical thinking, cognitive development, attribution, locus of control.
Operational definition: spells out precisely how the concept will be measured - what the variables are. A description of the operations that will be used to measure the concept. In education, these typically depend on some behavior on the part of the learners - answering questions on a survey, making presentations, solving problems, working in groups, etc. It must be observable [remember - "empirical"]
Measurement: this critical step is central to qualitative and quantitative research. It is more apparent in quantitative research, but the issues, challenges, and decisions are analogous. We will focus on quantitative applications and examples, but keep in mind that the principles also apply to qualitative research methods.
So we will spend our session today looking at principles of educational measurement.
Scenario: You've noticed that students vary considerably in how they react to feedback in the form of grades or written evaluations. Some take any criticism as a personal attack whereas others seem to be immune to any efforts you make to tell them they need to improve their performance. Like a good educational researcher, you investigate what the literature has to say on the matter and stumble across a theoretical framework called "Attribution Theory" that seems relevant.
Describe attribution theory
Examples of attributions: driving and someone blows their horn at you or flips you the finger - intrinsic or extrinsic attribution - my problem or his?
Golf shots: good ones are due to my ability, bad ones are due to luck
The constructs are generally internal, especially in constructivist and cognitive theoretical frameworks. Behaviorism is attractive in that these internal states don't matter.
Examples: Stability - administer your final exam in thermodynamics on the last day of class and re-administer it a day later to the same people. You would expect the results to be the same. If you did this on the first day of class and again on the last, you'd expect the scores to change.
Equivalence - two bathroom scales should give you the same weight in the morning. Two versions of final exam that test the same content should as well.
Internal consistency - to what extent are all the items on the exam measuring the same construct - thermodynamics. If some are on thermodynamics and others on hydrodynamics, the test is not internally consistent and you should derive two scores from it rather than one.
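A small Python sketch of that last point, with invented item responses: a test that mixes two constructs yields a low alpha overall, while each homogeneous subscale is more consistent, which is the signal to report two scores:

    import numpy as np

    def cronbach_alpha(items):
        # items: examinee-by-item score matrix
        k = items.shape[1]
        return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                                / items.sum(axis=1).var(ddof=1))

    # First 3 items thermodynamics, last 3 hydrodynamics; examinees strong
    # in one topic are not necessarily strong in the other
    thermo = np.array([[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1]])
    hydro  = np.array([[0, 0, 1], [1, 1, 1], [1, 1, 1], [0, 1, 0], [0, 0, 0]])
    mixed  = np.hstack([thermo, hydro])

    print(round(cronbach_alpha(mixed), 2))   # mixed test: low alpha (~0.26)
    print(round(cronbach_alpha(thermo), 2))  # thermodynamics items alone (~0.71)
    print(round(cronbach_alpha(hydro), 2))   # hydrodynamics items alone (~0.75)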
Content: determining boundaries of the construct domain. Determining the knowledge, skills, attitudes, motives and other attributes to be revealed by the measurement tasks. Addressed by means of job analysis, task analysis, curriculum analysis, domain theory. Must also attend to the representativeness of the tasks selected for assessment.
Substantive: Emphasizes role of substantive theories and process modeling in identifying the domain processes to be revealed in assessment tasks. Derived from think-aloud protocols, correlation patterns among part scores, modeling of task performance.
Structural: Theory should not only guide selection of relevant tasks (substantive) but also the development of scoring criteria and rubrics.
Generalizability: Interpretations not limited to the sample of assessed tasks but be broadly generalizable to the construct domain.
External—MTMM
Consequential: Social and value-related issues. Should accrue evidence of purported positive consequences. Primary issue is that any negative impact should not be derived from any source of test invalidity.
Debrief in general asking for volunteers to comment on each of the four dimensions. Theory should be challenging in the sense that it is not apparent in the article.