Background
• For more than two decades, the MRA Division of MI has
used the many-facet Rasch model (MFRM) for the
analysis of client performance assessments.
• Using MFRM for performance assessments offers
benefits relating to measurement, fairness,
administration, resources, and security.
Other Session Objectives
• Understand the psychometric properties of a many-
facet Rasch measurement approach to performance
assessment scoring
• Understand how stakeholder concerns regarding an
MFRM approach can be addressed
• Understand the basics of setting up a performance
assessment for an MFRM analysis, as well as setting a
passing standard and equating that standard
Why Performance Assessments?
• Performance assessments complement written
examinations, allowing testing organizations to assess
candidates on higher-level decision-making abilities.
• Our clients often use performance assessments to
measure candidates’ abilities to apply skills such as
diagnosis, treatment, and management of complications
in a clinical context, replicating real-world patient
situations.
Performance Assessment Format
• Examiners rate candidate performance on
standardized protocols (i.e., hypothetical patient
scenarios) or on candidates’ actual patients.
• Candidates describe how they would diagnose and
treat.
• Other permutations of this format are also used.
• The methods we use for organizing and analyzing such
performance assessment can be used in different ways
and in different fields.
Benefits of MFRM: Fairness
• Different examiners have different levels of severity
when they assign ratings to candidates.
• If candidate outcomes were determined based on raw
scores alone, the severity of individual examiners could
differentially affect candidates.
• MFRM allows for the severity of individual examiners to
be accounted for before candidate scores are
calculated.
Benefits of MFRM: Security
• Using an MFRM approach, different exam content (i.e.,
patient scenarios) can be used for different candidates,
reducing the likelihood that candidates will be able to
accurately disclose to other candidates information
about the exam content.
Benefits of MFRM: Fairness
• However, different exam content logically would have
different levels of difficulty.
• If candidate outcomes were determined using raw
scores, this differential difficulty could unfairly penalize
or benefit individual candidates.
• But calculating candidate scores using the MFRM takes
into account differences in the particular exam content
that different candidates are tested on.
Structuring a Performance Assessment
for MFRM Analysis
• Examiners interview candidates regarding exam
material such as a standardized patient.
• Examiners lead the discussion, asking pointed
questions about how candidates would do such things
as diagnose patients’ illnesses, treat patients, and
manage complications.
Structuring a Performance Assessment
• Examiners use a rating scale, typically with four points
on it, to assign ratings to candidates’ responses.
• Candidates rotate between examiners who assess them
on different protocols.
• Ratings are assigned to specific skills related to the
exam materials, such as diagnosis, treatment, and
management of complications.
• Therefore, you have examiners rating candidates’
performance on skills within protocols.
Linking Facet Elements
• To quantify and account for differences in the severity of
individual examiners and the difficulty of individual
protocols, the performance assessment must be
carefully structured so that there is overlap of
examiners’ ratings on candidates and protocols.
• This overlap links the different facet elements and allows
for differences between individual elements to be
quantified and accounted for.
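The linkage requirement above can be checked mechanically: every candidate, examiner, and protocol must belong to one connected network of shared ratings. A minimal sketch, assuming a hypothetical judging plan of (candidate, examiner, protocol) triples (all names invented for illustration):

```python
# Hypothetical judging plan: each tuple is (candidate, examiner, protocol).
# The names and the plan itself are illustrative, not from the presentation.
ratings = [
    ("cand1", "examA", "prot1"), ("cand1", "examB", "prot2"),
    ("cand2", "examB", "prot1"), ("cand2", "examC", "prot3"),
    ("cand3", "examA", "prot3"), ("cand3", "examC", "prot2"),
]

def is_linked(ratings):
    """Check that all facet elements form one connected network, so that
    severity and difficulty differences can be estimated on a common
    scale (a simple union-find connectivity test)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for cand, exam, prot in ratings:
        union(("C", cand), ("E", exam))
        union(("E", exam), ("P", prot))
    roots = {find(node) for node in list(parent)}
    return len(roots) == 1

print(is_linked(ratings))  # True: every element is reachable via shared ratings
```

If the plan splits into disconnected subsets (e.g., two examiner teams who never rate a common candidate or protocol), the elements cannot be placed on one scale and the check returns False.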
Benefits: Resources and Administration
• No adjustments necessary if all candidates perform the
same skills on all protocols and are evaluated by the
same examiners.
• Reality: This is usually too expensive or logistically
impossible.
• MFRM: Candidates interact with some examiners on
selected protocols; each candidate takes a parallel
examination form.
Benefits: Resources and Administration
• The differences and biases in each of these examination
forms must be accounted for to make the candidate
ability estimates reasonably consistent, objective and
reproducible.
• Organizing a PA this way also affords benefits in terms
of the resources required to conduct the PA and the
administration of the PA.
Benefits: Resources and Administration
• Candidates move through several pairs of examiners
who assess them on several protocols.
• A lot of performance information is collected efficiently
as several candidates are assessed simultaneously.
Benefits: Resources and Administration
• Like the regular Rasch model with only two facets
(persons and items), the MFRM produces candidate
ability estimates of known precision (error) and
reproducibility (reliability).
• Testing organizations can scale their performance
assessments so that they achieve the measurement
precision and reliability they desire with the resources
(time for administration, number of examiners) that they
have available.
Psychometric Model

log(Pnmijk / Pnmij(k−1)) = Bn − Sm − Ci − Dj − Fk

where:
• Pnmijk = probability of person n being rated in category k by
examiner m on skill j in protocol i,
• Pnmij(k−1) = probability of person n being rated in category
(k − 1) by examiner m on skill j in protocol i,
• Bn = the ability of candidate n,
• Sm = the severity of examiner m,
• Ci = the difficulty of protocol i,
• Dj = the difficulty of skill j, and
• Fk = the difficulty of the step up from category (k − 1) to
category k.
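The category probabilities implied by this model can be computed directly from the facet parameters. A minimal numerical sketch; the parameter values below are invented for illustration, not taken from any real calibration:

```python
import math

def category_probs(B_n, S_m, C_i, D_j, steps):
    """Category probabilities under the many-facet Rasch (rating scale)
    model. `steps` is the list of step difficulties F_1..F_K (F_0 is
    implicitly 0). Returns P for categories 0..K, which sum to 1."""
    # Cumulative log-odds of each category relative to category 0
    logits = [0.0]
    total = 0.0
    for F_k in steps:
        total += B_n - S_m - C_i - D_j - F_k
        logits.append(total)
    exps = [math.exp(x) for x in logits]
    denom = sum(exps)
    return [e / denom for e in exps]

# An able candidate (B = 1.5) with an average examiner and protocol
# (S = C = 0) on a skill of difficulty 0.5, with illustrative step
# difficulties at -1, 0, +1 on a four-point scale:
probs = category_probs(1.5, 0.0, 0.0, 0.5, [-1.0, 0.0, 1.0])
print([round(p, 2) for p in probs])  # probabilities for categories 0..3
```

Note how the probabilities shift toward the higher categories as the candidate's ability rises relative to the adjusted difficulty.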
Psychometric Model
• Probability of a performance: A function of the
difference between candidate ability and skill difficulty,
after adjustment for the severity of the examiner and the
difficulty of the protocol.
• If after adjustment, candidate's ability is higher, then the
probability of an acceptable performance is greater
than 50%.
• If after adjustment, skill difficulty is greater than the
ability of the candidate, the probability of achieving an
acceptable performance is less than 50%.
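In the dichotomous (acceptable/unacceptable) case, this 50% threshold falls directly out of the logistic form of the model. A small sketch with illustrative parameter values:

```python
import math

def p_acceptable(B_n, S_m, C_i, D_j):
    """Probability of an acceptable performance in the dichotomous case:
    a logistic function of the ability-difficulty difference after
    adjusting for examiner severity and protocol difficulty.
    All parameter values used below are illustrative."""
    return 1.0 / (1.0 + math.exp(-(B_n - S_m - C_i - D_j)))

# Adjusted ability exceeds skill difficulty -> probability above 50%
print(p_acceptable(1.0, 0.2, 0.1, 0.3) > 0.5)  # True

# Skill difficulty exceeds adjusted ability -> probability below 50%
print(p_acceptable(0.0, 0.2, 0.1, 0.3) > 0.5)  # False
```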
Psychometric Model:
Ordering Facet Elements
• Ordering of the candidates, examiners, protocols, and
skills on a linear scale provides a frame of reference for
understanding the relationship of the facets of the PA:
• Candidate ability (Bn) from highest to lowest
• Skill difficulty (Dj) from most to least difficult
• Examiner severity (Sm) from most to least severe
• Protocol difficulty (Ci) from most to least difficult.
Psychometric Model:
Sums of Ratings
• Ratings given by examiners are the basic units of
analysis.
• Skill difficulty is calculated from all ratings given to all
candidates by all examiners on the skill.
• Protocol difficulty includes all ratings given to all
candidates by all examiners on the protocol.
• Examiner severity includes the ratings given by the
examiner on all skills across all protocols to all
candidates encountered.
Psychometric Model: Logits
• Estimates are based on probability of performance
given the nature of the facets of the examination
encountered by a candidate.
• Log odds units or logits are used to construct an equal
interval scale.
• All facet element calibration estimates (candidate ability,
examiner severity, skill and/or protocol difficulty) are
reported in logits, with a mean of zero.
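Fixing each facet's mean at zero logits is a simple linear shift of the calibrations. A sketch with made-up examiner severity estimates (the numbers are purely illustrative):

```python
# Centering facet calibrations so that the facet's estimates average to
# zero logits. The raw estimates below are invented for illustration.
raw_severities = [0.8, -0.1, 0.5, -0.4]  # hypothetical examiner severities

mean = sum(raw_severities) / len(raw_severities)
centered = [round(s - mean, 2) for s in raw_severities]

print(centered)                    # [0.6, -0.3, 0.3, -0.6]
print(abs(sum(centered)) < 1e-9)   # True: mean is now fixed at zero
```

Because logits form an equal-interval scale, this shift changes only the origin, not the differences between examiners.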
Psychometric Model: Fit
• The consistency of the ratings across examiners, skills,
and protocols is reported as the fit of the data to the
model; fit statistics flag inconsistent rating patterns on
any of the facets.
• Model expects observed ratings to be consistent:
• More able candidates should earn higher ratings more
frequently than less able candidates from all examiners
on skills within the protocols.
• More difficult skills and protocols should elicit lower
ratings more frequently than easier skills and protocols,
from all examiners.
Psychometric Model: Fit
• Fit statistic is the ratio of the observed rating to the
expected (modeled) rating
• 1 is perfect fit; range of acceptable fit is generally 0.5 to
1.5, although more stringent criteria have been
suggested for high-stakes examinations.
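One common way such a mean-square fit statistic is computed is as the average squared standardized residual: the observed rating minus the model-expected rating, divided by the model variance of the observation. This is a simplified, unweighted sketch with made-up numbers, not the exact formulation of any particular operational program:

```python
def mean_square_fit(observed, expected, variances):
    """Unweighted (outfit-style) mean-square fit: the average squared
    standardized residual. 1.0 indicates ratings close to model
    expectations; values outside roughly 0.5-1.5 flag misfit."""
    z2 = [(o - e) ** 2 / v for o, e, v in zip(observed, expected, variances)]
    return sum(z2) / len(z2)

# Made-up data for one examiner: observed ratings, the model-expected
# ratings, and the model variance for each observation.
obs = [3, 2, 4, 1, 3]
exp = [2.2, 2.9, 3.3, 1.8, 2.4]
var = [0.8, 0.7, 0.6, 0.8, 0.9]

fit = mean_square_fit(obs, exp, var)
print(0.5 <= fit <= 1.5)  # True: this examiner's ratings fit acceptably
```

An examiner whose ratings swing far from the model's expectations would produce large residuals and a mean square well above 1.5.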
Fit Statistic: Examiners
• The fit statistics for examiners indicate the degree to
which each examiner is internally consistent across
candidates, skills, and protocols (intra-examiner
consistency).
• The fit statistic allows examiners who award
unexpectedly high or low ratings to some candidates on
some skills or protocols to be identified.
Fit Statistic: Candidates, Protocols and
Skills
• The fit statistic for each candidate, protocol and skill
indicates inter-examiner consistency.
• Misfit indicates that some examiners deviated
significantly from others when grading the skill or
protocol for some candidates.
• This information is useful for testing organizations to
monitor, and, if necessary, conduct additional analysis
to identify which rating situations are producing the
large unexpected ratings.
Guidelines for Implementing an MFRM PA
• Development of the rating scale is critical
• Allows for a “disciplined dialogue” among examiners
about candidate performance
• Rating scale example: Unacceptable, Deficient,
Acceptable and Excellent
• Defining these terms and providing specific examples
of candidate performance for each scale point is
essential
Thank You
If you have any questions, contact
rbrown2@measinc.com
Please complete the session evaluation that
has been distributed to you.
Editor's notes
Thus, all estimates are derived from sums of ratings, so a careful system of overlap must be included in the examination to ensure an accurate representation of the relationship of the facets of the examination.
Estimates can range from 0 to ±4, usually with the scale set up so that high positive values indicate more severe, more able, or more difficult, and high negative values indicate more lenient, less able, or easier.