1
Producing Unbiased Performance
Assessment Scores Using the Many-
Facet Rasch Model
Ross Brown, Ph.D.
Measurement Incorporated
2
Background
• For more than two decades, the MRA Division of MI has
used the many-facet Rasch model (MFRM) for the
analysis of client performance assessments.
• Using MFRM for performance assessments offers
benefits relating to measurement, fairness,
administration, resources, and security.
3
Other Session Objectives
• Understand the psychometric properties of a many-
facet Rasch measurement approach to performance
assessment scoring
• Understand how stakeholder concerns regarding a
MFRM approach can be addressed
• Understand the basics of setting up a performance
assessment for a MFRM analysis, as well as setting a
passing standard and equating that standard
4
Why Performance Assessments?
• Performance assessments complement written
examinations, allowing testing organizations to assess
candidates on higher-level decision-making abilities.
• Our clients often use performance assessments to
measure candidates’ abilities to apply skills such as
diagnosis, treatment, and management of complications
in a clinical context, replicating real-world patient
situations.
5
Performance Assessment Format
• Examiners rating candidate performance on
standardized protocols, i.e., hypothetical patient
scenarios, or candidates’ actual patients
• Candidates describe how they would diagnose and
treat.
• Other permutations of this format are also used.
• The methods we use for organizing and analyzing such
performance assessments can be used in different ways
and in different fields.
6
Benefits of MFRM: Fairness
• Different examiners have different levels of severity
when they assign ratings to candidates.
• If candidate outcomes were determined based on raw
scores alone, the severity of individual examiners could
differentially affect candidates.
• MFRM allows for the severity of individual examiners to
be accounted for before candidate scores are
calculated.
7
Benefits of MFRM: Security
• Using a MFRM approach, different exam content (i.e.,
patient scenarios) can be used for different candidates,
reducing the likelihood that candidates will be able to
accurately disclose to other candidates information
about the exam content.
8
Benefits of MFRM: Fairness
• However, different exam content logically would have
different levels of difficulty.
• If candidate outcomes were determined using raw
scores, this differential difficulty could unfairly penalize
or benefit individual candidates.
• But calculating candidate scores using the MFRM takes
into account differences in the particular exam content
that different candidates are tested on.
9
Structuring a Performance Assessment
for MFRM Analysis
• Examiners interview candidates regarding exam
material, such as a standardized patient.
• Examiners lead the discussion, asking pointed
questions about how candidates would do such things
as diagnose patients’ illnesses, treat patients, and
manage complications.
10
Structuring a Performance Assessment
• Examiners use a rating scale, typically with four points
on it, to assign ratings to candidates’ responses.
• Candidates rotate between examiners who assess them
on different protocols.
• Ratings are assigned to specific skills related to the
exam materials, such as diagnosis, treatment, and
management of complications.
• Therefore, you have examiners rating candidates’
performance on skills within protocols.
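To make this data layout concrete, here is a minimal sketch (in Python, with invented names and scores, not the author’s actual scoring system) in which every rating is one record that indexes all four facets.

```python
from dataclasses import dataclass

@dataclass
class Rating:
    """One observation: an examiner's rating of one candidate
    on one skill within one protocol."""
    candidate: str
    examiner: str
    protocol: str
    skill: str
    score: int  # e.g., 1-4 on a four-point rating scale

# Hypothetical observations from one rotation
ratings = [
    Rating("C01", "EX1", "ProtocolA", "Diagnosis", 3),
    Rating("C01", "EX1", "ProtocolA", "Treatment", 4),
    Rating("C01", "EX2", "ProtocolB", "Diagnosis", 2),
    Rating("C02", "EX2", "ProtocolB", "Complications", 3),
]
```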
11
Linking Facet Elements
Facets of the performance assessment:
• Candidates
• Examiners
• Protocols
• Skills
12
Linking Facet Elements
• To quantify and account for differences in the severity of
individual examiners and the difficulty of individual
protocols, the performance assessment must be
carefully structured so that there is overlap of
examiners’ ratings on candidates and protocols.
• This overlap links the different facet elements and allows
for differences between individual elements to be
quantified and accounted for.
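One practical way to verify this overlap is to check that the rating design forms a single connected network across all facet elements. The sketch below is a generic illustration, assuming observations recorded as (candidate, examiner, protocol, skill) tuples; it is not the author’s linking procedure.

```python
def is_linked(observations):
    """Check that all facet elements form one connected network.
    Each observation is a (candidate, examiner, protocol, skill) tuple."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for cand, exam, prot, skill in observations:
        anchor = ("cand", cand)
        for node in [("exam", exam), ("prot", prot), ("skill", skill)]:
            union(anchor, node)

    return len({find(x) for x in parent}) <= 1

# Hypothetical design: examiner EX2 rates candidates from both halves,
# linking the otherwise separate parts of the design.
design = [
    ("C01", "EX1", "ProtocolA", "Diagnosis"),
    ("C01", "EX2", "ProtocolB", "Treatment"),
    ("C02", "EX2", "ProtocolC", "Diagnosis"),
    ("C02", "EX3", "ProtocolD", "Treatment"),
]
print(is_linked(design))  # True: EX2 connects both halves
```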
13
Benefits: Resources and Administration
• No adjustments necessary if all candidates perform the
same skills on all protocols and are evaluated by the
same examiners.
• Reality: This is usually too expensive or logistically
impossible.
• MFRM: Candidates interact with some examiners on
selected protocols; each candidate takes a parallel
examination form.
14
Benefits: Resources and Administration
• The differences and biases in each of these examination
forms must be accounted for to make the candidate
ability estimates reasonably consistent, objective and
reproducible.
• Organizing a PA this way also affords benefits in terms
of the resources required to conduct the PA and the
administration of the PA.
15
Benefits: Resources and Administration
• Candidates move through several pairs of examiners
who assess them on several protocols.
• A lot of performance information is collected efficiently
as several candidates are assessed simultaneously.
16
Benefits: Resources and Administration
• Like the regular Rasch model with only two facets
(persons and items), the MFRM produces candidate
ability estimates of known precision (error) and
reproducibility (reliability).
• Testing organizations can scale their performance
assessments so that they achieve the measurement
precision and reliability they desire with the resources
(time for administration, number of examiners) that they
have available.
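As an illustration of how error and reliability relate, the sketch below computes a Rasch-style separation reliability for candidate ability estimates from their standard errors; the numbers are invented.

```python
from statistics import mean, pvariance

def separation_reliability(abilities, standard_errors):
    """Rasch-style separation reliability: the proportion of observed
    variance in ability estimates that is not measurement error."""
    observed_var = pvariance(abilities)
    error_var = mean(se ** 2 for se in standard_errors)
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var if observed_var > 0 else 0.0

# Invented example: smaller standard errors -> higher reliability
abilities = [1.2, 0.4, -0.3, 0.9, -1.1, 0.0]
ses       = [0.35, 0.40, 0.38, 0.36, 0.42, 0.39]
print(round(separation_reliability(abilities, ses), 2))  # ~0.75
```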
17
Psychometric Model

ln(Pnmijk / Pnmij(k-1)) = Bn – Sm – Ci – Dj – Fk

where:
Pnmijk = probability of person n being rated in category k by
examiner m on skill j in protocol i,
Pnmij(k-1) = probability of person n being rated in category
(k – 1) by examiner m on skill j in protocol i,
Bn = the ability of candidate n,
Sm = the severity of examiner m,
Ci = the difficulty of protocol i,
Dj = the difficulty of skill j, and
Fk = the difficulty of the step up from category (k – 1) to
category k.
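A minimal sketch of how category probabilities follow from this model, assuming the rating-scale form shown above; the parameter values are invented for illustration.

```python
import math

def category_probabilities(B_n, S_m, C_i, D_j, F):
    """Category probabilities under the rating-scale form of the MFRM:
    ln(P_k / P_{k-1}) = B_n - S_m - C_i - D_j - F_k.
    F is the list of step difficulties [F_1, ..., F_K];
    category 0 is the lowest rating-scale category."""
    adjusted = B_n - S_m - C_i - D_j
    # Cumulative log-numerators; category 0 has log-numerator 0
    log_num = [0.0]
    for F_k in F:
        log_num.append(log_num[-1] + adjusted - F_k)
    total = sum(math.exp(v) for v in log_num)
    return [math.exp(v) / total for v in log_num]

# Hypothetical logit values for one candidate, examiner, protocol,
# and skill on a four-point rating scale (three steps)
probs = category_probabilities(B_n=0.5, S_m=0.1, C_i=-0.2, D_j=0.3,
                               F=[-1.0, 0.0, 1.0])
print([round(p, 2) for p in probs])  # probabilities for categories 0-3
```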
18
Psychometric Model
• Probability of a performance: A function of the
difference between candidate ability and skill difficulty,
after adjustment for the severity of the examiner and the
difficulty of the protocol.
• If, after adjustment, the candidate's ability is higher than
the skill difficulty, the probability of an acceptable
performance is greater than 50%.
• If after adjustment, skill difficulty is greater than the
ability of the candidate, the probability of achieving an
acceptable performance is less than 50%.
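For intuition, the dichotomous simplification below shows how the adjusted ability-minus-difficulty difference (in logits) maps to the probability of an acceptable performance; it is a generic illustration, not tied to any particular exam.

```python
import math

def p_acceptable(adjusted_difference):
    """Dichotomous simplification: probability of an acceptable
    performance given (candidate ability - skill difficulty) in logits,
    after adjusting for examiner severity and protocol difficulty."""
    return 1.0 / (1.0 + math.exp(-adjusted_difference))

for diff in (-2, -1, 0, 1, 2):
    print(f"{diff:+d} logits -> {p_acceptable(diff):.0%}")
# -2 -> 12%, -1 -> 27%, 0 -> 50%, +1 -> 73%, +2 -> 88%
```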
19
Psychometric Model:
Ordering Facet Elements
• Ordering of the candidates, examiners, protocols, and
skills on a linear scale provides a frame of reference for
understanding the relationship of the facets of the PA:
• Candidate ability (Bn) from highest to lowest
• Skill difficulty (Dj) from most to least difficult
• Examiner severity (Sm) from most to least severe
• Protocol difficulty (Ci) from most to least difficult.
20
Psychometric Model:
Sums of Ratings
• Ratings given by examiners are the basic units of
analysis.
• Skill difficulty is calculated from all ratings given to all
candidates by all examiners on the skill.
• Protocol difficulty includes all ratings given to all
candidates by all examiners on the protocol.
• Examiner severity includes the ratings given by the
examiner on all skills across all protocols to all
candidates encountered.
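A toy illustration of which ratings feed each element’s calibration, assuming observations stored as (candidate, examiner, protocol, skill, score) tuples; the element names are hypothetical.

```python
from collections import defaultdict

def rating_sums(observations):
    """Sum of ratings contributing to each facet element's estimate.
    Each observation is (candidate, examiner, protocol, skill, score)."""
    sums = defaultdict(lambda: defaultdict(int))
    for cand, exam, prot, skill, score in observations:
        sums["candidate"][cand] += score
        sums["examiner"][exam] += score
        sums["protocol"][prot] += score
        sums["skill"][skill] += score
    return sums

obs = [
    ("C01", "EX1", "ProtocolA", "Diagnosis", 3),
    ("C01", "EX2", "ProtocolB", "Treatment", 4),
    ("C02", "EX2", "ProtocolB", "Diagnosis", 2),
]
print(dict(rating_sums(obs)["examiner"]))  # {'EX1': 3, 'EX2': 6}
```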
21
Psychometric Model: Logits
• Estimates are based on probability of performance
given the nature of the facets of the examination
encountered by a candidate.
• Log odds units or logits are used to construct an equal
interval scale.
• All facet element calibration estimates (candidate ability,
examiner severity, skill and/or protocol difficulty) are
reported in logits, with a mean of zero.
22
Psychometric Model:
Measurement Statistics
• Error
• Reliability
• Fit
23
Psychometric Model: Fit
• Estimates of the consistency of the ratings across
examiners, skills, and protocols, reported as the fit of
the data to the model. Fit statistics indicate inconsistent
rating patterns on any of the facets.
• Model expects observed ratings to be consistent:
• More able candidates should earn higher ratings more
frequently than less able candidates from all examiners
on skills within the protocols.
• More difficult skills and protocols should receive lower
ratings more frequently than easier skills and protocols,
from all examiners.
24
Psychometric Model: Fit
• Fit statistic is the ratio of the observed rating to the
expected (modeled) rating
• 1 is perfect fit; range of acceptable fit is generally 0.5 to
1.5, although more stringent criteria have been
suggested for high-stakes examinations.
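The slide describes fit as an observed-to-expected ratio; one standard way this is operationalized in Rasch analysis is a mean-square of standardized residuals. The sketch below shows the unweighted (outfit) mean-square form with invented numbers, not the author’s exact computation.

```python
def outfit_mean_square(observed, expected, variance):
    """Unweighted (outfit) mean-square fit: the average squared
    standardized residual across an element's ratings.
    `observed`, `expected`, and `variance` are parallel lists, where
    expected ratings and model variances come from the fitted model."""
    z_squared = [
        (obs - exp) ** 2 / var
        for obs, exp, var in zip(observed, expected, variance)
    ]
    return sum(z_squared) / len(z_squared)

# Invented ratings for one examiner: values near 1.0 indicate ratings
# about as variable as the model expects
obs = [3, 2, 4, 3, 1]
exp = [2.2, 2.9, 3.1, 3.7, 1.8]
var = [0.7, 0.8, 0.6, 0.7, 0.5]
print(round(outfit_mean_square(obs, exp, var), 2))  # ~1.05
```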
25
Fit Statistic: Examiners
• The fit statistics for examiners indicate the degree to
which each examiner is internally consistent across
candidates, skills, and protocols (intra-examiner
consistency).
• The fit statistic allows examiners who award
unexpectedly high or low ratings to some candidates on
some skills or protocols to be identified.
26
Fit Statistic: Candidates, Protocols and
Skills
• The fit statistic for each candidate, protocol and skill
indicates inter-examiner consistency.
• Misfit indicates that some examiners deviated
significantly from others when grading the skill or
protocol for some candidates.
• This information is useful for testing organizations to
monitor and, if necessary, to conduct additional analysis
to identify which rating situations are producing the
unexpected ratings.
27
Guidelines for Implementing a MFRM PA
• Development of the rating scale is critical
• Allows for a “disciplined dialogue” among examiners
about candidate performance
• Rating scale example: Unacceptable, Deficient,
Acceptable and Excellent
• Defining these terms and providing specific examples
of candidate performance for each scale point is critical
39
Thank You
If you have any questions, contact
rbrown2@measinc.com
Please complete the session evaluation that
has been distributed to you.


Editor's Notes

1. Thus, all estimates are derived from sums of ratings, so a careful system of overlap must be included in the examination to ensure an accurate representation of the relationship of the facets of the examination.
2. Estimates typically range from 0 to ±4, usually with the scale set up so that highly positive values indicate more severe, more able, or more difficult, and highly negative values indicate more lenient, less able, or easier.