Chapter 2 The Science of Psychological Measurement (Alivio, Ansula).pptx
2. For most people, test
scores are an important
fact of life. But what
makes those numbers so
meaningful?
3. What are Test
Scores?
● expressed as
numbers
● used to describe,
make inferences about,
and draw conclusions
from test performance.
4. Scale- is a set of numbers (or other symbols)
whose properties model empirical properties
of the objects to which the numbers are
assigned.
○ Continuous Scale (e.g. blood pressure,
measurement to install venetian blinds)
○ Discrete Scale (e.g. male/female subjects)
Scales of Measurement
5. 4 Scales of
Measurement (NOIR)
1. Nominal Scales
2. Ordinal Scales
3. Interval Scales
4. Ratio Scales
9. Ratio Scales
• have order, equal intervals between units, AND an absolute
zero.
• Example: reaction time, and individual scores such as
"number of items correctly recalled" or "number of errors".
12. How to illustrate a Frequency Distribution
graphically:
3 Kinds of Graphs:
• Histogram
• Bar graph
• Frequency polygon
13. Measures of Central
Tendency
Central Tendency- the typical or average score in a
group of scores
• Mean- "arithmetic mean" or "average"
• Median- the "middle" value
• Mode- the score that occurs most often in
the distribution.
14. Mean
• denoted by the symbol X̄ (read "X-bar")
• is equal to the sum of the observations divided by
the number of observations.
For raw scores: X̄ = ΣX / N
For a grouped frequency distribution: X̄ = Σ(fX) / Σf
15. Median
• Middle score
• for an odd number of scores: the middle element
• for an even number: add the two middle elements and divide by 2
Given: 66, 65, 61, 59, 53, 52, 41, 36, 35, 32
Step 1: sort the scores and locate the two middle elements (53 and 52)
66, 65, 61, 59, 53, 52, 41, 36, 35, 32
Step 2: (53 + 52)/2 = 105/2 = 52.5
16. Mode
• The most frequently occurring score in a distribution of
scores.
Given: 66, 65, 61, 59, 53, 52, 41, 36, 35, 66, 32
Mode: 66
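The three measures of central tendency above can be checked with Python's standard library, using the scores from the median and mode slides:

```python
import statistics

# Scores from the median slide (already an even number of observations)
scores = [66, 65, 61, 59, 53, 52, 41, 36, 35, 32]

mean = sum(scores) / len(scores)        # arithmetic mean: 500 / 10
median = statistics.median(scores)      # average of the two middle values: (53 + 52) / 2
print(mean, median)                     # 50.0 52.5

# Scores from the mode slide (66 appears twice)
mode_scores = [66, 65, 61, 59, 53, 52, 41, 36, 35, 66, 32]
print(statistics.mode(mode_scores))     # 66
```

Note that `statistics.median` sorts internally, so the scores need not be pre-sorted.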
17. Measures of Variability
The Statistics that describe the amount of variation in
a distribution
• Range
• Interquartile Range
• Semi-interquartile Range
• Average Deviation
• Standard Deviation
• Variance
18. Range
• is equal to the difference between the highest and the
lowest scores.
Given the following sorted data, find the range.
12, 15, 19, 24, 25, 26, 30, 35, 38
R= HV-LV
R= 38-12
R= 26
19. The Interquartile and Semi-interquartile Ranges
• used to measure how spread out the middle 50% of the
data points in a set are (around the median).
• IQR= Q3 - Q1
• Semi-interquartile range= IQR/2
Interquartile range: 77-64= 13; Semi-interquartile range: 13/2= 6.5
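A minimal sketch of the range and interquartile-range computations, using the sorted data from the range slide (note that different quartile conventions give slightly different Q1/Q3 values; the `inclusive` method below is one common choice):

```python
import statistics

# Sorted data from the range slide
data = [12, 15, 19, 24, 25, 26, 30, 35, 38]

data_range = max(data) - min(data)      # R = HV - LV = 38 - 12 = 26

# Quartiles; the 'inclusive' method interpolates over the observed min/max
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                           # interquartile range
semi_iqr = iqr / 2                      # semi-interquartile range
print(data_range, iqr, semi_iqr)        # 26 11.0 5.5
```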
20. Average Deviation
• Denoted as AD
• Formula: AD = Σ|X − X̄| / N
• provides a solid foundation for understanding
the conceptual basis of another, more widely
used measure: the standard deviation.
21. Standard Deviation
• a measure of variability equal to the square root of
the variance
• Formula: s = √( Σ(X − X̄)² / N )
22. Variance
• Is the square of the standard deviation.
• Note: The larger the variance, the greater the
variability or the distance of scores from the
mean. The smaller the variance, the lesser the
variability.
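The deviation-based measures above can be sketched in a few lines, reusing the data from the range slide. This follows the population (divide-by-N) convention; sample statistics divide by N − 1 instead:

```python
import math

data = [12, 15, 19, 24, 25, 26, 30, 35, 38]
n = len(data)
mean = sum(data) / n

# Average deviation: mean absolute distance from the mean
ad = sum(abs(x - mean) for x in data) / n

# Variance: mean squared distance from the mean
variance = sum((x - mean) ** 2 for x in data) / n

# Standard deviation: square root of the variance
sd = math.sqrt(variance)
print(round(ad, 2), round(variance, 2), round(sd, 2))
```

As the slide notes, a larger variance means scores sit farther from the mean on average.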
28. Psychological Traits and States Exist
Psychological Traits and States Can Be Quantified
and Measured
Test-Related Behavior Predicts Non-Test-Related
Behavior
Tests and Other Measurement Techniques Have
Strengths and Weaknesses
29. Various Sources of Error Are Part of the Assessment
Process
Testing and Assessment Can Be Conducted in a Fair
and Unbiased Manner
Testing and Assessment Benefit Society
30. Norms
are the test performance data
of a particular group of
testtakers that are designed
for use as a reference when
evaluating or interpreting
individual test scores.
31. Types of Norms
01 Percentile Norms- raw data converted to percentage form
02 Age Norms- age-equivalent scores
03 National Norms- based on a nationally representative sample
of the population
04 National Anchor Norms- provide stability to test scores
05 Grade Norms- average test performance per grade
06 Subgroup Norms- e.g., socioeconomic status, race
07 Local Norms- provide normative information for a local
population
32. Norm-Referenced versus
Criterion-Referenced Evaluation
● Norm-Referenced Testing and Assessment-
evaluating an individual test taker's score by
comparing it to the scores of a group of test takers.
● Criterion-Referenced Testing and Assessment-
evaluating an individual's score with reference to
a set standard.
33. Test Standardization and Sampling to Develop Norms
01 Test Standardization- administering the test to establish norms
02 Sampling- selecting a portion of the population
03 Sample of a Population- a portion deemed representative of
the whole population
04 Stratified Sampling- developing a sample that reflects the
population's subgroups
05 Convenience Sample- a sample from the portion of the
population close at hand
34. Correlation and Inference
● Inferences- deduced conclusions
● Correlation- an expression of the degree and
direction of correspondence between two
things.
○ Pearson r
○ Spearman’s rho
○ Regression
35. Culturally Informed Assessment
Do: Be aware of the cultural assumptions on which a test is based
Do Not: Take for granted the assumptions on which a test is based

Do: Consider consulting with members of particular
cultural communities
Do Not: Take for granted that members of all cultural
communities will automatically deem particular
techniques appropriate for use

Do: Strive to incorporate culturally appropriate assessment methods
Do Not: Take a "one-size-fits-all" view of assessment

Do: Be knowledgeable about the many alternative
tests or measurement procedures
Do Not: Select tests or other tools of assessment with
little or no regard for the extent to which such
tools are appropriate for use

Do: Be aware of equivalence issues across cultures
Do Not: Assume that a test translated into another
language is automatically equivalent to the
original

Do: Score, interpret, and analyze assessment data
in its cultural context
Do Not: Score, interpret, and analyze assessment data in a
cultural vacuum
37. 4 Ways to Assess Reliability
01 Test-Retest- stability, temporal consistency
02 Inter-rater- agreement among independent judges
03 Parallel/Alternate Forms- stability and equivalence
04 Internal Consistency- homogeneity
38. ● A score on an ability test is presumed to
reflect not only the testtaker's true score on
the ability being measured but also error
● reliability is expressed in terms of variance: the
proportion of total score variance attributable to
true score variance
42. • Validity is a term used in conjunction with the meaningfulness
of a test score or in other words what the test score truly
means.
• Validity, as applied to a test, is a judgment or estimate of how
well a test measures what it purports to measure in a particular
context.
• Characterizations of the validity of tests and test scores are
frequently phrased in terms such as “acceptable” or “weak.”
VALIDITY
43. • One way measurement specialists have traditionally
conceptualized validity is according to three categories:
1. Content Validity
2. Criterion-related Validity
3. Construct Validity
THE THREE CATEGORIES OF VALIDITY
44. • Content validity refers to the extent to which a measure
represents all facets of a given construct.
• Refers to the extent to which the items on a test are fairly
representative of the entire domain the test seeks to measure.
• For example, a depression scale may lack content validity if it only
assesses the affective dimension of depression but fails to take
into account the behavioral dimension.
CONTENT VALIDITY
45. • One method of measuring content validity, developed by C. H.
Lawshe, is essentially a method for gauging agreement among
raters or judges regarding how essential a particular item is.
• Rated as either:
⮚ Essential
⮚ Useful but not essential or
⮚ Not necessary
CONTENT VALIDITY RATIO
CVR = (ne − N/2) / (N/2)
Where CVR = content validity ratio, ne = number of panelists
indicating "essential," and N = total number of panelists.
CVR FORMULA:
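The CVR formula translates directly into code. The panel counts below are hypothetical, chosen only to illustrate the arithmetic:

```python
def content_validity_ratio(n_essential, n_panelists):
    # CVR = (ne - N/2) / (N/2)
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical panel: 8 of 10 judges rate an item "essential"
print(content_validity_ratio(8, 10))    # 0.6
```

CVR ranges from −1 (no one says "essential") through 0 (exactly half do) to +1 (everyone does), so higher values indicate stronger panel agreement that the item is essential.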
47. • A criterion is defined broadly as a standard on which a
judgment or decision may be based.
• Criterion-related validity is a judgment of how adequately a
test score can be used to infer an individual’s most probable
standing on some measure of interest.
• The measure of interest being the criterion.
CRITERION-RELATED VALIDITY
48. • Two types of validity evidence are subsumed
under the heading of criterion-related validity:
⮚ Concurrent validity
⮚ Predictive validity
49. • Concurrent validity refers to the extent to which the results
of a measure correlate with the results of an established
measure of the same or a related underlying construct
assessed within a similar time frame.
• On the other hand, if the measure is correlated with a future
assessment, this is termed predictive validity. It is the extent
to which a score on a scale or test predicts scores on some
criterion measure.
50. • A construct is an informed, scientific idea developed or
hypothesized to describe or explain behavior.
• Construct validity is a judgment about the appropriateness
of inferences drawn from test scores regarding individual
standings on a variable called a construct.
• is the degree to which a test measures what it claims, or
purports, to be measuring.
CONSTRUCT VALIDITY
51. • Increasingly, construct validity has been viewed as the
unifying concept for all validity evidence (American
Educational Research Association et al., 1999).
• As we noted at the outset, all types of validity evidence,
including evidence from the content- and criterion-
related varieties of validity, come under the umbrella of
construct validity.
53. • In everyday language, we use the term utility to refer to the usefulness of
something or some process.
• In the language of psychometrics, utility means much the same thing; it
refers to how useful a test is.
• More specifically, it refers to the practical value of using a test to aid
in decision-making.
• We may define utility in the context of testing and assessment as the
usefulness or practical value of testing to improve efficiency.
UTILITY
54. • Moreover, judgments about a test's utility can easily be affected
by the test's psychometric soundness, costs, and benefits.
⮚ Psychometric soundness (pertaining to the reliability and
validity of a test).
⮚ Cost (pertains to how much budget is put into the
test/research).
⮚ Benefit (answers the question: would the overall time and
effort of testing be even worth it?).
55. • A utility analysis may be broadly defined as a family of
techniques that entail a cost–benefit analysis designed to
yield information relevant to a decision about the usefulness
and/or practical value of a tool of assessment.
• In a most general sense, a utility analysis may be
undertaken for the purpose of evaluating whether the
benefits of using a test outweigh the costs.
UTILITY ANALYSIS
56. • The term cut score (or cut-off score) refers to the minimum
score on an exam, standardized test, high-
stakes test, or other form of assessment that a taker
must earn to either "pass" or be considered "proficient."
• Here are some examples of methods for getting cut
scores:
METHODS FOR SETTING CUT SCORES
57. • Devised by William Angoff (1971), it is a kind of study that
test developers use to determine the passing percentage
(cut score) for a test.
• This method for setting fixed cut scores can be applied to
personnel selection tasks as well as to questions regarding
the presence or absence of a particular trait, attribute, or
ability.
THE ANGOFF METHOD
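A minimal sketch of the Angoff procedure, with entirely hypothetical judge ratings: each judge estimates, per item, the probability that a minimally competent test taker answers it correctly, and the cut score is the sum of the per-item average probabilities.

```python
# Hypothetical ratings: each row is one judge's probability estimates for 4 items
judge_ratings = [
    [0.8, 0.6, 0.9, 0.5],   # judge 1
    [0.7, 0.5, 0.8, 0.6],   # judge 2
    [0.9, 0.7, 0.7, 0.4],   # judge 3
]

n_items = len(judge_ratings[0])
# Average the judges' estimates item by item
item_means = [sum(judge[i] for judge in judge_ratings) / len(judge_ratings)
              for i in range(n_items)]
# Cut score = expected raw score of a borderline (minimally competent) candidate
cut_score = sum(item_means)
print(round(cut_score, 2))   # 2.7 items correct out of 4
```

In a real Angoff study the panel typically discusses discrepant ratings and iterates before the final averages are taken.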
58. • Also referred to as the method of contrasting groups, the
known groups method entails collection of data on the
predictor of interest from groups known to possess, and not
to possess, a trait, attribute, or ability of interest. Based on
an analysis of this data, a cut score is set on the test that
best discriminates the two groups’ test performance.
THE KNOWN GROUPS METHOD
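One very simple decision rule for the known groups method can be sketched as follows; the scores are hypothetical, and the midpoint-of-means rule stands in for whatever analysis best discriminates the two groups in practice (e.g., minimizing misclassification):

```python
# Hypothetical predictor scores for two known groups
has_trait = [78, 82, 85, 90, 88]     # group known to possess the trait
lacks_trait = [55, 60, 62, 58, 65]   # group known not to possess it

mean_has = sum(has_trait) / len(has_trait)        # 84.6
mean_lacks = sum(lacks_trait) / len(lacks_trait)  # 60.0

# Simplest rule: set the cut score halfway between the two group means
cut_score = (mean_has + mean_lacks) / 2
print(cut_score)
```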
59. • In this theory, cut scores are typically set based on test
takers' performance across all the items on the test; some
portion of the total number of items on the test must be
scored “correct” in order for the test taker to “pass” the
test.
• Whereas classical test theories focus on the test as a
whole, item response theory shifts its focus to the
individual items (questions) themselves.
IRT-BASED METHODS
61. • “All tests are not created equal.” The creation of a good test
is not a matter of chance.
• It is the product of the thoughtful and sound application of
established principles of test construction.
• Making a good test is a five-step process.
TEST DEVELOPMENT
62. 1. Test Conceptualization (come up with an idea/ a test idea).
2. Test Construction (draft up a plan of what builds and fashions the test).
3. Test Tryout (try out the test on sample takers/ participants).
4. Item Analysis (analyze which items or areas of the test need
revising or changing. This includes eliminating the irrelevant parts).
5. Test Revision (the test is perfected by the revision of the second draft).
THE FIVE STAGES IN DEVELOPING A
TEST:
Editor's Notes
A simple way to categorize different types of variables.
Useful in how to choose the appropriate Inferential Statistic Tool.