Unit–7
Test Development and Qualities of a Test
Written by:
Dr. Fayyaz Ahmad Faize
TABLE OF CONTENTS
1. Achievement Test
   1.1 Purposes/uses of achievement test
2. Attitude Scale
   2.1 Measuring Attitude
3. Steps for Test Development
4. Qualities of a Good Test
5. Reliability
   5.1 Reliability Coefficient
   5.2 Relationship between Validity and Reliability
6. Reliability Types
   6.1 Test-Retest Reliability
   6.2 Equivalence Reliability or Inter-class Reliability
   6.3 Split-Halves Reliability
7. Factors Affecting Reliability
8. Validity
   8.1 Content-related Validity
   8.2 Criterion-related Validity
      8.2.1 Concurrent Validity
      8.2.2 Predictive Validity
   8.3 Construct-related Validity
Self-Assessment Questions
9. References
OBJECTIVES
After studying this chapter, the students will be able to:
 Describe achievement tests and attitude scales
 Explain the steps involved in test development
 Describe the qualities of a good test
 Define and interpret reliability and validity
 Discuss how to determine the reliability and validity of tests
 Understand the relationship between reliability and validity
 Understand the basic kinds of validity evidence
 Interpret various expressions of validity
 Recognize what factors affect validity
1. ACHIEVEMENT TEST
Achievement tests are designed to measure accomplishment. Usually, such a test is administered at the end of some learning activity or process to ascertain the degree to which the required task has been accomplished.
For example, an achievement test for a student of a Nursery class might assess knowledge of the English alphabet, numbers and key science concepts. Thus, achievement tests help in measuring the degree of learning on tasks that have already been taught or guided. The tasks may be specific and short, or they may be comprehensive and detailed. An achievement test may also be standardized, such as a secondary-class Chemistry test on formulae and valences or a Physics test on fundamental quantities or kinematics.
Another useful term is 'general achievement', which relates to the measurement of learning experiences in one or more academic areas. This would usually involve a number of subtests, each
aimed at measuring some specific learning experiences/targets. These subtests are sometimes
called achievement batteries. Such batteries may be individually administered or group
administered. They may consist of a few subtests, as does the Wide Range Achievement Test-4
(Wilkinson & Robertson, 2006) with its measures of reading, spelling, arithmetic, and (new to the
fourth edition) reading comprehension.
An achievement battery may be as comprehensive as the STEP Series, which includes subtests in reading,
vocabulary, mathematics, writing skills, study skills, science, and social studies; a behavior
inventory; an educational environment questionnaire; and an activities inventory. Some batteries,
such as the SRA California Achievement Tests, span kindergarten through grade 12, whereas
others are grade or course-specific. Some batteries are constructed to provide both norm-referenced
and criterion-referenced analyses. Others are concurrently normed with scholastic aptitude tests to
enable a comparison between achievement and aptitude. Some batteries are constructed with
practice tests that may be administered several days before actual testing to help students
familiarize themselves with test-taking procedures. One popular instrument appropriate for use
with persons aged 4 through adult is the Wechsler Individual Achievement Test-Second Edition,
otherwise known as the WIAT-II (Psychological Corporation, 2001). This instrument is used not
only to gauge achievement but also to develop hypotheses about achievement versus ability. It
features nine subtests that sample content in each of the seven areas listed in a past revision of the
Individuals with Disabilities Education Act: oral expression, listening comprehension, written
expression, basic reading skill, reading comprehension, mathematics calculation, and mathematics
reasoning.
For a particular purpose, a battery that focuses on achievement in a few select areas may be
preferable to one that attempts to sample achievement in several areas. On the other hand, a test
that samples many areas may be advantageous when an individual comparison of performance
across subject areas is desirable. If a school or a local school district undertakes to follow the
progress of a group of students as measured by a particular achievement battery, then the battery of
choice will be one that spans the targeted subject areas in all the grades to be tested. If ability to
distinguish individual areas of difficulty is of primary concern, then achievement tests with strong
diagnostic features will be chosen. Although achievement batteries sampling a wide range of areas,
across grades, and standardized on large, national samples of students have much to recommend
them, they also have certain drawbacks. For example, such tests usually take years to develop; in
the interim the items, especially in fields such as social studies and science, may become outdated.
Further, any nationally standardized instrument is only as good as the extent to which it meets the
(local) test user’s objectives.
1.1 Purposes/uses of achievement test
i. To measure students' mastery of certain essential skills and knowledge, such as proficiency in recalling facts, understanding concepts and principles, and using skills
ii. To measure students' growth/progress over time for promotion purposes. This helps the school in making decisions about a student's placement in a specific program, class or group, or about promotion to the next level.
iii. To rank pupils in terms of their achievement by comparing the performance of an individual to the norm or average performance of his/her group (norm-referenced)
iv. To identify and diagnose pupils' problems. Given a federal mandate to identify children with a "severe discrepancy between achievement and intellectual ability" (Procedures for Evaluating Specific Learning Disabilities, 1977, p. 65083), it can readily be appreciated how achievement tests, as well as intelligence tests, could play a role in the diagnosis of a specific learning disability (SLD).
v. To evaluate the effectiveness of the teacher's instructional methods
vi. To encourage good study habits in the students and motivate them to work hard.
2. ATTITUDE SCALE
An attitude may be defined formally as a presumably learned disposition to react in some
characteristic manner to a particular stimulus. The stimulus may be an object,
a group, an institution—virtually anything. Although attitudes do not necessarily predict
behavior (Tittle & Hill, 1967; Wicker, 1969), there has been great interest in measuring
the attitudes of employers and employees toward each other and toward numerous
variables in the workplace. As the name implies, this type of scale tries to measure an individual's beliefs, attitudes and perceptions towards oneself, others, or some phenomena, activities, situations etc.
2.1 Measuring Attitude
Attitude can be measured using self-report tests and/or questionnaires. However, it is not easy to measure attitude accurately, as individuals differ greatly in their ability to introspect rightly about their attitudes and in their level of self-awareness. Moreover, some people feel reluctant to share or report their attitudes to others. It may also happen that people hold or form attitudes that they themselves do not know about.
The measurement of attitude was addressed early on by Likert (1932) in his monograph, "A Technique for the Measurement of Attitudes", which relates to designing an instrument for measuring attitude. A Likert scale seeks an individual's response to a number of statements in terms of his/her level of agreement or disagreement. The options may be Strongly Agree, Agree, Undecided, Disagree, Strongly Disagree. The degree of agreement or disagreement reflects the individual's attitude about a certain phenomenon or statement. Each response is assigned a specific score from 1 to 5. For a positive statement, 5 is assigned to Strongly Agree and 1 to Strongly Disagree.
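The scoring just described can be sketched in code. The following Python fragment is illustrative only; the item responses are invented, and the reverse-scoring of negatively worded statements is an assumed common convention, not part of Likert's monograph.

```python
# Minimal sketch of Likert-scale scoring (illustrative; data invented).
SCORES = {"Strongly Agree": 5, "Agree": 4, "Undecided": 3,
          "Disagree": 2, "Strongly Disagree": 1}

def score_item(response: str, positive: bool = True) -> int:
    """Return the 1-5 score for one item; reverse-score negative statements."""
    raw = SCORES[response]
    return raw if positive else 6 - raw

# Three responses; the last item is a negatively worded statement.
responses = [("Agree", True), ("Strongly Agree", True), ("Disagree", False)]
print(sum(score_item(r, positive) for r, positive in responses))  # 4 + 5 + 4 = 13
```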
According to Thurstone (1928), attitude can be measured, as argued in his article "Attitudes Can Be Measured". More recently, the research of Banaji (2001) further supported this contention in the article "Implicit Attitudes Can Be Measured". Implicit attitudes are "introspectively unidentified (or inaccurately identified) traces of past experience that mediate favorable or unfavorable feeling, thought, or action toward social objects" (Greenwald & Banaji, 1995, p. 8). Stated another way, they are nonconscious, automatic associations in memory that
produce dispositions to react in some characteristic manner to a particular stimulus.
Implicit attitudes can be measured using the Implicit Association Test (IAT), a computerized sorting task by which implicit attitudes are gauged with reference to the test taker's reaction times. For example, the individual is shown a particular stimulus and is asked to categorize or associate another word or phrase with it without taking much time. Thus, a person's attitude to 'terror' can be gauged by presenting the word 'terror' and asking the person to quickly associate other favorable or unfavorable words with it. Using the IAT or similar protocols, implicit attitudes toward a wide range of stimuli can be measured. Implicit attitudes have been studied in relation to racial prejudices, threats, voting behavior, professional ethics, self-esteem, drug use etc.
Measuring implicit attitude is now frequently used in consumer psychology to study consumer preferences. In consumer psychology, the attitude may be found by asking a series of questions about a product or choice; the individual's responses are noted and may reflect the belief or thinking of the individual. The responses of people can be sought through a survey or opinion poll using questionnaires, emails, Google forms, social media posts etc. The surveys and polls may be conducted by means of face-to-face, online and telephone interviews, as well as by mail. Face-to-face interaction helps in getting a quicker response and in making sure the questions are well understood. Moreover, the researcher can present or show the products directly and seek people's responses to them. However, face-to-face interaction also has a drawback: sometimes people react in a way they feel is favorable to the researcher, or the researcher's gestures influence the choice of the respondents.
Another type of scale used to measure attitude is the semantic differential technique. In this type of scale, the respondent is given two opposite extremes and is asked to place a mark on one of the 7 spaces in the continuum according to his/her level of preference. The two bipolar extremes might be easy-difficult, good-bad, weak-strong etc.
Strong   __ : __ : __ : __ : __ : __ : __   Weak
Decisive __ : __ : __ : __ : __ : __ : __   Indecisive
Good     __ : __ : __ : __ : __ : __ : __   Bad
Cheap    __ : __ : __ : __ : __ : __ : __   Expensive
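Scoring such a scale is straightforward. The sketch below (Python, illustrative) assumes the common convention of converting the marked position to a 1-7 score, with the higher score at the favourable pole:

```python
# Minimal sketch of scoring one semantic-differential item (illustrative).
def score(position: int, favourable_on_left: bool = True) -> int:
    """position runs 1 (leftmost space) to 7 (rightmost space)."""
    return 8 - position if favourable_on_left else position

print(score(2))  # a mark near "Strong" scores 6
```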
3. STEPS FOR TEST DEVELOPMENT
The creation of a good test is not a matter of chance; rather, it requires sound knowledge of the principles of test construction (Cohen & Swerdlik, 2010). The development of a good test requires some steps; however, these steps are not fixed, as various authors have suggested different steps/stages for developing a test. Following are some of the general steps for test development.
1. Identification of objectives
This is one of the most important steps in developing any test, when the test authors need to consider in detail what exactly they aim to measure, i.e. the purpose of the test. It is especially important to define the purpose of the test clearly because that increases the possibility of achieving high validity. It defines what exactly is required to be measured by a test and thus helps in improving the validity of the test. There are two kinds of objectives: behavioral and non-behavioral. As the names suggest, behavioral objectives deal with "activities that are observable and measurable whereas non-behavioral objectives specify activities that are unobservable and not directly measurable" (Reynolds et al., 2009, p. 177). Without predefined objectives, a test will be meaningless and purposeless.
2. Deciding about the test format
The format/design of the test is another important element in constructing a test. The test developer needs to decide which format/design will be the most suitable for achieving the set objectives. The format of the test may be objective type, essay type or both. The examiner will further decide what type of objective items shall be included: multiple-choice, fill in the blanks, matching items, short answer etc. The test author also decides the number of marks assigned to each format and the total amount of time to complete the test.
3. Making a table of specifications
A table of specifications serves as a test blueprint. It helps in ensuring a suitable number of items from the whole content as well as specifying the type of assessment objectives that the items will be testing. The table ensures that all levels of instructional objectives are used in the test questions. It lists the number of items from each content area, the weightage assigned to each content area and the type of instructional objective each item will be measuring, whether recall, understanding or application. Last but not least, the examiner shall also decide the weightage of each format (objective and subjective) within the test and the weightage in terms of difficulty level (easy, moderate, difficult). For example, in developing an English test, the teacher can focus on the following areas.
 What language skills should be included – will there be a list of grammatical structures and lexis, etc.;
 What sort of tasks are required – objectively assessable, integrative, simulated "authentic", etc.;
 How many items are required for each section, and what their relative weight will be – equal weighting or extra weighting for more difficult items;
 What test methods are to be used – multiple choice, gap filling, matching, transformations, picture descriptions, essay writing, etc.;
 What rubrics are to be used as instructions for students – will examples be included to help students know what is expected, and should the assessment criteria be added to the rubric;
 What assessment criteria will be used – how important are accuracy, spelling, length of written text, etc.
4. Writing Items
The examiner writes the items keeping in mind the table of specifications and the difficulty level of the items. Although it is debatable whether items should be arranged randomly or from easy to difficult, a common practice is to let the items progress from simple to difficult. The examiner should ensure that the test can be completed within the stipulated time. The language of the test items should be simple, brief and lucid, and should be checked for grammar, spelling and punctuation.
5. Preparation of the marking scheme
The test developer decides the number of marks to be assigned to each item or to the relevant bits of detail in the students' answers. This is necessary to ensure consistency in marking and to make scoring more scientific and systematic. Essay-type questions can be divided into smaller components and the marks defined for each important concept/point.
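As an illustration, a marking scheme can be written down explicitly so that every marker awards the same marks for the same content. The sketch below is hypothetical; the component names and mark values are invented:

```python
# Minimal sketch of a marking scheme for one essay question (illustrative).
scheme = {"definition of reliability": 2,
          "test-retest example": 2,
          "limitation discussed": 1}

def mark(points_present: set) -> int:
    """Award marks only for the key points found in the answer."""
    return sum(m for point, m in scheme.items() if point in points_present)

print(mark({"definition of reliability", "limitation discussed"}))  # 3 out of 5
```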
As regards developing a standardized test, the following steps are given by Cohen and Swerdlik (2010), though they can also be applied to custom tests made by teachers, researchers and recruiters. The process encompasses five stages:
1. test conceptualization
2. test construction
3. test tryout
4. item analysis
5. test revision
The process of test development starts from conceptualizing the idea of the test and the purpose for which the test has to be constructed. A test may be designed around some emerging phenomena, problems, issues or needs. Test conceptualization might also include the construct or the concepts which the test should measure. What kind of objectives or behavior should the test measure in the presence of other such tests? Is there any need for making a new test, or can an existing test be used for the set purpose? How can the test be better than the existing tests? Who will be the users of the test: students, teachers or employers? What will be the content of the test? How will the test be administered, individually or in groups? Will the test be written, oral or practical? What will be the format of the test items, and what will be the proportion of objective and subjective items? Who will benefit from the test? Will there be any harmful effect of the test? If yes, then on whom?
Based on the purpose, needs and the objectives to be achieved, the items for the test are constructed/selected. The test is then pilot tested on a sample to try out whether the items in the test are appropriate for achieving the set objectives. Based on the results from the test tryout or pilot test, the items in the test are put to item analysis. This requires the use of statistical procedures for determining the difficulty level of items, reliability and validity. This process helps in selecting the appropriate items for the test, while the inappropriate items may be revised or deleted. This finally produces a revised draft of the test that is better than the initial version. The process may be repeated till a refined, standardized version is available.
4. QUALITIES OF A GOOD TEST
In constructing a test, the test developer should aim at making a good test. A bad test may defeat the purpose of testing and would thus be useless to administer. According to Mohamed (2011), a good test should have the following properties.
Objectivity
It is very important for a test to be objective. A test with higher objectivity will eliminate personal biases and influences in scoring and interpreting the test result. This can be achieved by including more objective-type items in the test: multiple choice questions, fill in the blanks, true-false, matching items, short questions-answers etc. In contrast, essay questions are subjective. Different examiners may award different marks while checking such questions, depending upon the mood of the person, knowledge level and personal likes and dislikes. However, essay-type questions can be made more objective through a well-defined marking scheme that assigns marks to the small bits of important and relevant information in the long answers.
Comprehensiveness
A good test should cover the content area which is taught. The items in the test should be drawn from different areas of the course content. If one topic or area is assigned more question items and the other areas are neglected, then such a test will not be a good one. A good English test may have items taken from composition, comprehension, dialogue, creative writing, grammar, vocabulary etc. Meanwhile, due importance may be given to important bits of the content according to their utility and significance.
Validity
Validity means that a test rightly measures what it is supposed to measure; it tests what it ought to test. For example, a good test which measures control of grammar should have no difficult lexical items. Validity is explained in detail in the validity section.
Reliability
Reliability of a test refers to the degree of consistency with which it measures what it is intended to measure. If a test is retaken by the same students under the same conditions, the scores will be almost the same, provided that the time between the test and the retest is of reasonable length. In this case it is said that the test provides consistency in measuring whatever is being evaluated. Details about reliability are given in the reliability section.
Discriminating Power
The discriminating power of a test is its power to discriminate between the upper and lower groups who took the test. Thus, a good test should not contain only difficult items or only easy items; rather, it should contain items with different difficulty levels to sift students with different intelligence levels. The questions should progressively increase in difficulty to reduce stress and tension in students.
Practicability
The test should be realistic and practicable. It should not measure unrealistic targets or objectives. It should be easy to administer as well as easy to score, and it should be economical, without wasting too many resources, energies and efforts. Tests may be competitive and sometimes difficult to complete within the stipulated time when their specific purpose is to select students with a higher IQ level and shorter reaction time. Otherwise, classroom tests shall keep in mind the individual differences of students and provide ample opportunity for completion.
Simplicity
It refers to clarity in language, correctness, adequacy and lucidity. Ambiguous questions and items with multiple meanings should be avoided. The students should be very clear about what the question is asking and how to answer it. Sometimes students get confused about the possible answers due to lack of clarity in the questions.
5. RELIABILITY
According to Gay, Mills and Airasian (2011), "Reliability is the degree to which a test consistently measures whatever it measures".
Thorndike (2005) refers to reliability as the "accuracy or precision of a measurement procedure", while Mehrens and Lehmann (1991) defined reliability as "the degree of consistency between two measures of the same thing".
It also signifies the repeatability of observations or scores on a measurement. Some other terms used to define reliability include dependability, stability, accuracy and regularity in measurement.
For a test, high reliability would mean that a person gets the same, or nearly the same, score each time the test is administered to him/her. If the person obtains a different score each time the test is administered, then the test's reliability will be questioned.
Reliability can be ascertained by the examiner by administering the same test on two different occasions. The scores obtained on the two occasions may be compared to determine the degree of reliability. Another method is to test students on one test and then administer another, different test; the scores obtained by the students on the two tests may be compared to find the reliability of the two tests. If there is much difference in the students' scores on the two tests, then the two tests have poor reliability. Essay-type questions may have poor reliability, as students may get a different score each time the answers are marked. In comparison, multiple choice questions have a higher reliability than essay-type questions.
A test may not be reliable in all settings. A test may be reliable in a specific situation, under specific circumstances and with a specific group of subjects. However, it may not be reliable in a different situation or with a different group of students under different circumstances.
5.1 Reliability Coefficient
Whether in physical measurement or in testing, it is difficult to achieve 100% consistency in scores; what matters is the degree of closeness or consistency between the measurements. For this purpose, the degree of reliability of a test is expressed numerically and is termed the reliability coefficient. According to the Merriam-Webster dictionary, the reliability coefficient is a measure of the accuracy of a test or measuring instrument obtained by measuring the same individuals twice and computing the correlation of the two sets of measures.
The reliability coefficient is a way of confirming how accurate a test or measure is; it essentially measures consistency in scoring. It is found by giving the test to the same subjects more than once and determining the correlation between the two scores. This also reveals the strength of the relationship and the similarity between the two scores. If the two scores are close enough, then the test can be said to be accurate and to have good reliability. The variation in the score is called error variance, and the source of that variation is called the source of error. The smaller the error, the more consistent and reliable the score, and vice versa. As an example, an individual could be given a measure of self-esteem and then given the same measure again. The two scores would be correlated and the reliability coefficient produced. If the scores are very similar, then the measures are reliable: they are consistently measuring the same thing, which in this case is self-esteem.
The maximum value of the reliability coefficient is 1.00, which means that the test is perfectly reliable, while the minimum value is 0.00, which indicates no reliability. In actual situations, however, it is not possible to have a perfectly reliable test, so the coefficient of reliability will be less than 1.00. The reason is the effect of various factors and errors in measurement. These include errors caused by the test itself, due to ambiguous test items that are interpreted differently by students. Differences in the condition of students (emotional, physical, mental etc.) are also responsible for producing errors in measurement, such as the fatigue factor, the arousal of specific emotions such as anger, fear or depression, and lack of motivation. Moreover, the selection of test items and their construction, sequence, wording etc. may also result in measurement error, thus affecting the reliability coefficient.
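In practice, the computation is just a correlation between the two sets of scores; the same calculation applies to the test-retest method described later. A minimal Python sketch, with invented scores for eight students tested twice:

```python
# Minimal sketch: reliability coefficient as the Pearson correlation
# between two administrations of the same test (scores invented).
from scipy.stats import pearsonr

first_administration = [55, 62, 70, 48, 81, 66, 59, 74]
second_administration = [57, 60, 72, 50, 79, 68, 61, 73]

r, _ = pearsonr(first_administration, second_administration)
print(f"reliability coefficient = {r:.2f}")  # close to 1.00, i.e. consistent
```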
5.2 Relationship between Validity and Reliability
A test which is valid is also reliable; however, a test which is reliable is not necessarily valid. If a test is valid, it is rightly measuring the purpose/objectives it is supposed to be measuring. The score obtained on such a test is also reliable, because the test is rightly measuring its intended purpose, and the score, whether lower or higher, will be consistent. In comparison, a reliable test is one on which the students' scores come out consistently the same, yet such a test may not be rightly measuring its intended purpose and may thus be invalid. Hence, a test which is reliable may or may not be valid, but a test which is valid must be reliable. A test with a reliability coefficient of 0.93 is a highly reliable test, but is the test really measuring the set objectives from the given content? If it measures its intended purpose, then the test is also valid; if it does not measure the concepts from the given content, then it is invalid. [For more detail see Gay, Mills, & Airasian (2011)]
6. RELIABILITY TYPES
Some types of reliability are given below:
 Test-Retest Reliability
 Equivalence Reliability or inter-class reliability
 Split-Halves Reliability
6.1 Test-Retest Reliability
One of the simplest ways to determine the reliability of a test is the test-retest method. Test-retest reliability is the degree to which scores on a test are consistent over time. The subjects are given the same test on two occasions, and the scores obtained are compared to see how consistent they are across the two administrations. This can be found by measuring the correlation between the two sets of scores. If the correlation coefficient is high, then the test has a high degree of reliability. This method is seldom used by subject teachers but is frequently used by test developers or commercial test publishers such as those of IELTS, TOEFL, GRE etc.
One issue that arises here is how much time should elapse between the two administrations. If the time interval is short, say a few hours or days, then the chances of students remembering their previous answers will be high; they will therefore score much the same, which inflates the reliability coefficient. If the duration is long, then the ability to perform well on the test increases due to learning over time, again affecting the reliability coefficient. Thus, in reporting test-retest reliability, the time interval between the tests should be mentioned along with the reliability coefficient. This kind of reliability is ensured for aptitude and achievement tests so that they measure the intended purpose each time they are administered.
6.2 Equivalence Reliability or inter-class reliability
It relates to two tests that are similar in every aspect except the test items. The correlation between the two tests is then measured; if the coefficient of reliability, known in this case as the coefficient of equivalence, is high, then the two tests are highly reliable, and vice versa. It shall be kept in consideration that the two tests must measure the same variables and have the same number of items, structure and difficulty level. Besides, the directions for administering the two tests shall also be the same, with a similar scoring style and interpretation. The purpose is to make the scoring on both tests consistent and reliable: no matter which version a student takes, his/her score should be the same on both tests. This is usually used in situations where the number of candidates is very large or a test is to be taken on two occasions. In such situations, the examiner constructs different versions of the same test so that each group of students can be administered the test at a different time without the fear of test items leaking or repeating. In some circumstances, researchers make sure to use equivalent pre-tests and post-tests to measure the actual difference in performance, removing the measurement error that arises from recalling/remembering the answers given on the first test.
The procedure for establishing equivalence reliability is to construct two versions of the test measuring similar objectives taken from the same content area, with the same number of items, difficulty level etc. One form of the test is administered to an appropriate group. After some time, the second form of the test is administered to the same group. The scores obtained by students on both tests are then correlated to find the coefficient of reliability. The difference in the scores obtained by students is treated as error.
6.3 Split-Halves Reliability
This type of reliability is used for measuring internal consistency between the items in a single test. It is theoretically the same as finding equivalence reliability; however, here the two parts are taken from the same test. This reliability can be found by administering the test only once, and thus the effect/error caused by the time interval, the students' condition (physical, mental, emotional etc.) or the use of two groups is minimized. As the name indicates, the items of a single test are divided into two halves to form two equivalent parts of a test. The two parts can be obtained by various methods, e.g. dividing the test items into two halves with an equal number of items in each half, or splitting the test items into odd-numbered and even-numbered items.
In case the test is divided into odd- and even-numbered items, the reliability is calculated as follows. Firstly, the test is administered to subjects and the items are marked. The items are divided into two halves by combining all the odd items in one half and the even items in the second half. The scores obtained on odd- and even-numbered items are totaled separately. Thus, there are two sets of scores for each student: the score obtained on odd-numbered items and the score on even-numbered items. The two scores are then correlated using the Pearson product-moment correlation coefficient. If the value of the correlation coefficient is high, then the two parts of the test are highly reliable, and vice versa.
The reliability coefficient obtained from the correlation needs to be adjusted/corrected, as this coefficient is for a test which has been divided into two (split halves); the actual reliability of the whole test will be higher. This is computed using the Spearman-Brown prophecy formula. Suppose the reliability coefficient for a 40-item test was .70, obtained by correlating the scores for the 20 odd and 20 even items. The reliability coefficient for the whole test (40 items) is then found using the following formula:
r(total test) = 2 r(half test) / (1 + r(half test))

r(total test) = 2(.70) / (1 + .70) = 1.40 / 1.70 = .82
The advantage of split-halves reliability is that a single test is administered only once. Thus, it can be used economically and conveniently by classroom teachers and researchers to collect data about a test.
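A minimal sketch of the whole procedure in Python, with invented 0/1 item scores for five students on an eight-item test:

```python
# Minimal sketch of split-half reliability with the Spearman-Brown
# correction (data invented; 1 = item correct, 0 = wrong).
from scipy.stats import pearsonr

scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],   # student 1, items 1-8
    [0, 1, 1, 0, 1, 1, 0, 0],   # student 2
    [1, 0, 1, 1, 0, 1, 1, 1],   # student 3
    [0, 0, 1, 0, 1, 0, 0, 1],   # student 4
    [1, 1, 1, 1, 1, 1, 0, 1],   # student 5
]

odd_totals = [sum(row[0::2]) for row in scores]   # items 1, 3, 5, 7
even_totals = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, 8

r_half, _ = pearsonr(odd_totals, even_totals)
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, whole-test r = {r_full:.2f}")
```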
7. FACTORS AFFECTING RELIABILITY
Fatigue: The score obtained on a test by subjects under different conditions may differ, and fatigue/tiredness has an important role in affecting the test score. Generally, students will score lower on a test taken under fatigue; thus, fatigue generally decreases the reliability coefficient.
Practice: The reliability of a test can be affected by the amount of practice. It is generally said that practice makes perfect. In the same manner, practice on a test will improve students' scores and thus increases the reliability coefficient of a test with greater practice.
Subject variability: The variation in scores will increase if there is more subject variability in a group. The greater the differences among subjects on the basis of gender, age, program, interests etc., the greater the variation in scores among individuals. In the same way, if a group is more homogeneous, such as a group of students within the same range of IQ, then the variation in scores will be smaller.
Test Length: The length of a test and the number of its items affect its reliability. Usually, a test with a greater number of items gives more reliable scores due to the cancelling of random positive and negative errors within the test. Thus, adding more items to a test increases its reliability; in the same manner, deleting items from a test lowers its reliability. One technique for deleting items from a test without decreasing its reliability is to remove the items which have lower reliability values in item analysis.
The Spearman-Brown prophecy formula is used for estimating the reliability of a test that is made shorter or longer, provided that the original reliability of the test is given. For example, if a test's original reliability is .60 and the number of items is increased or decreased, then the new reliability of the test will be:
r_x = K r / (1 + (K - 1) r)

where:
r_x = predicted reliability of the test with the added or deleted number of items
r = reliability of the original test
K = ratio of the number of items in the new test to the number of items in the original test
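The formula is easy to apply directly. A small Python sketch, reproducing the .70/.82 split-half example above and the .60 case just mentioned:

```python
# Minimal sketch of the Spearman-Brown prophecy formula.
def spearman_brown(r_original: float, k: float) -> float:
    """k = (items in new test) / (items in original test)."""
    return k * r_original / (1 + (k - 1) * r_original)

print(round(spearman_brown(0.70, 2), 2))    # 0.82: doubling a half test
print(round(spearman_brown(0.60, 2), 2))    # 0.75: doubling a .60 test
print(round(spearman_brown(0.60, 0.5), 2))  # 0.43: halving the same test
```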
8. VALIDITY
Validity refers to the extent to which a test measures what it is supposed to measure. In other words, it refers to the degree to which a test pertains to its objectives. Thus, for a measure or test to be valid, it must consistently measure the particular trait, characteristic or ability for which it was constructed.
According to the Standards for Educational and Psychological Testing (AERA/APA/NCME, 1985), validity "refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from a test". If correct and true inferences can be derived from a test, then such a test has greater validity for that specific inference.
Cohen and Swerdlik (2010) defined validity as "a judgment based on evidence about the appropriateness of inferences drawn from test scores", where an inference is a logical result or deduction. When a test score is used to make an inference about a person's trait or characteristic, the test score is assumed to represent that trait or characteristic.
A test may be valid for a particular group and for a particular purpose; however, it may not be valid for another group or for a different purpose. A test on English grammar may be valid for a high school group but not for university students. Moreover, no test or measurement technique is "universally valid" for all time, for all uses and for all users (Cohen & Swerdlik, 2010). Rather, tests may be shown to be valid within what we would characterize as reasonable boundaries of a contemplated usage. If those boundaries are exceeded, the validity of the test may be called into question (Cohen & Swerdlik, 2010).
The process of gathering and evaluating evidence about validity is called 'validation'. This is an important phase which the test developer has to undertake with the test takers for a specific purpose. It is necessary that the test developer mention the validity evidence in the test manual for the users/readers. However, sometimes test users conduct their own studies to check for validity with their own test takers; this is usually called local validation.
Some types of validity are:
1. Content-related validity
2. Criterion-related validity
3. Construct-related validity
8.1 Content-related validity
Content validity is sometimes also referred to as face validity or logical validity. According to Gay, Mills and Airasian (2011), content validity is the degree to which a test measures an intended content area. In order to determine content validity, the content domain of a test shall be well defined and clear.
Face validity is a quick way of ascertaining whether a test looks/appears to measure what it purports to measure. A primary-class math test shall contain numbers and figures and shall appear to be a math test rather than a language test. Face validity is not an exact measure of content validity and is only used as a quick initial screening when judging validity.
In order to judge content validity, one must pay attention to 'item validity' and 'sampling validity'. Item validity ensures that the test items represent the relevant content area of a given subject matter. The items in a math test shall include questions from its given content and shall not focus on evaluating language proficiency or on math items not included in the given syllabus. Similarly, an English test shall not contain items related to mathematical formulae or cover the subject matter of a science subject.
Sampling validity is concerned with how well the test items sample the total content area. A test with good sampling validity ensures that the test items adequately sample the relevant content area. The proportion of test items from the various units must be kept in consideration according to their importance. Although all the units or concepts cannot be covered in a single test, the examiner must ensure that the proportions of test items are in accordance with the significance of the concepts to be tested. If a Physics test contains items from the Energy chapter only and ignores the other chapters, then such a test has poor sampling validity.
Content validity can be judged by a content expert, the relevant subject teacher and/or a textbook writer. According to Gay et al. (2011), content validity cannot be measured quantitatively; rather, the experts carefully examine all the test items for item validity and sampling validity and then make a judgement about content validity. A good way of ensuring content validity is to make a table of specifications that includes the total number of units/topics to be tested, the number of items from each unit/topic and the different domains of instructional objectives. The table of specifications helps in observing the units from which most of the items are drawn and also the units which are under-represented or ignored.
Consider a secondary-grade Physics test drawn from five chapters, as given in the table of specifications below. The names of the units are mentioned, along with the number of test items assessing each of the instructional objectives given by Bloom's taxonomy. It is not a hard and fast rule to strictly follow the given proportions. The examiner decides which aspects or instructional objectives shall be given more or less weightage for each unit, while still ensuring that there is not too great a difference in the weightage assigned to each objective. Thus, some units may require more focus on the application side while others may focus on knowledge or comprehension. The objective is to rightly measure the skill that the examiner wants to measure.
Table 2. Table of specifications for a Physics test covering five units

Course content      Knowledge (30%)   Comprehension (40%)   Application (30%)   Total
Forces                     3                  5                     2             10
Energy sources             3                  4                     3             10
Turning Effect             2                  4                     4             10
Kinematics                 3                  3                     4             10
Atomic Structure           4                  4                     2             10
Total                     15                 20                    15             50
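One convenience of keeping the table in machine-readable form is that the weightages can be checked automatically. The Python sketch below simply mirrors the counts in the table above:

```python
# Minimal sketch: checking a table of specifications against its
# target weightages (counts taken from the table above).
table = {
    "Forces":           {"Knowledge": 3, "Comprehension": 5, "Application": 2},
    "Energy sources":   {"Knowledge": 3, "Comprehension": 4, "Application": 3},
    "Turning Effect":   {"Knowledge": 2, "Comprehension": 4, "Application": 4},
    "Kinematics":       {"Knowledge": 3, "Comprehension": 3, "Application": 4},
    "Atomic Structure": {"Knowledge": 4, "Comprehension": 4, "Application": 2},
}
targets = {"Knowledge": 0.30, "Comprehension": 0.40, "Application": 0.30}

total = sum(sum(row.values()) for row in table.values())  # 50 items
for objective, share in targets.items():
    actual = sum(row[objective] for row in table.values()) / total
    print(f"{objective}: target {share:.0%}, actual {actual:.0%}")
```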
8.2 Criterion-related validity
Other terms used for criterion-related validity are statistical validity or correlational validity. It provides evidence that the test items measure a specific criterion or trait for which they are designed. In order to determine the criterion validity of a test, the first step is to establish the criterion to be measured. Then a variety of test items are developed and tested. The test items are then correlated with the criterion, by finding the Pearson correlation, to determine how well they measure the set criterion. In case a number of tests are used to measure the criterion, multiple correlation procedures are used instead of the Pearson correlation.
Criterion-related validity can be further subdivided into concurrent validity and predictive validity.
8.2.1 Concurrent Validity
The main difference between concurrent and predictive validity is the time at which the criterion is
measured. For concurrent validity, the criterion is measured at approximately the same time as the
alternative measure. However, if the criterion being measured relates to some future time, then it is
called predictive validity.
The concurrent validity of a test is the degree to which the score on the test is related to the score on an already established test administered at the same time. For example, the GRE is an already standardized test measuring some specific skills and knowledge. Suppose a new test is developed that claims to measure the same skills and knowledge; it is then necessary to find the concurrent validity of the new test. For this purpose, the new test and the already established test are administered to some defined group of individuals at the same time. The scores obtained by the individuals on both tests are correlated to observe their similarity or difference. The coefficient of validity calculated from the correlation provides information about the concurrent validity of the new test: a high validity coefficient indicates good concurrent validity, and vice versa.
8.2.2 Predictive Validity
Predictive validity is the degree to which a test can predict the future performance of an individual. It is often used for selecting or grouping individuals. The score on an entry test serves as a predictor of the future performance of individuals in a specific program: if the marks on the entry test are high, then it can be predicted that the candidate will do well in the future, thus demonstrating the predictive validity of the entry test. Such tests include the ISSB test for entrance to the armed forces and the GRE and SAT tests for university performance. Likewise, medical test results such as high body fat, high cholesterol, smoking and hypertension are all predictive of future heart disease. It shall be kept in consideration that the predictive validity of tests like entry tests, the GRE, TOEFL etc. may vary due to a number of factors, such as differences in the curriculum studied by students, the textbooks used for preparation, the geographical location etc. Thus, there is no such thing as perfect predictive validity, and predictions will sometimes be false: not all students who pass the GRE or an entry test successfully complete the program in which they enrol. It is therefore not advisable to rely on a single test score for predicting future performance; rather, several indicators shall be used, such as marks in preceding exams, the interview score, comments of professors, performance on practical skills etc.
8.3 Construct-related validity
Construct-related validity concerns the measurement of a theoretical construct. The construct to be measured is unobservable, yet it exists theoretically: though the construct cannot be seen, its effects can be observed. Examples include intelligence quotient (IQ), anxiety, creativity, attitude etc. Tests have been developed for measuring specific constructs, and the researchers/test developers ensure that the test they construct accurately measures the specific construct for which it was designed. Thus, a test aimed at measuring the level of anxiety shall not measure creativity or IQ. The test score can then be used to make decisions related to the construct. If a test is unable to measure its construct, then its validity is questionable and conclusions based on its scores will be meaningless and inaccurate.
The process of determining construct validity is not simple. The measurement of a construct requires a strong theory that hypothesizes about the construct under study. For example, psychological theories hypothesize that individuals with higher anxiety will work longer on a problem than persons with a low anxiety level. Suppose a test measures anxiety level, some persons score higher on the test, and the same persons also work for a longer time on the task/problem under consideration; then we have ample evidence to support the theory and thus the construct validity of the test for measuring that construct.
[Figure: Validity and Reliability, showing types of reliability (Test-Retest, Equivalence, ANOVA, Alpha, KR-20) and of validity (Concurrent, Predictive). Source: James, Allen, James, & Dale (2005)]
Self-Assessment Questions
Q1. How is an achievement test different from an attitude scale?
Q2. Describe the uses of achievement tests and attitude scales.
Q3. What are the steps for developing a test?
Q4. Define reliability and the reliability coefficient.
Q5. Describe the different types of reliability.
Q6. What are the factors that affect reliability?
Q7. Define the concept of validity in measurement and its relation with reliability.
Q8. Explain the different types of test validity.
Q9. What are the qualities of a good test?
9. REFERENCES
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge University Press.
Cohen, R. J., & Swerdlik, M. E. (2010). Psychological testing and assessment: An introduction to tests and measurement (7th ed.). McGraw-Hill Primis.
Gay, L. R., Mills, G. E., & Airasian, P. W. (2011). Educational research: Competencies for analysis and applications. Pearson Higher Ed.
James, R. M., Allen, W. J., James, G. D., & Dale, P. M. (2005). Measurement and evaluation in human performance. USA: Human Kinetics.
McMillin, E. (2013). Steps to developing a classroom achievement test. Retrieved from https://prezi.com/fhtzfkrreh6p/steps-to-developing-a-classroom-achievement-test/
Mohamed, R. (2011). 12 characteristics of a good test. Retrieved from https://eltguide.wordpress.com/2011/12/28/12-characteristics-of-a-good-test/
Reliability coefficient. (n.d.). In AlleyDog.com psychology glossary. Retrieved from http://www.alleydog.com/glossary/definition.php?term=Reliability%20Coefficient
Reynolds, C. R., Livingston, R. L., & Willson, V. L. (2009). Measurement and assessment in education (2nd ed.). Upper Saddle River, NJ: Pearson Education.
Tarrant, M., Ware, J., & Mohammed, A. (2009). An assessment of functioning and non-functioning distractors in multiple-choice questions: A descriptive analysis. Retrieved from http://www.springerlink.com/content/e8k8618552465484/fulltext.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibitjbellavia9
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxDr. Sarita Anand
 

Kürzlich hochgeladen (20)

This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 

Unit. 7.doc

1. ACHIEVEMENT TEST

Achievement tests are designed to measure accomplishment. They are usually administered at the end of some learning activity or process to ascertain the degree to which the required task has been accomplished. For example, an achievement test for a Nursery-class student might assess the English alphabet, knowledge of numbers and key science concepts. Achievement tests thus measure the degree of learning on tasks that have already been taught or guided. The tasks may be specific and short, or comprehensive and detailed. An achievement test may be standardized, such as a secondary-class Chemistry test on formulae and valences or a Physics test on fundamental quantities or kinematics.
Another useful term is 'general achievement'. This relates to the measurement of learning experiences in one or more academic areas and usually involves a number of subtests, each aimed at measuring some specific learning experience or target. These subtests are sometimes called achievement batteries. Such batteries may be individually administered or group administered. They may consist of a few subtests, as does the Wide Range Achievement Test-4 (Wilkinson & Robertson, 2006) with its measures of reading, spelling, arithmetic, and (new to the fourth edition) reading comprehension. A battery may be as comprehensive as the STEP Series, which includes subtests in reading, vocabulary, mathematics, writing skills, study skills, science, and social studies; a behavior inventory; an educational environment questionnaire; and an activities inventory. Some batteries, such as the SRA California Achievement Tests, span kindergarten through grade 12, whereas others are grade- or course-specific. Some batteries are constructed to provide both norm-referenced and criterion-referenced analyses. Others are concurrently normed with scholastic aptitude tests to enable a comparison between achievement and aptitude. Some batteries are constructed with practice tests that may be administered several days before actual testing to help students familiarize themselves with test-taking procedures. One popular instrument, appropriate for use with persons aged 4 through adulthood, is the Wechsler Individual Achievement Test-Second Edition, otherwise known as the WIAT-II (Psychological Corporation, 2001). This instrument is used not only to gauge achievement but also to develop hypotheses about achievement versus ability. It features nine subtests that sample content in each of the seven areas listed in a past revision of the Individuals with Disabilities Education Act: oral expression, listening comprehension, written expression, basic reading skill, reading comprehension, mathematics calculation, and mathematics reasoning. For a particular purpose, a battery that focuses on achievement in a few select areas may be
preferable to one that attempts to sample achievement in several areas. On the other hand, a test that samples many areas may be advantageous when an individual comparison of performance across subject areas is desirable. If a school or a local school district undertakes to follow the progress of a group of students as measured by a particular achievement battery, then the battery of choice will be one that spans the targeted subject areas in all the grades to be tested. If the ability to distinguish individual areas of difficulty is of primary concern, then achievement tests with strong diagnostic features will be chosen. Although achievement batteries sampling a wide range of areas, across grades, and standardized on large, national samples of students have much to recommend them, they also have certain drawbacks. For example, such tests usually take years to develop; in the interim the items, especially in fields such as social studies and science, may become outdated. Further, any nationally standardized instrument is only as good as the extent to which it meets the (local) test user's objectives.

1.1 Purposes/uses of achievement test

i. To measure students' mastery of certain essential skills and knowledge, such as proficiency in recalling facts, understanding concepts and principles, and using skills.
ii. To measure students' growth/progress over time for promotion purposes. This helps the school in making decisions about students' placement in a specific program, class or group, or about promotion to the next level.
iii. To rank pupils in terms of their achievement by comparing an individual's performance to the norm or average performance of his/her group (norm-referenced).
iv. To identify and diagnose pupils' problems. Given a federal mandate to identify children with a "severe discrepancy between achievement and intellectual ability" (Procedures for Evaluating Specific Learning Disabilities, 1977, p. 65083), it can readily be
appreciated how achievement tests—as well as intelligence tests—could play a role in the diagnosis of a specific learning disability (SLD).
v. To evaluate the effectiveness of the teacher's instructional method.
vi. To encourage good study habits in students and motivate them to work hard.

2. ATTITUDE SCALE

An attitude may be defined formally as a presumably learned disposition to react in some characteristic manner to a particular stimulus. The stimulus may be an object, a group, an institution—virtually anything. Although attitudes do not necessarily predict behavior (Tittle & Hill, 1967; Wicker, 1969), there has been great interest in measuring the attitudes of employers and employees toward each other and toward numerous variables in the workplace. As the name implies, this type of scale tries to measure an individual's beliefs, attitudes and perceptions towards oneself, others, or some phenomenon, activity or situation.

2.1 Measuring Attitude

Attitude can be measured using self-report, tests and/or questionnaires. However, it is not easy to measure attitude accurately, as individuals differ greatly in their ability to introspect correctly about their attitudes and in their level of self-awareness. Moreover, some people feel reluctant to share or report their attitudes to others. It may also happen that people hold or form attitudes that they were not previously aware of.

Measuring attitude was treated systematically by Likert (1932) in his monograph, "A Technique for the Measurement of Attitudes", which describes the design of an instrument for measuring attitude. This scale seeks an individual's response to a number of statements in terms of his/her level of agreement or disagreement. The options may be Strongly Agree, Agree, Undecided, Disagree and Strongly Disagree. The degree of agreement or disagreement reflects the individual's attitude towards a certain phenomenon or statement. Each response is assigned a specific score from 1 to 5: for a positive statement, 5 is assigned to Strongly Agree and 1 to Strongly Disagree.

According to Thurstone (1928), attitude can be measured, as argued in his article "Attitudes Can Be Measured". More recently, the research of Banaji (2001) further supported this contention in the article "Implicit Attitudes Can Be Measured". Implicit attitudes are "introspectively unidentified (or inaccurately identified) traces of past experience that mediate favorable or unfavorable feeling, thought, or action toward social objects" (Greenwald & Banaji, 1995, p. 8). Stated another way, they are nonconscious, automatic associations in memory that produce dispositions to react in some characteristic manner to a particular stimulus. Implicit attitude can be measured using the Implicit Association Test (IAT), a computerized sorting task in which implicit attitudes are gauged with reference to the test taker's reaction times: the individual is shown a particular stimulus and is asked to categorize it, or associate another word or phrase with it, without taking much time. For example, a person's attitude can be gauged by presenting the word 'terror' and having the person quickly associate other favorable or unfavorable words with it. Using the IAT or similar protocols, implicit attitudes toward a wide range of stimuli can be measured; they have been studied in relation to racial prejudice, threats, voting behavior, professional ethics, self-esteem, drug use and so on. Measuring implicit attitude is now frequently used in consumer psychology to study consumer preferences. In consumer psychology, attitude may be explored by asking a series of questions about a product or choice; the individual's response is noted, which may reflect his or her beliefs or thinking.
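To make the Likert scoring rule described earlier in this section concrete, the following is a minimal Python sketch of how a respondent's answers might be scored. The item names and the set of reverse-keyed (negatively worded) statements are hypothetical, not taken from any published scale.

# Minimal sketch of Likert scoring (hypothetical items).
# For positive statements: Strongly Agree = 5 ... Strongly Disagree = 1.
# Negatively worded statements are reverse-keyed: score = 6 - raw score.

SCORES = {"SA": 5, "A": 4, "U": 3, "D": 2, "SD": 1}
REVERSE_KEYED = {"item3"}  # hypothetical negatively worded statement

def score_respondent(responses):
    """Return the total attitude score for one respondent.

    responses: dict mapping item name -> option label ("SA".."SD").
    """
    total = 0
    for item, option in responses.items():
        raw = SCORES[option]
        total += (6 - raw) if item in REVERSE_KEYED else raw
    return total

# Example: three-item scale, one reverse-keyed item.
print(score_respondent({"item1": "SA", "item2": "A", "item3": "SD"}))  # 5 + 4 + 5 = 14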
The responses of people can be sought through a survey or opinion poll using questionnaires, emails, Google Forms, social media posts, etc. Surveys and polls may be conducted face to face, online, by telephone interview, or by mail. Face-to-face interaction helps in getting quicker responses and in making sure the questions are well understood; moreover, the researcher can present or show the products directly and seek people's responses to them. However, face-to-face interaction also has a drawback: sometimes people react in a way they feel is favorable to the researcher, or the researcher's gestures influence the respondents' choices.

Another type of scale used to measure attitude is the semantic differential technique. In this type of scale, the respondents are given two opposite extremes and asked to place a mark on one of the seven spaces in the continuum according to their level of preference. The bipolar extremes might be easy-difficult, good-bad, weak-strong, etc.

Strong : : : : : : Weak
Decisive : : : : : : Indecisive
Good : : : : : : Bad
Cheap : : : : : : Expensive

3. STEPS FOR TEST DEVELOPMENT

The creation of a good test is not a matter of chance; rather, it requires sound knowledge of the principles of test construction (Cohen & Swerdlik, 2010). The development of a good test requires a number of steps; however, these steps are not fixed, as various authors have suggested
different steps/stages for developing a test. Following are some of the general steps for test development.

1. Identification of objectives

This is one of the most important steps in developing any test: the test authors need to consider in detail what exactly they aim to measure, that is, the purpose of the test. Defining the purpose of the test clearly is especially important because it increases the possibility of achieving high validity, since it specifies what exactly the test is required to measure. There are two kinds of objectives: behavioral and non-behavioral. As the names suggest, behavioral objectives deal with "activities that are observable and measurable whereas non-behavioral objectives specify activities that are unobservable and not directly measurable" (Reynolds et al., 2009, p. 177). Without predefined objectives, a test will be meaningless and purposeless.

2. Deciding about test format

The format/design of the test is another important element in constructing a test. The test developer needs to decide which format/design will be the most suitable for achieving the set objectives. The format may be objective type, essay type, or both. The examiner must also decide what types of objective items shall be included, whether multiple choice, fill in the blanks, matching items, short answer, etc. The test author further decides the number of marks assigned to each format and the total amount of time allowed to complete the test.

3. Making a table of specifications

A table of specifications serves as a test blueprint. It helps in ensuring a suitable number of items from the whole content, as well as specifying the type of assessment objectives the items will
be testing. The table ensures that all levels of instructional objectives are represented in the test questions. It lists the number of items from each content area, the weightage assigned to each content area, and the type of instructional objective each item will be measuring, whether recall, understanding or application. Last but not least, the examiner shall also decide the weightage given to each format (objective and subjective) within the test and the weightage in terms of difficulty level (easy, moderate, difficult). For example, in developing an English test, the teacher can focus on the following areas (a small blueprint sketch follows this list):

 What language skills should be included – will there be a list of grammatical structures and lexis, etc.;
 What sort of tasks are required – objectively assessable, integrative, simulated "authentic", etc.;
 How many items are required for each section, and what their relative weight will be – equal weighting or extra weighting for more difficult items;
 What test methods are to be used – multiple choice, gap filling, matching, transformations, picture descriptions, essay writing, etc.;
 What rubrics are to be used as instructions for students – will examples be included to help students know what is expected, and should the assessment criteria be added to the rubric;
 What assessment criteria will be used – how important is accuracy, spelling, length of written text, etc.
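As a concrete illustration of such planning, below is a minimal Python sketch of a table of specifications held as data, with a check that the weightings add up. The unit names, item counts and percentages are hypothetical, not taken from any real syllabus.

# Minimal sketch: a test blueprint (table of specifications) as data.
# Unit names, item counts and objective weightings are hypothetical.

BLUEPRINT = {
    # unit: items per objective level (knowledge, comprehension, application)
    "Grammar":       {"knowledge": 3, "comprehension": 5, "application": 2},
    "Vocabulary":    {"knowledge": 4, "comprehension": 4, "application": 2},
    "Comprehension": {"knowledge": 2, "comprehension": 4, "application": 4},
}

def totals(blueprint):
    """Total items per objective level and overall, for weighting checks."""
    per_level = {}
    for counts in blueprint.values():
        for level, n in counts.items():
            per_level[level] = per_level.get(level, 0) + n
    return per_level, sum(per_level.values())

per_level, grand_total = totals(BLUEPRINT)
for level, n in per_level.items():
    print(f"{level}: {n} items ({100 * n / grand_total:.0f}% of test)")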
4. Writing Items

The examiner writes the items keeping in mind the table of specifications and the difficulty level of the items. It is debatable whether items should be arranged randomly or should progress from easy to difficult. The examiner should ensure that the test can be completed within the stipulated time. The language of the test items should be simple, brief and lucid, and should be checked for grammar, spelling and punctuation.

5. Preparation of Marking Scheme

The test developer decides the number of marks to be assigned to each item, or to the relevant bits of detail in the students' answers. This is necessary to ensure consistency in marking and to make scoring more scientific and systematic. Essay-type questions can be divided into smaller components, with marks defined for each important concept/point.

As regards developing a standardized test, the following steps are given by Cohen and Swerdlik (2010), though they can also be applied to custom tests made by teachers, researchers and recruiters. The process encompasses five stages:

1. test conceptualization
2. test construction
3. test tryout
4. item analysis
5. test revision

The process of test development starts from conceptualizing the idea of the test and the purpose for which the test is to be constructed. A test may be designed around some emerging phenomenon, problem, issue or need. Test conceptualization might also include the construct or the concepts which the test should measure. What kind of objectives or behavior should the test
measure in the presence of other such tests? Is there any need for making a new test, or can an existing
test be used for the set purpose? How can the new test be better than the existing one? Who will be the users of the test: students, teachers, or employers? What will be the content of the test? How will the test be administered, individually or in groups? Will the test be written, oral or practical? What will be the format of the test items, and what will be the proportion of objective and subjective items? Who will benefit from the test? Will there be any harmful effect of the test, and if so, on whom?

Based on the purpose, the needs and the objectives to be achieved, the items for the test are constructed or selected. The test is then pilot tested on a sample to try out whether the items are appropriate for achieving the set objectives. Based on the results from the test tryout or pilot test, the items are put to item analysis. This requires the use of statistical procedures to determine the difficulty level of items, reliability and validity. The process helps in selecting the appropriate items for the test, while inappropriate items are revised or deleted. This finally yields a revised draft of the test, better than the initial version. The process may be repeated until a refined, standardized version is obtained.

4. QUALITIES OF A GOOD TEST
In constructing a test, the test developer should aim at making a good test. A bad test defeats the purpose of testing and is useless to administer. According to Mohamed (2011), a good test should have the following properties.

Objectivity

It is very important for a test to be objective. A test with high objectivity eliminates personal biases and influences in scoring and interpreting the test result. This can be achieved by including more objective-type items in the test: multiple choice questions, fill in the blanks, true-false, matching items, short question-answers, etc. In contrast, essay questions are subjective: different examiners may arrive at different marks while checking such questions, depending on the mood, knowledge level and personal likes and dislikes of the person marking. However, essay-type questions can be made more objective through a well-defined marking scheme that assigns marks to the small bits of important and relevant information in the long answers.

Comprehensiveness

A good test should cover the content area that is taught. The items in the test should be drawn from different areas of the course content. If one topic or area is assigned more question items while other areas are neglected, the test will not be a good one. A good English test may have items taken from composition, comprehension, dialogue, creative writing, grammar, vocabulary, etc. Meanwhile, due importance should be given to the important parts of the content according to their utility and significance.

Validity

A valid test rightly measures what it is supposed to measure: it tests what it ought to test. For instance, a good test of control of grammar should have no difficult lexical items. Validity is explained in detail in the validity section.
Reliability

Reliability of a test refers to the degree of consistency with which it measures what it is intended to measure. If a test is retaken by the same students under the same conditions, the scores will be almost the same, provided that the time between the test and the retest is of reasonable length. In this case the test is said to provide consistency in measuring whatever is being evaluated. Details about reliability are given in the reliability section.

Discriminating Power

The discriminating power of a test is its power to discriminate between the upper and lower groups who took the test. A good test should not contain only difficult items or only easy items; rather, it should contain items of different difficulty levels so as to distinguish students of different ability levels. The questions should increase progressively in difficulty to reduce stress and tension in students.

Practicability

The test should be realistic and practicable. It should not measure unrealistic targets or objectives. It should be easy to administer as well as easy to score, and it should be economical, not wasting too many resources, energies and efforts. Competitive tests may deliberately be difficult to complete within the stipulated time in order to select candidates with higher ability and shorter reaction times, because such tests have that specific purpose. Otherwise, classroom tests should keep in mind the individual differences of students and provide ample opportunity for completion.

Simplicity
It refers to clarity in language, correctness, adequacy and lucidity. Ambiguous questions and items with multiple meanings should be avoided. The students should be very clear about what a question is asking and how to answer it. Sometimes students get confused about the possible answers because of a lack of clarity in the questions.

5. RELIABILITY

According to Gay, Mills, & Airasian (2011), "Reliability is the degree to which a test consistently measures whatever it measures". Thorndike (2005) refers reliability to the "accuracy or precision of a measurement procedure", while Mehrens and Lehmann (1991) defined reliability as "the degree of consistency between two measures of the same thing". Reliability also signifies the repeatability of observations or scores on a measurement. Other terms used to describe reliability include dependability, stability, accuracy and regularity in measurement.

For a test, high reliability means that a person gets the same score, or nearly the same, each time the test is administered to him or her. If the person obtains a different score each time the test is administered, the test's reliability is questionable. Reliability can be ascertained by the examiner by administering the same test on two different occasions and comparing the scores obtained on the two occasions to determine the degree of reliability. Another method is to test students on one test and then administer another, different test; the scores obtained by the students on the two tests may be compared to find the reliability of the two tests. If there is much difference between students' scores on the two tests, the two tests have poor reliability. Essay-type questions may have poor reliability, as students get different scores each time the answers are marked. In comparison, multiple choice
questions have comparatively higher reliability than essay-type questions. A test may not be reliable in all settings: it may be reliable in a specific situation, under specific circumstances and with a specific group of subjects, yet unreliable in a different situation, with a different group of students, or under different circumstances.

5.1 Reliability Coefficient

In physical measurement, or when using different tests to ascertain reliability, it may be difficult to achieve 100% consistency in scores. What matters is the degree of closeness or consistency between the measurements of the different tests. For this purpose, the degree of reliability of a test is expressed numerically as the reliability coefficient. According to the Merriam-Webster dictionary, a reliability coefficient is a measure of the accuracy of a test or measuring instrument obtained by measuring the same individuals twice and computing the correlation of the two sets of measures. The reliability coefficient is thus a way of confirming how accurate a test or measure is; it essentially measures consistency in scoring. It is found by giving the test to the same subjects more than once and determining the correlation between the two scores, which also reveals the strength of the relationship and the similarity between the two scores. If the two scores are close enough, the test can be said to be accurate and to have good reliability. The variation in the scores is called error variance, and the source of the variation is called the source of error. The smaller the error, the more consistent and reliable the score, and vice versa. For example, an individual could be given a measure of self-esteem and then given the same measure again; the two scores would be correlated and a reliability coefficient produced. If the scores are very similar, the measure can be said to be reliable, consistently measuring the same thing, which in this case is self-esteem. The maximum value of the reliability coefficient is 1.00, which means that the test is perfectly reliable, while the minimum value is 0.00, which indicates no reliability.
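As a minimal sketch of the computation just described, the following Python snippet correlates two administrations of the same test using the Pearson product-moment formula. The score lists are hypothetical.

# Minimal sketch: reliability coefficient as the Pearson correlation
# between two administrations of the same test (hypothetical scores).
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

first_administration = [42, 37, 50, 45, 33, 48, 40]
second_administration = [40, 38, 49, 44, 35, 47, 41]
print(f"reliability coefficient = {pearson_r(first_administration, second_administration):.2f}")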
In actual situations, however, it is not possible to have a perfectly reliable test; the coefficient of reliability will be less than 1.00. The reason is the effect of various factors and errors in measurement. These include errors caused by the test itself, such as ambiguous test items that are interpreted differently by different students. Differences in the conditions of students (emotional, physical, mental, etc.) are also responsible for producing errors in measurement, such as fatigue, the arousal of specific emotions (anger, fear, depression, etc.) and lack of motivation. Moreover, the selection of test items and their construction, sequence, wording, etc. may also produce measurement error and thus affect the reliability coefficient.

5.2 Relationship between Validity and Reliability

A test which is valid is also reliable; however, a test which is reliable is not necessarily valid. If a test is valid, it is rightly measuring the purpose/objectives it is supposed to measure, and the scores obtained on such a test are also reliable because, measuring its intended purpose correctly, the test will yield consistent scores, whether low or high. In comparison, a reliable test, one on which students' scores come out consistently the same, may not be rightly measuring its intended purpose and may thus be invalid. Hence a reliable test may or may not be valid, but a valid test must be reliable. A test with a reliability coefficient of 0.93 is a highly reliable test, but is it really measuring the set objectives from the given content? If it measures its intended purpose, the test is also valid; if it does not measure the concepts from the given content, it is invalid. [For more detail see Gay, Mills, & Airasian (2011).]

6. RELIABILITY TYPES

Some types of reliability are given below:
 Test-Retest Reliability
 Equivalence Reliability or inter-class reliability
 Split-Halves Reliability

6.1 Test-Retest Reliability

One of the simplest ways to determine the reliability of a test is to test and retest. Test-retest reliability is the degree to which scores on a test are consistent over time. The subjects are given the same test on two occasions, and the scores obtained are compared to see how consistent they are across the two administrations. This can be quantified by the correlation between the two sets of scores: if the correlation coefficient is high, the test has a high degree of reliability. This method is seldom used by subject teachers but is frequently used by test developers and commercial test publishers such as those of IELTS, TOEFL, GRE, etc. One issue that arises here is how much time should elapse between the two administrations. If the interval is short, say a few hours or days, the chances of students remembering their previous answers are high; they will tend to score the same, which inflates the reliability coefficient. If the interval is long, the ability to perform well on the test increases because of learning over time, which also affects the reliability coefficient. Thus, in reporting test-retest reliability, the time interval between the tests should be mentioned along with the reliability coefficient. This kind of reliability is ensured for aptitude and achievement tests so that they measure their intended purpose each time they are administered.

6.2 Equivalence Reliability or Inter-Class Reliability

This relates to two tests that are similar in every respect except the test items. The reliability between the two tests is measured, and if the coefficient of reliability, known in this case as the coefficient of equivalence, is high, the two tests are highly reliable, and vice versa. It should be kept in mind that the two tests shall measure the same variables and have the same
number of items, structure and difficulty level. Besides, the directions for administering the two tests shall be the same, with a similar scoring style and interpretation. The purpose is to make the scoring on both tests consistent and reliable: no matter which version a student takes, his or her score should be the same on both. This is usually used in situations where the number of candidates is very large, or where a test is to be taken on two occasions. In such situations, the examiner constructs different versions of the same test so that each group of students can be administered the test at a different time without fear of test items leaking or being repeated. In some circumstances, researchers construct equivalent pre-tests and post-tests to measure the actual difference in performance, removing the measurement error that would arise from recalling the answers given on the first test. The procedure for establishing equivalence reliability is to construct two versions of the test measuring similar objectives taken from the same content area, with the same number of items, difficulty level, etc. One form of the test is administered to an appropriate group; after some time, the second form is administered to the same group. The scores obtained by the students on the two forms are then correlated to find the coefficient of reliability, and the difference in the scores is treated as error.

6.3 Split-Halves Reliability

This type of reliability is used for measuring the internal consistency of the items within a single test. It is theoretically the same as finding equivalence reliability; here, however, the two parts are taken from the same test. This reliability can be found by administering the test only once, so the error caused by the time interval, by students' condition (physical, mental, emotional, etc.) or by using two groups is minimized. As the name indicates, the items of a single test are divided into two halves to form two equivalent parts. The two parts can be obtained by various
methods, e.g., dividing the test items into two halves with an equal number of items in each half, or splitting the items into odd-numbered and even-numbered halves. In case the test is divided into odd- and even-numbered items, the reliability is calculated as follows. First, the test is administered to the subjects and the items are marked. The items are divided into two halves by combining all the odd items in one half and all the even items in the other. The scores obtained on the odd- and even-numbered items are totaled separately, so there are two scores for each student. The two sets of scores are then correlated using the Pearson product-moment correlation coefficient. If the correlation coefficient is high, the two halves of the test are highly reliable, and vice versa.

The reliability coefficient obtained from this correlation needs to be adjusted, since it is the reliability of only half the test; the actual reliability of the whole test will be higher. The correction is computed using the Spearman-Brown prophecy formula:

r_total = (2 × r_half) / (1 + r_half)

Suppose the reliability coefficient for a 40-item test was .70, obtained by correlating the scores on the 20 odd and 20 even items. The reliability coefficient for the whole test (40 items) is then:

r_total = 2(.70) / (1 + .70) = .82

The advantage of split-halves reliability is that one test is administered only once. Thus, it can be used economically and conveniently by classroom teachers and researchers to collect data about a test.
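The following is a minimal Python sketch of the odd/even split-half procedure with the Spearman-Brown correction just described. The per-item score matrix is hypothetical, and statistics.correlation assumes Python 3.10 or later.

# Minimal sketch: split-half reliability with Spearman-Brown correction.
# Each row holds one student's per-item scores (1 = correct, 0 = wrong);
# the data are hypothetical.
from statistics import correlation

item_scores = [
    [1, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
]

# Total score on odd-numbered items (indices 0, 2, ...) and even-numbered items.
odd_totals = [sum(row[0::2]) for row in item_scores]
even_totals = [sum(row[1::2]) for row in item_scores]

r_half = correlation(odd_totals, even_totals)   # half-test reliability
r_total = (2 * r_half) / (1 + r_half)           # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, whole-test r = {r_total:.2f}")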
7. FACTORS AFFECTING RELIABILITY

Fatigue: The scores obtained on a test by subjects under different conditions may differ, and fatigue plays an important role in affecting the test score. Tired students generally score lower, so fatigue generally decreases the reliability coefficient.

Practice: The reliability of a test can be affected by the amount of practice. Just as practice is said to make perfect, practice on a test will improve students' scores, and greater practice thus increases the reliability coefficient.

Subject variability: The variation in scores increases when there is more variability among the subjects in a group. The greater the differences among subjects in gender, age, program, interests, etc., the greater the variation in scores among individuals. Conversely, if a group is more homogeneous, such as a group of students within the same IQ range, the variation in scores will be smaller.

Test length: The length of a test and the number of its items affect its reliability. A test with a greater number of items usually gives more reliable scores, because random positive and negative errors within the test cancel out. Thus, adding more items to a test increases its reliability, and deleting items lowers it. One technique for deleting items without decreasing reliability is to remove the items that show the lowest reliability values in item analysis. The Spearman-Brown prophecy formula estimates the reliability of a test that is made shorter or longer, provided the reliability of the original test is given. For example, if the original reliability of a test is .60 and the number of items is increased or decreased, the new reliability is found from:

r_x = (K × r) / (1 + (K − 1) × r)

where
r_x = predicted reliability of the test with added or deleted items
r = reliability of the original test
K = ratio of the number of items in the new test to the number of items in the original test
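As a worked illustration of this formula, here is a small Python sketch. The original reliability of .60 comes from the text's own example; the length ratios (doubling and halving) are hypothetical choices.

# Spearman-Brown prophecy: predicted reliability when a test is
# lengthened or shortened by a factor K (ratio of new to original length).

def spearman_brown(r, k):
    """Predicted reliability of a test whose length is scaled by k."""
    return (k * r) / (1 + (k - 1) * r)

r_original = 0.60
print(f"doubled test (K = 2):   r = {spearman_brown(r_original, 2):.2f}")    # 0.75
print(f"halved test  (K = 0.5): r = {spearman_brown(r_original, 0.5):.2f}")  # 0.43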
8. VALIDITY

Validity refers to the extent to which a test measures what it is supposed to measure; in other words, it refers to the degree to which a test fulfils its objectives. Thus, for a measure or test to be valid, it must consistently measure the particular trait, characteristic or ability for which it was constructed. According to the Standards for Educational and Psychological Testing (AERA/APA/NCME, 1985), validity "refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from a test". If correct and true inferences can be derived from a test, then such a test has greater validity for measuring that specific inference. Cohen and Swerdlik (2010) defined validity as "a judgment based on evidence about the appropriateness of inferences drawn from test scores", an inference being a logical result or deduction. When a test score is used to make an inference about a person's trait or characteristic, the test score is assumed to represent that trait or characteristic. A test may be valid for a particular group and for a particular purpose, yet not valid for another group or a different purpose: a test on English grammar may be valid for a high school group but not for university students. Moreover, no test or measurement technique is "universally valid" for all time, for all uses, and for all users (Cohen & Swerdlik, 2010). Rather, tests may be shown to be valid within what we would characterize as reasonable boundaries of a contemplated usage; if those boundaries are exceeded, the validity of the test may be called into question (Cohen & Swerdlik, 2010). The process of gathering and evaluating evidence about validity is called 'validation'. This is an important phase which the test developer has to undertake with the test takers for a specific
purpose. The test developer should report the validity evidence in the test manual for users and readers. Sometimes, however, test users conduct their own validation studies with their own test takers, which is usually called local validation.

Some types of validity are:

1. Content-related validity
2. Criterion-related validity
3. Construct-related validity

8.1 Content-related validity

Content validity is sometimes also referred to as face validity or logical validity. According to Gay, Mills, & Airasian (2011), content validity is the degree to which a test measures an intended content area. In order to determine content validity, the content domain of a test shall be well defined and clear. Face validity is a quick way of ascertaining whether a test looks or appears to measure what it purports to measure: a primary-class math test shall contain numbers and figures and shall appear to be a math test rather than a language test. Face validity is not an exact way of estimating content validity and is only used as a quick initial screening when judging validity.

In order to judge content validity, one must pay attention to 'item validity' and 'sampling validity'. Item validity ensures that the test items represent the relevant content area of a given subject matter. The items in a math test shall include questions from its given content and shall not focus on evaluating language proficiency or on math items not included in the given syllabus. Similarly, an English test shall not contain items related to mathematical formulae or cover the subject matter of a science subject.
Sampling validity is concerned with how well the test items sample the total content area. A test with good sampling validity ensures that the items adequately sample the relevant content area. The proportion of test items from the various units must be kept in consideration according to their importance. Although all the units or concepts cannot be covered in a single test, the examiner must ensure that the proportions of test items are in accordance with the significance of the concepts to be tested. If a Physics test contains items from the Energy chapter only and ignores other chapters, the test will have poor sampling validity. Content validity can be judged by a content expert, the relevant subject teacher and/or a textbook writer. According to Gay et al. (2011), content validity cannot be measured quantitatively; rather, the experts carefully examine all the test items for item validity and sampling validity and then make a judgment about content validity.

A good way of ensuring content validity is to make a table of specifications that includes the total number of units/topics to be tested, the number of items from each unit/topic and the different domains of instructional objectives. The table of specifications makes it easy to see the units from which most items are drawn, and also the units that are under-represented or ignored. Consider a secondary-grade Physics test drawn from five chapters, as given in the table of specifications below. The names of the units are listed along with the number of test items assessing each of the instructional objectives given by Bloom's taxonomy. It is not a hard and fast rule to follow the given proportions strictly: the examiner decides which aspects or instructional objectives shall be given more or less weightage for each unit, while still ensuring that there is not too great a difference in the weightage assigned to each objective. Thus, some units may require more focus on application, while others may focus on knowledge or comprehension. The objective is to rightly measure the skill that the examiner wants to measure.
Table 2. Table of specifications for a Physics test covering five units

Course content      Knowledge (30%)  Comprehension (40%)  Application (30%)  Total
Forces                     3                 5                   2             10
Energy sources             3                 4                   3             10
Turning Effect             2                 4                   4             10
Kinematics                 3                 3                   4             10
Atomic Structure           4                 4                   2             10
Total                     15                20                  15             50

8.2 Criterion-related validity

Other terms used for criterion-related validity are statistical validity or correlational validity. It provides evidence that the test items measure the specific criterion or trait for which the test is designed. In order to determine the criterion validity of a test, the first step is to establish the criterion to be measured. A variety of test items are then developed and tested, and the items are correlated with the criterion, using the Pearson correlation, to determine how well they measure the set criterion. In case a number of tests are used to measure the criterion, multiple correlational procedures are used instead of the Pearson correlation. Criterion-related validity can be further subdivided into concurrent validity and predictive validity.

8.2.1 Concurrent Validity
The main difference between concurrent and predictive validity is the time at which the criterion is measured. For concurrent validity, the criterion is measured at approximately the same time as the alternative measure; if the criterion being measured relates to some future time, it is called predictive validity. The concurrent validity of a test is the degree to which the score on the test is related to the score on an already established test administered at the same time. For example, the GRE is an already standardized test measuring certain specific skills and knowledge. Suppose a new test is developed that claims to measure the same skills and knowledge; it is then necessary to find the concurrent validity of the new test. For this purpose, the new test and the already established test are administered to a defined group of individuals at the same time, and the scores obtained on the two tests are correlated to observe their similarities or differences. The validity coefficient calculated from this correlation provides information about the concurrent validity of the new test: a high validity coefficient indicates good concurrent validity, and vice versa.

8.2.2 Predictive Validity

Predictive validity is the degree to which a test can predict the future performance of an individual. It is often used for selecting or grouping individuals. The score on an entry test serves as a predictor of individuals' future performance in a specific program: if the marks on the entry test are high, it can be predicted that the candidate will do well in the future, thus demonstrating the predictive validity of the entry test. Such tests include the ISSB test for entrance to the armed forces and the GRE and SAT tests for university performance. Likewise, medical indicators such as high body fat, high cholesterol, smoking and hypertension are all predictive of future heart disease. It should be kept in mind that the predictive validity of tests like entry tests, the GRE, the TOEFL, etc. may vary due to a number of factors, such as differences in the curriculum studied by
It should be kept in mind that the predictive validity of tests such as entry tests, the GRE, or the TOEFL may vary with a number of factors, including differences in the curriculum studied by students, the textbooks used for preparation, and geographical location. Thus there is no such thing as perfect predictive validity, and predictions will sometimes be wrong: not all students who pass the GRE or an entry test will successfully complete the programme in which they enrol. It is therefore not advisable to rely on a single test score for predicting future performance; rather, several indicators should be used, such as marks in preceding examinations, interview scores, comments of professors, and performance on practical skills.

7.3 Construct-related validity

Construct-related validity concerns the measurement of a theoretical construct. The construct to be measured is unobservable, yet it exists theoretically: although it cannot be seen directly, its effects can be observed. Examples include intelligence quotient (IQ), anxiety, creativity, and attitude. Tests have been developed for measuring specific constructs, and researchers and test developers must ensure that the test they construct accurately measures the construct for which it was designed. Thus a test aimed at measuring anxiety should not measure creativity or IQ. The test score can then be used to make decisions related to the construct; if a test is unable to measure its construct, its validity is questionable and any conclusion based on its score will be meaningless and inaccurate.

The process of determining construct validity is not simple. Measuring a construct requires a strong theory that generates hypotheses about the construct under study. For example, psychological theories hypothesize that individuals with high anxiety will work longer on a problem than individuals with low anxiety. Suppose a test is designed to measure anxiety level, some persons score higher on the test, and the same persons are then observed to work for a longer time on the task or problem under consideration; we then have evidence supporting the theory, and thus the construct validity of the test.
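A minimal sketch of this evidence-gathering step, with invented data: if the theory predicts that higher anxiety scores go with longer time on task, a positive correlation between the two supports the construct interpretation of the anxiety test.

```python
# Sketch: construct-validity evidence. Theory predicts that higher
# anxiety scores accompany longer time spent on a problem; a positive
# correlation between the two supports the construct interpretation.
# Data are invented for illustration.
from statistics import correlation  # Python 3.10+

anxiety_scores  = [22, 35, 18, 40, 27, 31, 15, 38]  # scores on the new test
minutes_on_task = [14, 25, 11, 30, 18, 22,  9, 27]  # observed persistence

print(f"Correlation with predicted behaviour: "
      f"{correlation(anxiety_scores, minutes_on_task):.2f}")
```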
Figure: Validity and Reliability, showing the types of reliability (test-retest, equivalence, ANOVA, alpha, KR-20) and of criterion-related validity (concurrent, predictive). Source: James, Allen, James, & Dale (2005)

Self-Assessment Questions

Q1. How is an achievement test different from an attitude scale?
Q2. Describe the uses of achievement tests and attitude scales.
Q3. What are the steps for developing a test?
Q4. Define reliability and the reliability coefficient.
Q5. Describe the different types of reliability.
Q6. What are the factors that affect reliability?
Q7. Define the concept of validity in measurement and its relation to reliability.
Q8. Explain the different types of test validity.
Q9. What are the qualities of a good test?

8. REFERENCES

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge University Press.

Cohen, R. J., & Swerdlik, M. E. (2010). Psychological testing and assessment: An introduction to tests and measurement (7th ed.). McGraw-Hill Primis.

Gay, L. R., Mills, G. E., & Airasian, P. W. (2011). Educational research: Competencies for analysis and applications. Pearson Higher Ed.

James, R. M., Allen, W. J., James, G. D., & Dale, P. M. (2005). Measurement and evaluation in human performance. Human Kinetics.

McMillin, E. (2013). Steps to developing a classroom achievement test. Retrieved from https://prezi.com/fhtzfkrreh6p/steps-to-developing-a-classroom-achievement-test/#

Mohamed, R. (2011). 12 characteristics of a good test. Retrieved from https://eltguide.wordpress.com/2011/12/28/12-characteristics-of-a-good-test/

Reliability coefficient. (n.d.). AlleyDog.com. Retrieved from http://www.alleydog.com/glossary/definition.php?term=Reliability%20Coefficient#ixzz48EmyHlQe

Reynolds, C. R., Livingston, R. L., & Willson, V. L. (2009). Measurement and assessment in education (2nd ed.). Pearson Education.

Tarrant, M., Ware, J., & Mohammed, A. (2009). An assessment of functioning and non-functioning distractors in multiple-choice questions: A descriptive analysis. Retrieved from http://www.springerlink.com/content/e8k8618552465484/fulltext.pdf