• In today’s language classrooms, the term assessment usually evokes images of an end-of-course paper-and-pencil test designed to tell both teachers and students how much material the student doesn’t know or hasn’t yet mastered.
• In practice, however, assessment includes a broad range of activities and tasks that teachers use to evaluate students’ progress and growth on a daily basis.
To make the use of evaluation, assessment, and test procedures more effective, it is necessary to clarify what these concepts are and to explain how they differ from one another.
Evaluation is all-inclusive: it is the widest basis for collecting information in education. It involves looking at all factors that influence the learning process: syllabus, objectives, course design, and materials.
Assessment is part of evaluation because it is concerned with the student and with what the student does. It refers to the variety of ways of collecting information on a learner’s language ability or achievement. A test, in turn, is a subcategory of assessment: a formal, systematic procedure used to gather information about student progress.
The most common use of language tests is to identify strengths and weaknesses in students’ abilities. Information gleaned from tests also assists us in deciding who should be allowed to participate in a particular course or program. Another common use of tests is to provide information about the effectiveness of programs of instruction.
• Placement tests assess students’ level of language ability so that they can be placed in an appropriate course or class. This type of test indicates the level at which a student will learn most effectively. The primary aim is to create groups of learners that are homogeneous in level.
• Aptitude tests measure capacity or general ability to learn a foreign language (although they are not commonly used these days).
• Diagnostic tests identify the language areas in which a student needs further help. The information gained from diagnostic tests is crucial for planning further course activities and providing students with remediation.
• Progress tests measure the progress that students are making toward defined course or program goals. They are generally teacher-produced because they cover less material and assess fewer objectives.
• Achievement tests are similar to progress tests, but they are usually administered at the mid-point and end-point of the semester or academic year. Their content is generally based on the specific course content or on the course objectives.
• Proficiency tests assess the overall language ability of students at varying levels. They tell us how capable a person is in a particular language skill area.
• Objective versus subjective tests: tests are sometimes distinguished by the manner in which they are scored. An objective test is scored by comparing a student’s responses with an established set of acceptable/correct responses on an answer key, so the scorer does not require particular knowledge or training in the examined area.
• In contrast, a subjective test, such as an essay, requires scoring by opinion or personal judgment, so the human element is very important. Even experienced scorers need moderated training sessions to ensure inter-rater reliability.
Criterion-referenced tests versus standardized tests:
• Criterion-referenced tests are usually developed to measure mastery of well-defined instructional objectives specific to a particular course or program. Their purpose is to measure how much learning has occurred. A student’s performance is compared only to the amount or percentage of material learned.
• Standardized tests are designed to measure global language abilities. Students’ scores are interpreted relative to all other students who take the exam. Their purpose is to spread students out along a continuum of scores, so that those with low abilities in a certain skill fall at one end of the normal distribution and those with high abilities at the other, with the majority of students falling between the extremes.
Summative versus formative tests:
• Tests or tasks administered at the end of a course to determine whether students have achieved the objectives set out in the curriculum are called summative assessments. They are often used to decide which students move on to a higher level.
• Formative assessments, however, are carried out with the aim of using the results to improve instruction, so they are given during the course and feedback is provided to students.
High-stakes versus low-stakes tests:
• High-stakes tests are those whose results are likely to have a major impact on the lives of a large number of individuals or on large programs.
• Low-stakes tests are those whose results have a relatively minor impact on the lives of individuals or on small programs. In-class progress tests and short quizzes are examples of low-stakes tests.
Reliability
• Student-related reliability: temporary illness, fatigue, anxiety, and other physical or psychological factors.
• Rater reliability: human error, subjectivity, bias toward “good” and “bad” students, inexperience, inattention.
• Test administration reliability: the conditions of administration (noise, light, temperature, desks, chairs).
• Test reliability: the test itself (designed items, subjective test formats, rating, the test items themselves).
Validity
A valid test:
- Measures exactly what it proposes to measure.
- Involves performance that samples the test’s criterion.
- Offers useful, meaningful information about a test-taker’s abilities.
- Is supported by an argument.
There are four main types of validity:
- Content-related validity
- Criterion-related validity
- Construct-related validity
- Consequential validity (impact)
Writing instructional objectives
• Objectives include four distinct components: Audience, Behavior, Condition, and Degree.
• Objectives must be both observable and measurable to be effective.
• The use of words like “understand” and “learn” in writing objectives is generally not acceptable, as they are difficult to measure.
• Written objectives are a vital part of instructional design because they
provide the roadmap for designing and delivering curriculum.
• Throughout the design and development of curriculum, the content to be delivered should be compared to the objectives identified for the program. This process, called performance agreement, ensures that the final product meets the overall goal of instruction identified in the first-level objectives.
Audience
- Describes the intended learner or end user of the instruction.
- Often the audience is identified only in the first-level objective, to avoid redundancy.
Behavior
- Describes the learner capability.
- Must be observable and measurable (the measurement itself is defined elsewhere in the goal).
- If it is a skill, it should be a real-world skill.
- The “behavior” can include demonstration of knowledge or skills in any of the domains of learning: cognitive, psychomotor, affective, or interpersonal.
Condition
- Equipment or tools that may (or may not) be utilized in completion of the behavior.
- Environmental conditions may also be included.
Degree
- States the standard for acceptable performance (time, accuracy, proportion, quality, etc.).
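For instance, a hypothetical objective with all four components labeled might read: “Given a short recorded dialogue (Condition), second-year language students (Audience) will write down the speakers’ main points (Behavior) with at least 80% accuracy (Degree).”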
Common mistakes in test construction have been grouped into four categories as follows:
- General examination characteristics.
- Item characteristics.
- Test validity concerns.
- Administrative and scoring issues.
General examination characteristics:
• Too difficult or too easy
• Insufficient number of items
• Redundancy of test type
• Lack of confidence measures
• Negative washback through non-occurrent forms
Item characteristics:
• Tricky questions
• Redundant wording
• Divergence cues
• Convergence cues
• Option number
Test-validity concerns:
• Mixed content
• Wrong medium
• Common knowledge
• Syllabus mismatch
• Content matching
Administrative and scoring issues:
• Lack of cheating control
• Inadequate instruction
• Administrative inequities
• Lack of piloting
• Subjectivity of scoring
Traditional assessment
• Pencil-and-paper tests.
• Students answer questions, choosing or producing a correct grammatical form or vocabulary item.
• Good for checking reading and listening comprehension ability.
Alternative assessment
• Reveals what students can do with language.
• It is scored differently from traditional assessment.
• Students can evaluate their own learning and learn from the evaluation process.
• Gives instructors a way to connect assessment with a review of learning strategies.
Characteristics of alternative assessments:
- They are built around topics of interest to the students.
- They replicate real-world communication contexts and situations.
- They require students to produce a quality product or performance.
- The evaluation criteria and standards are known to the student.
- They involve multi-stage tasks and real problems that require creative use of language rather than simple repetition.
- They involve interaction between the assessor and the person assessed.
- They allow for self-evaluation.
Rubrics provide a measurement of the quality of performance on the basis of established criteria.
There are four main types of rubrics:
• Holistic rubrics
• Analytic rubrics
• Primary trait rubrics
• Multi-trait rubrics
In holistic evaluation, raters
make judgments by forming an overall impression of a performance and
matching it to the best fit from among the descriptions on the scale.
Advantages of holistic rubrics:
• They are often written generically and can be used with many tasks.
• They emphasize what learners can do, rather than what they cannot do.
• They save time by minimizing the number of decisions raters must make.
• Trained raters tend to apply them consistently, resulting in more reliable measurement.
• They are easily understood by younger learners.
Disadvantages:
• They do not provide specific feedback to test takers about the strengths and weaknesses of their performance.
• Performances may meet criteria in two or more categories, making it difficult to select the one best description. (If this occurs frequently, the rubric may be poorly written.)
Analytic scales are
usually associated with generic rubrics and tend to focus on broad dimensions of
writing or speaking performance. These dimensions may be the same as those found
in a generic, holistic scale, but they are presented in separate categories and rated
individually. Points may be assigned for performance on each of the dimensions and a
total score calculated.
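As a minimal sketch of how such a total might be computed (the dimension names, 0–4 band scale, and weights below are hypothetical, not taken from any particular published rubric):

```python
# Weighted analytic-rubric scoring: each dimension of a performance is
# rated separately, then combined into a single total.

# Ratings assigned to one performance, one per dimension (0-4 bands).
scores = {"content": 3, "organization": 4, "vocabulary": 2, "grammar": 3}

# Weights reflecting the relative importance of each dimension (sum to 1).
weights = {"content": 0.4, "organization": 0.2, "vocabulary": 0.2, "grammar": 0.2}

# Because the weights sum to 1, the total stays on the same 0-4 scale.
total = sum(scores[dim] * weights[dim] for dim in scores)
print(f"Weighted total: {total:.2f} / 4")  # -> Weighted total: 3.00 / 4
```

Equal weights reduce this to a simple average; weighting is what lets a rubric designer make, say, content count for more than grammar.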
Advantages of analytic rubrics:
• They provide useful feedback to learners on areas of strength and weakness.
• Their dimensions can be weighted to reflect relative importance.
• They can show learners that they have made progress over time in some or all dimensions when the same rubric categories are used repeatedly.
Disadvantages:
• They take more time to create and use.
Primary trait scoring would be strictly classified as task-specific, and performance would be evaluated on only one trait, such as “persuading an audience.”
Ex. Primary Trait: Persuading an audience
0 Fails to persuade the audience.
1 Attempts to persuade but does not provide sufficient support.
2 Presents a somewhat persuasive argument but without consistent development and support.
3 Develops a persuasive argument that is well developed and supported.
Multiple trait scoring rubrics are based on the concepts of primary trait scoring; they provide diagnostic feedback to learners about performance on “context-appropriate and task-appropriate criteria” for a specified topic.
Advantages of primary and multiple trait rubrics:
• The rubrics are aligned with the task and curriculum.
• Aligned and well-written primary and multiple trait rubrics can ensure construct and content validity of criterion-referenced assessments.
• Feedback is focused on one or more dimensions that are important in the current learning context.
• With a multiple trait rubric, learners receive information about their strengths and weaknesses.
• Primary and multiple trait rubrics are generally written in language that students understand.
• Teachers are able to rate performances quickly.
• Many rubrics of this type have been developed by teachers who are willing to share them online, at conferences, and in materials available for purchase.