ED 106 - Module 6
ASSESSMENT OF LEARNING 1
LANIE N. E. AVELINO, LPT, MA
Assessment of Learning 1 (Module_Ed 106) lnea 2022
ASSESSMENT OF LEARNING 1
MODULE SIX
PRELIMINARIES
COURSE DESCRIPTION/OVERVIEW
This course provides an examination of the uses of assessment practices and strategies to improve student
learning. Special emphasis will be placed on authentic assessment practices, standardized tests, and developmental
screenings. Additionally, students will become familiar with measures to assess learners with special needs and
learners from linguistically and culturally different backgrounds.
This course also focuses on the development and utilization of assessment tools to improve the teaching-
learning process. It emphasizes the use of testing for measuring knowledge, comprehension and other thinking skills.
It allows the students to go through the standard steps in test construction for quality assessment.
Pre-requisites: N/A
LEARNING OUTCOMES
This module aims to achieve the following learning outcomes:
• Explain the meaning of item analysis, item validity, reliability, item difficulty, and discrimination index
• Determine the validity and reliability of given test items
INDICATIVE CONTENT
MODULE SIX - ITEM ANALYSIS AND VALIDATION
DISCUSSION
Introduction
The teacher normally prepares a draft of the test. This draft is subjected to item analysis and validation to
ensure that the final version of the test is useful and functional. First, the teacher tries out the draft test with a
group of students whose characteristics are similar to those of the intended test takers (try-out phase). Using the
results from the try-out group, each item is analyzed in terms of its ability to discriminate between those who know
and those who do not know the material, and in terms of its level of difficulty (item analysis phase).
The item analysis provides information that allows the teacher to decide whether to revise or replace an item
(item revision phase). Finally, the final draft of the test is subjected to validation if the intent is to use
the test as a standard test for the particular unit or grading period. We shall be concerned with these concepts in this
module.
Lesson 1. Item Analysis
There are two important characteristics of an item that will be of interest to the teacher. These are: a) item difficulty,
and b) discrimination index. We shall learn how to measure these characteristics and apply our knowledge in making
a decision about the item in question.
The difficulty of an item, or item difficulty, is defined as the number of students who are able to answer the item
correctly divided by the total number of students. Thus:
Item difficulty = number of students with correct answer / total number of students
Example: What is the item difficulty index of an item if 25 students are unable to answer it correctly while 75
answered it correctly?
Here, the total number of students is 100; hence, the item difficulty index is 75/100 or 75%.
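The computation in the example above can be sketched in Python. The function name `item_difficulty` is hypothetical and chosen only for illustration; it simply applies the formula as stated.

```python
def item_difficulty(num_correct, num_total):
    """Proportion of students who answered the item correctly."""
    if num_total <= 0:
        raise ValueError("total number of students must be positive")
    if not 0 <= num_correct <= num_total:
        raise ValueError("number correct must be between 0 and the total")
    return num_correct / num_total


# Example from the text: 75 students correct, 25 students incorrect.
correct, incorrect = 75, 25
p = item_difficulty(correct, correct + incorrect)
print(f"Item difficulty index: {p:.2f}")  # prints 0.75, i.e. 75%
```

Note that under this definition a higher index means an easier item, since more students answered it correctly.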
One problem with this type of difficulty index is that it may not actually indicate that the item is difficult (or easy). A
student who does not know the subject matter will naturally be unable to answer the item correctly even if the
question is easy. How do we decide on the basis of this index whether the item is too difficult or too easy? The
following arbitrary rule is often used in the literature:
Difficult items tend to discriminate between those who know and those who do not know the answer. Conversely,
easy items cannot discriminate between these two groups of students. We are therefore interested in deriving a
measure that will tell us whether an item can discriminate between these two groups of students. Such a measure is
called an index of discrimination.
An easy way to derive such a measure is to measure how difficult an item is with respect to those in the upper 25%
of the class and how difficult it is with respect to those in the lower 25% of the class. If the upper 25% of the class
found the item easy yet the lower 25% found it difficult, then the item can discriminate properly between these two
groups. Thus:
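Index of discrimination = DU - DL
where DU and DL are the difficulty indices of the item for the upper 25% and the lower 25% of the class, respectively.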
Example: Obtain the index of discrimination of an item if the upper 25% of the class had a difficulty index of 0.60
(i.e., 60% of the upper 25% got the correct answer) while the lower 25% of the class had a difficulty index of 0.20.
Here, DU = 0.60 while DL = 0.20; thus, index of discrimination = .60 - .20 = .40.
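As a minimal sketch, the same subtraction can be written as a small Python function. The name `discrimination_index` is an illustrative assumption, not something defined in the module.

```python
def discrimination_index(du, dl):
    """Index of discrimination = DU - DL.

    du, dl: proportion of the upper 25% and of the lower 25% of the class
    who answered the item correctly (each between 0 and 1).
    """
    for value in (du, dl):
        if not 0.0 <= value <= 1.0:
            raise ValueError("difficulty indices must lie between 0 and 1")
    return du - dl


# Example from the text: DU = 0.60, DL = 0.20.
d = discrimination_index(0.60, 0.20)
print(f"Index of discrimination: {d:.2f}")  # prints 0.40
```

Because both inputs are proportions between 0 and 1, the result always falls between -1.0 and 1.0.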
Theoretically, the index of discrimination can range from -1.0 (when DU = 0 and DL = 1) to 1.0 (when DU = 1 and DL = 0). When the index of
discrimination is equal to -1, then this means that all of the lower 25% of the students got the correct answer while
all of the upper 25% got the wrong answer. In a sense, such an index discriminates correctly between the two groups
but the item itself is highly questionable. Why should the bright ones get the wrong answer and the poor ones get the
right answer? On the other hand, if the index of discrimination is 1.0, then this means that all of the lower 25% failed
to get the correct answer while all of the upper 25% got the correct answer. This is a perfectly discriminating item
and is the ideal item that should be included in the test. From these discussions, let us agree to discard or revise all
items that have a negative discrimination index, for although they discriminate correctly between the upper and lower
25% of the class, the content of the item itself may be highly dubious. As in the case of the index of difficulty, we
have the following rule of thumb:
Example: Consider a multiple-choice type of test of which the following data were obtained:
Item 1
Options      A     B*    C     D
Total        0     40    20    20
Upper 25%    0     15    5     0
Lower 25%    0     5     10    5
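The correct response is B. Let us compute the difficulty index and index of discrimination:
Difficulty index = no. of students with correct response / total no. of students
= 40/80 = .50 or 50%
DU = no. of students in upper 25% with correct response / no. of students in the upper 25%
= 15/20 = .75 or 75%
DL = no. of students in lower 25% with correct response / no. of students in the lower 25%
= 5/20 = .25 or 25%
Index of discrimination = DU - DL = .75 - .25 = .50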
It is also instructive to note that distracter A is not an effective distracter since it was never selected by the
students. Distracters C and D appear to have good appeal as distracters.
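The worked example above can be reproduced from the option counts with a short Python sketch. The function `analyze_item` and its dictionary layout are assumptions made for illustration; the counts are those given in the table for Item 1.

```python
def analyze_item(total, upper, lower, key):
    """Compute item statistics from option counts.

    total, upper, lower: dicts mapping option label -> number of students
    choosing it (whole try-out group, upper 25%, lower 25%).
    key: label of the correct option.
    """
    du = upper[key] / sum(upper.values())
    dl = lower[key] / sum(lower.values())
    return {
        "difficulty": total[key] / sum(total.values()),
        "DU": du,
        "DL": dl,
        "discrimination": du - dl,
        # A distracter that nobody selected is not working as a distracter.
        "unused_distracters": [opt for opt, n in total.items()
                               if opt != key and n == 0],
    }


# Counts for Item 1; the correct response is B.
total = {"A": 0, "B": 40, "C": 20, "D": 20}
upper = {"A": 0, "B": 15, "C": 5, "D": 0}
lower = {"A": 0, "B": 5, "C": 10, "D": 5}

result = analyze_item(total, upper, lower, key="B")
for name, value in result.items():
    print(f"{name}: {value}")
```

With these counts the sketch reports DU = 0.75, DL = 0.25, and option A as the only distracter that was never chosen.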
The Michigan State University Measurement and Evaluation Department reports a number of item statistics which aid
in evaluating the effectiveness of an item. The first of these is the index of difficulty, which MSU defines as the
proportion of the total group who got the item wrong. Thus, a high index indicates a difficult item and a low index
indicates an easy item. Note that this is the reverse of the item difficulty defined earlier in this lesson, which uses
the proportion who got the item right. Some item analysts prefer an index of difficulty which is the proportion of the
total group who got an item right. This index may be obtained by marking the PROPORTION RIGHT option on the item
analysis header sheet. Whichever index is selected is shown as the INDEX OF DIFFICULTY on the item analysis print-out.
For classroom achievement tests, most test constructors desire items with indices of difficulty no lower than 20 nor
higher than 80, with an average index of difficulty from 30 or 40 to a maximum of 60.
The INDEX OF DISCRIMINATION is the difference between the proportion of the upper group who got an item right
and the proportion of the lower group who got the item right. This index is dependent upon the difficulty of an item.
It may reach a maximum value of 100 for an item with an index of difficulty of 50, that is, when 100% of the upper
group and none of the lower group answer the item correctly. For items of less than or greater than 50 difficulty, the
index of discrimination has a maximum value of less than 100. The document "Interpreting the Index of Discrimination"
contains a more detailed discussion of the index of discrimination.
The item discrimination index provided by ScorePak® is a Pearson Product Moment correlation between student
responses to a particular item and total scores on all other items on the test. This index is the equivalent of a
point-biserial coefficient in this application. It provides an estimate of the degree to which an individual item is
measuring the same thing as the rest of the items.
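The correlation described above can be sketched in plain Python. This is not ScorePak's implementation; the helper names `pearson` and `item_rest_correlation` and the response data are made up for illustration, assuming each item is scored 1 (right) or 0 (wrong).

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def item_rest_correlation(responses, item):
    """Correlate one item's 0/1 scores with the total on all other items."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)


# Hypothetical data: rows are students, columns are items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
r = item_rest_correlation(responses, item=0)
print(f"Item-rest correlation for the first item: {r:.2f}")
```

The item's own score is left out of the total so that the item is not correlated with itself, which matches the "all other items" wording in the description.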
Because the discrimination index reflects the degree to which an item and the test as a whole are measuring a unitary
ability or attribute, values of the coefficient will tend to be lower for tests measuring a wide range of content areas
than for more homogeneous tests. Item discrimination indices must always be interpreted in the context of the type of
test which is being analyzed. Items with low discrimination indices are often ambiguously worded and should be
examined. Items with negative indices should be examined to determine why a negative value was obtained. For
example, a negative value may indicate that the item was miskeyed, so that students who knew the material tended
to choose an unkeyed, but correct, response option.
Tests with high internal consistency consist of items with mostly positive relationships with total test score. In
practice, values of the discrimination index will seldom exceed .50 because of the differing shapes of item and
total score distributions. ScorePak® classifies item discrimination as "good" if the index is above .30; "fair" if it is
between .10 and .30; and "poor" if it is below .10.
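A minimal Python sketch of the classification quoted above follows. The function name is hypothetical, and treating the boundary values .10 and .30 as "fair" is an assumption, since the text does not say which label the exact cut-off values receive.

```python
def classify_discrimination(index):
    """Label a discrimination index using the cut-offs quoted above.

    Assumption: exactly .10 and exactly .30 are both treated as "fair".
    """
    if index > 0.30:
        return "good"
    if index >= 0.10:
        return "fair"
    return "poor"


for value in (0.45, 0.30, 0.10, 0.05, -0.20):
    print(f"{value:+.2f} -> {classify_discrimination(value)}")
```

Negative indices fall under "poor" here, but as noted earlier such items deserve a closer look because they may be miskeyed.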
A good item is one that has good discriminating ability and a sufficient level of difficulty (neither too difficult nor
too easy). In the two tables presented for the levels of difficulty and discrimination, there is a small area of
intersection where the two indices coincide (between 0.56 and 0.67), which represents the good items in a test.
At the end of the item analysis report, test items are listed according to their degrees of difficulty (easy, medium,
hard) and discrimination (good, fair, poor). These distributions provide a quick overview of the test and can be used to
identify items which are not performing well and which can perhaps be improved or discarded.
Summary
Index of Difficulty
P = (RU + RL) / T x 100
Where:
RU - the number in the upper group who answer the item correctly.
RL - the number in the lower group who answer the item correctly.
T - the total number who try the item.
Example: P = 8/20 x 100 = 40%
The discriminating power of an item is reported as a decimal fraction; maximum discriminating power is indicated by
an index of 1.00.
Lesson 2. Validation
After performing the item analysis and revising the items which need revision, the next step is to validate the
instrument. The purpose of validation is to determine the characteristics of the whole test itself, namely, the validity
and reliability of the test. Validation is the process of collecting and analyzing evidence to support the meaningfulness
and usefulness of the test.
Validity. Validity is the extent to which a test measures what it purports to measure. It may also be defined as the
appropriateness, correctness, meaningfulness and usefulness of the specific decisions a teacher makes based on the
test results. These two definitions of validity differ in the sense that the first definition refers to the test itself while
the second refers to the decisions made by the teacher based on the test. A test is valid when it is aligned to the
learning outcome.
A teacher who conducts test validation might want to gather different kinds of evidence. There are essentially three
main types of evidence that may be collected: content-related evidence of validity, criterion-related evidence of
validity and construct-related evidence of validity. Content-related evidence of validity refers to the content and
format of the instrument. How appropriate is the content? How comprehensive is it? Does it logically get at the intended
variable? How adequately does a sample of items or questions represent the content to be assessed?
Criterion-related evidence of validity refers to the relationship between scores obtained using the instrument and
the scores obtained using one or more other tests (often called the criterion). How strong is this relationship? How well
do such scores estimate present or predict future performance of a certain type?
Construct-related evidence of validity refers to the nature of psychological construct or characteristic being measured
by the test. How well does a measure of the construct explain differences in the behavior of the individuals or their
performance on a certain task?
The usual procedure for determining content validity may be described as follows: The teacher writes out the
objectives of the test based on the table of specifications and then gives these, together with the test, to at least two (2)
experts along with a description of the intended test takers. The experts look at the objectives, read over the items in
the test and place a check mark in front of each question or item that they feel does not measure one or more
objectives. They also place a check mark in front of each objective not assessed by any item in the test. The teacher
then rewrites any item so checked and resubmits it to the experts and/or writes new items to cover those objectives not
heretofore covered by the existing test. This continues until the experts approve of all items and also until the experts
agree that all of the objectives are sufficiently covered by the test.
In order to obtain evidence of criterion-related validity, the teacher usually compares scores on the test in question
with the scores on some other independent criterion test which presumably already has high validity. For example, if
a test is designed to measure the mathematics ability of students and it correlates highly with a standardized
mathematics achievement test (external criterion), then we say we have high criterion-related evidence of validity. In
particular, this type of criterion-related validity is called concurrent validity. Another type of criterion-related
validity is called predictive validity, wherein the test scores on the instrument are correlated with scores on a later
performance (criterion measure) of the students. For example, the mathematics ability test constructed by the teacher
may be correlated with the students' later performance in a division-wide mathematics achievement test.
Apart from the use of the correlation coefficient in measuring criterion-related validity, Gronlund suggested using the
so-called expectancy table. This table is easy to construct and consists of the test (predictor) categories listed on the
left-hand side and the criterion categories listed horizontally along the top of the chart. For example, suppose that a
mathematics achievement test is constructed and the scores are categorized as high, average and low. The criterion
measure used is the final average grades of the students in high school: Very Good, Good, and Needs Improvement.
The two-way table lists down the number of students falling under each of the possible pairs of (test, grade) as
shown below:
Test Score    Very Good    Good    Needs Improvement
High          20           10      5
Average       10           25      5
Low           1            10      14
The expectancy table shows that there were 20 students who got high test scores and were subsequently rated very good
in terms of their final grades; 25 students who got average scores and were subsequently rated good in their finals; and
finally, 14 students who obtained low test scores and were later graded as needing improvement. The evidence for this
particular test tends to indicate that students getting high scores on it would be graded very good; students getting
average scores would be rated good later; and students getting low scores would be graded as needing improvement later.
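Constructing an expectancy table amounts to counting students in each (test level, grade level) pair. The Python sketch below rebuilds the table above from student-level pairs and converts each row to percentages; the function names are illustrative assumptions, and the percentage step is an optional extra that is not described in the text.

```python
from collections import Counter

TEST_LEVELS = ["high", "average", "low"]
GRADE_LEVELS = ["Very Good", "Good", "Needs Improvement"]


def expectancy_table(pairs):
    """Count students for every (test level, grade level) combination."""
    counts = Counter(pairs)
    return {t: {g: counts[(t, g)] for g in GRADE_LEVELS} for t in TEST_LEVELS}


def row_percentages(table):
    """Express each row of counts as percentages of that row's total."""
    result = {}
    for level, row in table.items():
        total = sum(row.values())
        result[level] = {g: (100 * n / total if total else 0.0)
                         for g, n in row.items()}
    return result


# Rebuild one (test level, grade level) pair per student from the table counts.
counts = {"high": [20, 10, 5], "average": [10, 25, 5], "low": [1, 10, 14]}
pairs = [(t, g) for t, row in counts.items()
         for g, n in zip(GRADE_LEVELS, row) for _ in range(n)]

table = expectancy_table(pairs)
for level, row in row_percentages(table).items():
    cells = ", ".join(f"{g}: {pct:.0f}%" for g, pct in row.items())
    print(f"{level:8s} {cells}")
```

Reading a row as percentages answers the question "of the students at this test level, what share ended up in each grade category?", which is how the table supports prediction.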
We will not be able to discuss the measurement of construct-related validity in this module since the methods to be used
require sophisticated statistical techniques falling in the category of factor analysis.
Lesson 3. Reliability
Reliability refers to the consistency of the scores obtained: how consistent they are for each individual from one
administration of an instrument to another and from one set of items to another. We already gave the formulas for
computing the reliability of a test; for internal consistency, for instance, we could use the split-half method or the
Kuder-Richardson formulae (KR-20 or KR-21).
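The formulas themselves are not reproduced in this module, so the following is only a sketch of the standard textbook forms of KR-20 and KR-21 for items scored 0 or 1. The function names and data are hypothetical, and the use of the population variance (dividing by n) is an assumption; some texts divide by n - 1 instead.

```python
def _population_variance(values):
    """Variance computed with n in the denominator (an assumed convention)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def kr20(responses):
    """KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores)."""
    n, k = len(responses), len(responses[0])
    var_total = _population_variance([sum(row) for row in responses])
    if k < 2 or var_total == 0:
        raise ValueError("need at least 2 items and non-constant total scores")
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n  # proportion correct on item i
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def kr21(total_scores, k):
    """KR-21 = k/(k-1) * (1 - M(k - M) / (k * variance)); assumes items of equal difficulty."""
    mean = sum(total_scores) / len(total_scores)
    var_total = _population_variance(total_scores)
    if k < 2 or var_total == 0:
        raise ValueError("need at least 2 items and non-constant total scores")
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * var_total))


# Hypothetical data: rows are students, columns are items (1 = right, 0 = wrong).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in responses]
print(f"KR-20: {kr20(responses):.3f}")
print(f"KR-21: {kr21(totals, k=4):.3f}")
```

KR-21 needs only the total scores and the number of items, which makes it convenient for classroom use, but it is never larger than KR-20 on the same data and equals it only when all items are equally difficult.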
Reliability and validity are related concepts. If an instrument is unreliable, it cannot yield valid outcomes. As reliability
improves, validity may improve (or it may not). However, if an instrument is shown scientifically to be valid, then it is
almost certain that it is also reliable.
The following table is a standard followed almost universally in educational tests and measurement.
Reliability      Interpretation
.90 and above    Excellent reliability; at the level of the best standardized tests.
.70-.80          Good for a classroom test; in the range of most classroom tests. There are probably a few items
                 which could be improved.
.60-.70          Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to
                 determine grades. There are probably some items which could be improved.
.50-.60          Suggests need for revision of the test, unless it is quite short (10 or fewer items). The test
                 definitely needs to be supplemented by other measures (e.g., more tests) for grading.
.50 or below     Questionable reliability. This test should not contribute heavily to the course grade, and it
                 needs revision.
ACTIVITY
1. A teacher constructed a test which would measure the student’s ability to apply previous knowledge to certain
situations. In particular, the evidence that the student is able to apply previous knowledge is the ability to:
• Draw correct conclusions that are based on the information given;
• Identify one or more logical implications that follow from a given point of view;
• State whether two ideas are identical, just similar, unrelated or contradictory.
Write test items using the multiple-choice type of test that would cover these concerns of the teacher. Show your
test to an expert and ask him or her to judge whether the items indeed cover these concerns.
2. What is an expectancy table? Describe the process of constructing an expectancy table. When do we use an
expectancy table?
3. Enumerate the three types of validity evidence. Which of these types of validity is the most difficult to measure?
Why?
4. What is the relationship between validity and reliability? Can a test be reliable and yet not valid? Illustrate.
5. Discuss the different measures of reliability. Justify the use of each measure in the context of measuring reliability.
REFERENCES
Navarro, Rosita, PhD; Santos, Rosita, PhD; Corpuz, Brenda, PhD (2017). Assessment of Learning 1.
De Guzman, Estefania S., PhD, et al. (2015). Assessment of Learning 1.