Module in Assessment in Learning 1 Final
Assessment in Learning 1
The topics in this course are divided into three modules, each module with three
lessons for a total of nine (9) lessons to cover all the suggested topics to be discussed in
the class given the target learning outcomes of the course.
The second module focuses on the Process in the Development and Administration
of Tests. This module has three lessons as well: first is Planning a Written Test, which
could enable teachers to design a good test table of specifications; second is Construction
of a Written Test, which sets the guidelines in constructing different test formats; and third is
Establishing Test Validity and Reliability, which provides the necessary input to ensure that
the test constructed measures what it intends to measure and provides reliable results.
This compilation of modules has been designed to have the following features: a)
Outcome-based; b) PSG-aligned; c) Standards-based; d) 21st Century skills- and
strategies-focused; and e) Whole-child sensitive.
Each lesson in the three modules has been designed to follow the UPDATERS
Framework, where each letter has the following meaning and feature: U – understand; P –
prepare; D – develop; A – apply; T – transfer; E – evaluate; R – reflect; and S – sustain.
It is hoped that the features of this module will make your learning of the first part of
the assessment course, that is, Assessment in Learning 1, meaningful, engaging, and
challenging. Your learning in this course will be a good foundation for the course
Assessment in Learning 2.
The Author
TABLE OF CONTENTS
Title Page
Copyright Page
Preface
What are the general guidelines in choosing the appropriate test format?
What are the major categories and formats of traditional tests?
What are the general guidelines in writing multiple-choice test items?
What are the general guidelines in writing matching-type items?
What are the general guidelines in writing true or false items?
What are the general guidelines in writing short-answer test items?
What are the general guidelines in writing essay tests?
What are the general guidelines in problem-solving test items?
Lesson 6: Establishing Test Validity and Reliability
What are the purposes of grading and reporting learners' test performance?
What are the different methods in scoring tests or performance tasks?
What are the different types of test scores?
What are the general guidelines in grading tests or performance tasks?
What are the general guidelines in grading essay tests?
What is the new grading system of the Philippine K-12 Program?
How should test results be communicated to different stakeholders?
References
Module 1
Overview
The word assessment is rooted in the Latin word “assidere” which means “to sit
beside another.” Assessment is generally defined as the process of gathering quantitative
and/or qualitative data for the purpose of making decisions.
The most common form of assessment is testing. In the educational context, testing
refers to the use of a test or battery of tests to collect information on student learning over
a specific period of time. A test can be categorized as either selected response (e.g., a
matching-type test) or constructed response (e.g., an essay test or short-answer test). A test
can make use of an objective format (multiple choice, enumeration) or a subjective format
(essay). To be good and effective, a test should be valid, reliable, have an acceptable level of
difficulty, and be able to discriminate between learners with higher and lower ability. Teachers
are expected to be competent in the design and development of classroom tests.
Assessment and Grading
The two most common psychometric theories that serve as frameworks for
assessment and measurement, especially in the determination of the psychometric
characteristics of a measure (tests, scales), are the classical test theory (CTT) and the
item response theory (IRT).
The CTT (also known as true score theory) explains that variations in the performance of
examinees on a given measure are due to variations in their abilities, with some
degree of error in the measurement caused by internal and external
conditions. Hence, the CTT also assumes that all measures are imperfect, and the score
obtained from a measure could differ from the true score (or true ability) of an examinee.
The CTT provides an estimation of the item difficulty based on the frequency or
number of examinees who correctly answer a particular item; items with fewer number of
examinees with correct answers are considered more difficult. The CTT also provides an
estimation of item discrimination based on the number of examinees with higher or lower
ability to answer a particular item. If an item is able to distinguish between examinees with
higher ability (higher total test score) and lower ability (lower test score) then an item is
considered to have good discrimination. Test reliability can also be estimated using
approaches from CTT (Kuder-Richardson 20, Cronbach's alpha). Item analysis based on
CTT has been the dominant approach because of the simplicity of calculating the statistics
(item difficulty index, item discrimination index, and item-total correlation).
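To make these indices concrete, here is a minimal sketch of CTT item difficulty and KR-20 reliability; the 0/1 response matrix (rows = examinees, columns = items) is hypothetical, and the population variance of total scores is used, as is common for KR-20:

```python
# Hypothetical 0/1 responses: rows = examinees, columns = items (1 = correct).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
n = len(responses)                        # number of examinees
k = len(responses[0])                     # number of items
totals = [sum(row) for row in responses]  # each examinee's total score

# Item difficulty: proportion answering each item correctly (lower p = harder).
p = [sum(row[j] for row in responses) / n for j in range(k)]

# KR-20 reliability: (k / (k - 1)) * (1 - sum(p*q) / variance of total scores).
mean_total = sum(totals) / n
var_total = sum((t - mean_total) ** 2 for t in totals) / n
sum_pq = sum(pj * (1 - pj) for pj in p)
kr20 = (k / (k - 1)) * (1 - sum_pq / var_total)
print(p, round(kr20, 2))  # [0.8, 0.6, 0.2, 0.8] and 0.41
```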
The IRT, on the other hand, analyzes test items by estimating the probability that an
examinee answers an item correctly or incorrectly. One of the central differences of IRT
from CTT is that in IRT, the characteristic of an item can be estimated independently of the
characteristic or ability of the examinee and vice-versa. Aside from item difficulty and item
discrimination indices, IRT analysis can provide significantly more information on items
and tests, such as fit statistics, item characteristic curve (ICC), and test characteristic
curve (TCC).
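As an illustration, here is a minimal sketch of the simplest member of the IRT family, the one-parameter (Rasch) model, which gives the probability of a correct answer as a function of examinee ability (theta) and item difficulty (b); the theta and b values below are hypothetical, on the usual logit scale:

```python
import math

def p_correct(theta: float, b: float) -> float:
    # Rasch model: probability rises as ability exceeds item difficulty.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# The same item (b = 0.5) answered by low-, average-, and high-ability examinees.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_correct(theta, b=0.5), 2))
```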
Overview:
In the example, differentiate is the verb that represents the type of cognitive process
(in this case, analyze), while qualitative research and quantitative research is the noun
phrase that represents the type of knowledge (in this case, conceptual).
Tables 2.2 and 2.3 present the definition, illustrative verbs, and sample objectives of
the cognitive process dimensions and knowledge dimensions of the Revised Bloom's
Taxonomy.
Learning Targets
There are two types of tests based on how the scores are
interpreted: norm-referenced and criterion-referenced tests.
A criterion-referenced test has a given set of standards, and the
scores are compared to the given criterion. For example, in a 50-
item test: 40-50 is very high, 30-39 is high, 20-29 is average,
10-19 is low, and 0-9 is very low. One approach in criterion-
referenced interpretation is that the score is compared to a specific
cutoff. An example is the grading in schools where the range of
grades 96-100 is highly proficient, 90-95 is proficient, 80-89 is
nearly proficient, and below 80 is beginning. The norm-
referenced test interprets results using the distribution of scores of a
sample group. The mean and standard deviation are computed for
the group. The standing of every individual in a norm-referenced
test is based on how far they are from the mean and standard
deviation of the sample. Standardized tests usually interpret scores
using a norm set from a large sample. Having an established norm
for a test means obtaining the normal or average performance in
the distribution of scores. A normal distribution is approximated by
increasing the sample size. A norm is a standard and is based on a
very large group of samples. Norms are reported in the manual of
standardized tests. A normal distribution found in the manual takes
the shape of a bell curve. It shows the number of people within a
range of scores. It also reports the percentage of people with
particular scores. The norm is used to convert a raw score into
standard scores for interpretability.
Classification                  Types
----------------------------------------------------------------------
Purpose                         Educational, Psychological
Form                            Paper and Pencil, Performance-based
Function                        Teacher-made, Standardized
Kind of Learning                Achievement, Aptitude
Ability                         Speed, Power
Interpretation of Learning      Norm-referenced, Criterion-referenced
Desired Significant Learning Outcomes: In this lesson you are expected to:
1. set appropriate instructional objectives for a written test; and
2. prepare a Table of Specifications for a written test.
To be able to learn or enhance your skills in planning a good classroom test, you
need to review your knowledge of lesson plan development, constructive alignment, and
different test formats. It is suggested that you read books and other references in print or
online that could help you design a good written test.
Why do you need to define the test objectives or learning outcomes targeted for
assessment?
In designing a well-planned written test, first and foremost, you should be able to
identify the intended learning outcomes in a course, where a written test is an appropriate
method to use. These learning outcomes are knowledge, skills, attitudes, and values that
every student should develop throughout the course. Clear articulation of learning
outcomes is a primary consideration in lesson planning because it serves as the basis for
evaluating the effectiveness of the teaching and learning process determined through
testing or assessment. Learning objectives or outcomes are measurable statements that
articulate, at the beginning of the course, what students should know and be able to do or
value as a result of taking the course. These learning goals provide the rationale for the
curriculum and instruction. They provide teachers the focus and direction on how the
course is to be handled, particularly in terms of course content, instruction, and
assessment. On the other hand, they provide the students with the reasons and
motivation to study and persevere. They give students the
opportunities to be aware of what they need to do to be successful in the course, take
control and ownership of their progress, and focus on what they should be learning.
Setting objectives for assessment is the process of establishing direction to guide both the
teacher in teaching and the
student in learning.
A table of specifications (TOS):
• Ensures that the instructional objectives and what the test captures match
• Ensures that the test developer will not overlook details that are considered essential to a good test
• Makes developing a test easier and more efficient
• Ensures that the test will sample all important content areas and processes
• Is useful in planning and organizing
• Offers an opportunity for teachers and students to clarify achievement expectations
1. Determine the objectives of the test. The first step is to identify the test objectives.
These should be based on the instructional objectives. In general, the instructional objectives
or the intended learning outcomes are identified at the start, when the teacher creates the
course syllabus. There are three types of objectives: (1) cognitive, (2) affective, and (3)
psychomotor. Cognitive objectives are designed to increase an individual's knowledge,
understanding, and awareness. On the other hand, affective objectives aim to change an
individual's attitude into something desirable, while psychomotor objectives are designed
to build physical or motor skills. When planning for assessment, choose only the objectives
that can be best captured by a written test. There are objectives that are not meant for a
written test. For example, if you test the psychomotor domain, it is better to do a
performance-based assessment. There are also cognitive objectives that are sometimes
better assessed through performance-based assessment. Those that require the
demonstration or creation of something tangible, like projects, would also be more
appropriately measured by performance-based assessment. For a written test, you can
consider cognitive objectives, ranging from remembering to creating ideas, that could
be measured using common formats for testing, such as multiple choice, alternative
response tests, matching type, and even essays or open-ended tests.
2. Determine the coverage of the test. The next step in creating the TOS is to determine
the contents of the test. Only topics or contents that have been discussed in class and are
relevant should be included in the test.
3. Calculate the weight for each topic. Once the test coverage is determined, the weight
of each topic covered in the test is determined. The weight assigned per topic in the test is
based on the relevance of and the time spent to cover each topic during instruction. The
percentage of time for a topic in a test is determined by dividing the time spent for that
topic during instruction by the total amount of time spent for all topics covered in the test.
For example, for a test on the Theories of Personality for a General Psychology 101 class,
the teacher spent varying amounts of class time on each topic. As such, the weight for each
topic is as follows:
4. Determine the number of items for the whole test. To determine the number of items
to be included in the test, the amount of time needed to answer the items is considered.
As a general rule, students are given 30-60 seconds for each item in test formats with
choices. For a one-hour class, this means that the test should not exceed 60 items.
However, because you also need to give time for test paper/booklet distribution and giving
instructions, the number of items should be fewer, maybe just 50 items.
5. Determine the number of items per topic. To determine the number of items per topic,
the weights per topic are considered. Thus, using the example above, for a 50-item final
test, Theories & Concepts, Humanistic Theories, Cognitive Theories, Behavioral Theories,
and Social Learning Theories will have 5 items each, Trait Theories 10 items, and
Psychoanalytic Theories 15 items.
Topic                          Percentage of Time (Weight)   No. of Items
--------------------------------------------------------------------------
Theories & Concepts                      10.0                      5
Psychoanalytic Theories                  30.0                     15
Trait Theories                           20.0                     10
Humanistic Theories                      10.0                      5
Cognitive Theories                       10.0                      5
Behavioral Theories                      10.0                      5
Social Learning Theories                 10.0                      5
--------------------------------------------------------------------------
Total                                   100.0                     50
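A minimal sketch of steps 3 to 5 follows; the hours per topic are hypothetical but chosen to reproduce the weights in the table above:

```python
# Hypothetical instructional hours per topic (15 hours total).
hours = {
    "Theories & Concepts": 1.5,
    "Psychoanalytic Theories": 4.5,
    "Trait Theories": 3.0,
    "Humanistic Theories": 1.5,
    "Cognitive Theories": 1.5,
    "Behavioral Theories": 1.5,
    "Social Learning Theories": 1.5,
}
total_items = 50
total_hours = sum(hours.values())

for topic, h in hours.items():
    weight = h / total_hours               # step 3: percentage of time
    n_items = round(weight * total_items)  # step 5: items per topic
    print(f"{topic}: {weight:.0%} -> {n_items} items")
```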
There are three (3) types of TOS: (1) one-way, (2) two-way, and (3) three-way.
1. One-Way TOS. A one-way TOS maps out the content or topic, test objectives, number
of hours spent, and format, number, and placement of items. This type of TOS is easy to
develop and use because it just works around the objectives without considering the
different levels of cognitive behaviors. However, a one-way TOS cannot ensure that all
levels of cognitive behaviors that should have been developed by the course are covered
in the test.
2. Two-Way TOS. A two-way TOS reflects not only the content, time spent, and number
of items but also the levels of cognitive behavior targeted per test content based on the
theory behind cognitive testing. For example, the common framework for testing at
present in the DepEd Classroom Assessment Policy is the Revised Bloom's Taxonomy
(DepEd, 2015). One advantage of this format is that it allows one to see the levels of
cognitive skills and dimensions of knowledge that are emphasized by the test. It also
shows the framework of assessment used in the development of the test. However, this
format is more complex than the one-way format.
3. Three-way TOS. This type of TOS reflects the features of one-way and two-way TOS.
One advantage of this format is that it challenges the test writer to classify objectives
based on the theory behind the assessment. It also shows the variability of thinking skills
targeted by the test. However, it takes much longer to develop this type of TOS.
Activity 4:
Below are sets of competencies targeted for instruction taken from a particular subject
area in the K-12 curriculum. Check the assessment method appropriate for the given
competencies.
Sample 1 in Mathematics: Check the competencies appropriate for the given test
format/method
Sample 2 in Science: Check the competencies appropriate for the given test
format/method
Sample 3 in Language: Check the competencies appropriate for the given test
format/method
What are the general guidelines in choosing the appropriate test format?
Not every test is universally valid for every type of learning outcome. For example, if
an intended outcome for a Research Method 1 course is "to design and produce a
research study relevant to ones field of study”, you cannot measure this outcome through
a multiple-choice test or a matching-type test.
To guide you on choosing the appropriate test format and designing fair and
appropriate yet challenging tests, you should ask the following important questions:
1. What are the objectives or desired learning outcomes of the subject/unit/lesson being assessed?
Deciding on what test format to use generally depends on your learning objectives
or the desired learning outcomes of the subject/unit/lesson. Desired learning outcomes
(DLOs) are statements of what learners are expected to do or demonstrate as a result of
engaging in the learning process.
The assessment tasks should be aligned with the instructional activities and the
DLOs. Thus, it is important that you are clear about what DLOs are to be addressed by
your test and what course activities or tasks are to be implemented to achieve the DLOs.
For example, if you want learners to articulate and justify their stand on ethical
decision-making and social responsibility practices in business (i.e., the DLO), then an essay
test and a class debate are appropriate measures and tasks for this learning outcome. A
multiple-choice test may be used, but only if you intend to assess learners' ability to
recognize what is ethical versus unethical decision-making practice. In the same manner,
matching-type items may be appropriate if you want to know whether your students can
differentiate and match the different approaches or terms to their definitions.
Stem:
Faulty: Read each question and indicate your answer by shading the circle corresponding
to your answer.
Good: This test consists of two parts. Part A is a reading comprehension test and Part B is
a grammar/language test. Each question is a multiple-choice test item with five (5)
options. You are to answer each question but will not be penalized for a wrong
answer or for guessing. You can go back and review your answers during the time
allotted.
2. Write stems that are consistent in form and structure, that is, present items either in
question form or in descriptive or declarative form.
Faulty: (1) Who was the Philippine president during Martial Law?
(2) The first president of the Commonwealth of the Philippines was ______.
Good: (1) Who was the Philippine president during Martial Law?
(2) Who was the first president of the Commonwealth of the Philippines?
3. Word the stem positively and avoid double negatives, such as NOT and EXCEPT, in a
stem. If a negative word is necessary, underline or capitalize the word for emphasis.
4. Refrain from making the stem too wordy or containing too much information, unless the
problem/question requires the facts presented to solve the problem.
Faulty: What does DNA stand for, and what is the organic chemical of complex molecular
structure found in all cells and viruses that codes genetic information for the transmission
of inherited traits?
Good: As a chemical compound, what does DNA stand for?
Options:
1. Provide three (3) to five (5) options per item, with only one being the correct or best
answer/alternative.
2. Write options that are parallel or similar in form and length to avoid giving clues about
the correct answer.
6. Avoid "All of the above" as an option, especially if it is intended to be the correct answer.
7. Make all options realistic and reasonable.
The matching test item format requires learners to match a word, sentence, or
phrase in one column (i.e., premise) to a corresponding word, sentence, or phrase in a
second column (i.e., response). It is most appropriate when you need to measure the
learners' ability to identify the relationship or association between similar items. It works
best when the course content has many parallel concepts. While the matching-type test format
is generally used for simple recall of information, you can find ways to make it applicable
or useful in assessing higher levels of thinking, such as applying and analyzing.
The following are the general guidelines in writing good and effective
matching-type tests:
1. Clearly state in the directions the basis for matching the stimuli with the responses.
2. Ensure that the stimuli are longer and the responses are shorter.
3. For each item, include only topics that are related with one another and share the same
foundation of information.
4. Make the response options short, homogeneous, and arranged in logical order.
5. Include response options that are reasonable and realistic and similar in length and
grammatical form.
6. Provide more response options than the number of stimuli.
There are different variations of the true or false items. These include the following:
2. Yes-No variation. In this format, the learner has to choose yes or no, rather than true or
false.
3. A-B Variation. In this format, the learner has to choose A or B, rather than true or false.
Because true or false test items are prone to guessing, as learners are asked to
choose between two options, utmost care should be exercised in writing true or false
items. The following are the general guidelines in writing true or false items.
Double negatives are sometimes confusing and could result in wrong answers, not
because the learner does not know the answer but because of how the test items
are presented.
7. Avoid lifting statements from the textbook and other learning materials.
The following are the general guidelines in writing good fill-in-the-blank or completion test
items:
2. Do not omit too many words from the statement such that the intended meaning is lost.
6. If possible, put the blank at the end of a statement rather than at the beginning.
What are the general guidelines in writing essay tests?
Teachers generally choose and employ essay tests over other forms of assessment
because essay tests require learners to create a response rather than to simply select a
response from among alternatives. They are the preferred form of assessment when
teachers want to measure learners' higher-order thinking skills, particularly their ability to
reason, analyze, synthesize, and evaluate. They also assess learners' writing abilities.
They are most appropriate for assessing learners' (1) understanding of subject-matter
content, (2) ability to reason with their knowledge of the subject, and (3) problem-solving
and decision-making skills, because items or situations presented in the test are authentic
or close to real-life experiences.
There are two types of essay test: (1) extended-response essay, and (2) restricted-
response essay.
The following are the general guidelines in constructing good essay questions:
1. Clearly define the intended learning outcome to be assessed by the essay test.
To design effective essay questions or prompts, the specific intended learning outcomes
are identified. Appropriate direct verbs that most closely match the ability that learners
should demonstrate must be used. These include verbs such as compose, analyze,
interpret, explain, and justify, among others.
2. Refrain from using essay tests for intended learning outcomes that are better assessed
by other kinds of assessment.
It is important to take into consideration the limitations of essay tests when planning and
deciding what assessment method to employ for an intended learning outcome.
3. Clearly define and situate the task within a problem situation as well as the type of
thinking required to answer the test.
Essay questions or prompts should provide clear and well-defined tasks to the learners.
It is important to carefully choose the directive verb, to write clearly the object or focus
of the directive verb, and to delimit the scope of the task.
4. Present tasks that are fair, reasonable, and realistic to the students
Essay questions should contain tasks or questions that students will be able to do or
address. These include those that are within the level of instruction/training, expertise,
and experience of the students.
5. Be specific in the prompts about the time allotment and criteria for grading the response.
Essay prompts and directions should indicate the approximate time given to the
students to answer the essay questions to guide them on how much time they should
allocate for each item, especially if several essay questions are presented. How the
responses are to be graded or rated should also be clarified to guide the students on
what to include in their responses.
In order to establish the validity and reliability of an assessment tool, you need to
know the different ways of establishing test validity and reliability.
There are different factors that affect the reliability of a measure. The reliability of a
measure can be high or low, depending on the following factors:
1. The number of items in a test - the more items a test has, the higher the likelihood of
reliability. The probability of obtaining consistent scores is high because of the large pool of
items.
2. Individual differences of participants - every participant possesses characteristics that
affect their performance in a test, such as fatigue, concentration, innate ability,
perseverance, and motivation. These individual factors change over time and affect the
consistency of the answers in a test.
3. External environment - the external environment may include room temperature, noise
level, depth of instruction, exposure to materials, and quality of instruction, which could
affect changes in the responses of examinees in a test.
There are different ways in determining the reliability of a test. The specific kind of
reliability will depend on the (1) variable you are measuring, (2) type of test, and (3)
number of versions of the test.
The different types of reliability and how they are established are described below.
1. Linear Regression
Linear regression is demonstrated when you have two variables that are measured,
such as two sets of scores in a test taken at two different times by the same participants.
When the two scores are plotted in a graph (with X- and Y-axes), they tend to form a
straight line. The straight line formed for the two sets of scores can produce a linear
regression. When a straight line is formed, we can say that there is a correlation between
the two sets of scores. This correlation can be seen in a graph called a scatterplot. Each
point in the scatterplot is a respondent with two scores (one for each test).
2. Computation of Pearson r correlation
The index of the linear regression is called a correlation coefficient. When the points
in a scatterplot tend to fall within the linear line, the correlation is said to be strong. When
the direction of the scatterplot is directly proportional, the correlation coefficient will have a
positive value. If the line is inverse, the correlation coefficient will have a negative value.
The statistical analysis used to determine the correlation coefficient is called the Pearson r.
How the Pearson r is obtained is illustrated below.
Suppose that a teacher gave a 20-item spelling test of two-syllable words on Monday
and again on Tuesday. The teacher wanted to determine the reliability of the two sets of
scores by computing the Pearson r.
Formula:
              N(ΣXY) – (ΣX)(ΣY)
r = ---------------------------------------
    √ [NΣX² - (ΣX)²][NΣY² - (ΣY)²]
The value of a correlation coefficient does not exceed 1.00 or -1.00. A value of 1.00
or -1.00 indicates perfect correlation. In tests of reliability, though, we aim for a high positive
correlation, which means that there is consistency in the way the students answered the
tests taken.
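A minimal sketch of this computation follows; the Monday and Tuesday spelling scores are hypothetical stand-ins for the example above:

```python
import math

def pearson_r(x, y):
    # Pearson r from the raw-score formula given above.
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y))
    sx, sy = sum(x), sum(y)
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
    return num / den

monday = [18, 15, 12, 20, 10]   # hypothetical Monday spelling scores
tuesday = [17, 16, 11, 19, 12]  # hypothetical Tuesday spelling scores
print(round(pearson_r(monday, tuesday), 2))  # 0.95: close to 1.0 = consistent
```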
When the value of the correlation coefficient is positive, it means that the higher the
scores in X, the higher the scores in Y. This is called a positive correlation. In the case of
the two spelling scores, a positive correlation is obtained. When the value of the correlation
coefficient is negative, it means that the higher the scores in X, the lower the scores in Y,
and vice versa. This is called a negative correlation. When the same test is administered to
the same group of participants, a positive correlation usually indicates reliability or
consistency of the scores.
The strength of the correlation also indicates the strength of the reliability of the test.
This is indicated by the value of the correlation coefficient. The closer the value is to 1.00
or -1.00, the stronger the correlation. Below is the guide:
The correlation obtained between two variables could be due to chance. In order to
determine whether the correlation is free of such error, it is tested for significance. When a
correlation is significant, it means that the relationship between the two variables is very
unlikely to have occurred by chance.
Suppose that five students answered a checklist about their hygiene with a scale of
1 to 5, wherein the following are the corresponding scores:
5-always, 4-often, 3-sometimes, 2-rarely, 1-never
The checklist has five items. The teacher wanted to determine if the items have
internal consistency.
The internal consistency of the responses in the attitude toward teaching is 0.10,
indicating low internal consistency.
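One common index of internal consistency is Cronbach's alpha, mentioned earlier under CTT. A minimal sketch follows; the 1-to-5 ratings (rows = students, columns = the five checklist items) are hypothetical, and population variances are used, as is common for alpha:

```python
# Hypothetical ratings: rows = students, columns = the five checklist items.
ratings = [
    [5, 4, 5, 4, 5],
    [3, 3, 2, 3, 3],
    [4, 4, 4, 5, 4],
    [2, 1, 2, 2, 1],
    [4, 5, 4, 4, 4],
]

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

k = len(ratings[0])                                   # number of items
item_vars = [variance([row[j] for row in ratings]) for j in range(k)]
total_var = variance([sum(row) for row in ratings])   # variance of total scores
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))  # close to 1.0 = high internal consistency
```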
The scores given by the 3 raters are first summed for each demonstration. The mean
of the sums of ratings is obtained (8.4). The mean is subtracted from each sum of
ratings (D). Each difference is squared (D²), and then the sum of squares is computed
(ΣD² = 33.2). The mean and the sum of squared differences are substituted in
Kendall's formula. In the formula, m is the number of raters.
12ƩD² 12(33.2)
W = -------------------- = ----------------- = 0.37
m²(N)(N² - 1) 3²(5)(5² - 1)
W = coefficient of concordance
D = the difference between the individual sum of ranks of the raters and the
average of the sum of ranks of the object or individuals
m = no. of judges or raters
N = no. of objects or individuals being rated
A Kendall's W coefficient value of 0.37 indicates the degree of agreement of the three
raters on the five demonstrations. There is moderate concordance among the three raters
because the value is far from 1.00. Kendall's W can be interpreted in the same way as the
Pearson r.
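A minimal sketch of the computation follows; the 3 × 5 matrix of ratings (rows = raters, columns = demonstrations) is hypothetical but chosen to reproduce the worked values above (mean sum = 8.4, ΣD² = 33.2):

```python
# Hypothetical ratings: rows = raters, columns = demonstrations.
ratings = [
    [4, 4, 3, 2, 2],   # rater 1
    [4, 3, 3, 2, 2],   # rater 2
    [4, 3, 3, 2, 1],   # rater 3
]
m = len(ratings)                 # number of raters
N = len(ratings[0])              # number of demonstrations rated

sums = [sum(r[j] for r in ratings) for j in range(N)]  # sum per demonstration
mean_sum = sum(sums) / N                               # 8.4 in the example
sum_d2 = sum((s - mean_sum) ** 2 for s in sums)        # 33.2 in the example

w = (12 * sum_d2) / (m ** 2 * N * (N ** 2 - 1))
print(round(w, 2))  # 0.37: moderate agreement among the raters
```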
What is test validity?
Item analysis procedures allow teachers to discover items that are ambiguous, irrelevant,
too easy or difficult, and non-discriminating. Item analysis also enhances the technical quality
of an examination, facilitates classroom instruction, and identifies the areas of a student's
weakness, providing information for specific remediation. There are two important
characteristics of an item: (1) item difficulty and (2) discrimination index. Item difficulty is the
number of students who are able to answer the item correctly divided by the total number of
students.
Correct answer is 3
Step 1:
97 - 3 79 - 1 55 - 1
96 - 3 77 - 3 54 - 4
95 - 3 76 - 4 51 - 3
94 - 3 75 - 2 50 - 5
93 - 3 73 - 5 49 - 1
92 - 3 71 - 1 48 - 2
91 - 3 70 - 4 47 - 4
90 - 3 69 - 3 45 - 4
89 - 3 68 - 3 44 - 3
88 - 3 67 - 5 43 - 2
87 - 1 66 - 1 42 - 3
85 - 2 65 - 2 41 - 1
84 - 4 64 - 3 40 - 3
83 - 5 60 - 3 35 - 1
81 - 5 59 - 5 32 - 1
Item 21:
Options 1 2 3* 4 5
------------------------------------------------------------------------
Upper (15) 1 1 10 1 2
Lower (15) 5 2 4 3 1
------------------------------------------------------------------------
*Correct answer
Interpretation of D values:
Decision Table:
Note: A good distracter attracts students in the lower group more than in the upper
group
From the example above:
Option 1 – good
Option 2 – good
Option 3 – good
Option 4 – good
Option 5 – poor
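A minimal sketch of the difficulty and discrimination computations for Item 21 follows, using the option counts in the table above (15 examinees per group):

```python
# Option counts for Item 21, from the upper and lower groups above.
upper = {"1": 1, "2": 1, "3": 10, "4": 1, "5": 2}  # upper group (15 examinees)
lower = {"1": 5, "2": 2, "3": 4, "4": 3, "5": 1}   # lower group (15 examinees)
key = "3"
group_size = 15

# Difficulty: proportion of examinees in both groups answering correctly.
p = (upper[key] + lower[key]) / (2 * group_size)   # 14/30 = 0.47

# Discrimination index: proportion correct in upper group minus lower group.
d = (upper[key] - lower[key]) / group_size         # 6/15 = 0.40
print(round(p, 2), round(d, 2))
```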
Interpreting Test Scores:
Raw scores are the scores obtained when tests are checked. These scores may or may not
represent a student's ability in the subject nor his capacity to learn the subject. It is
sometimes necessary to weight the scores, and weighting may be done by either dividing or
multiplying the score on one section of a test by a number calculated to give the desired
weight to a particular exercise. Raw scores remain meaningless unless interpreted. One way
of interpreting scores is by means of transmutation.
TG = (Score/No. of items) x 50 + 50
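To illustrate, a raw score of 40 on a 50-item test transmutes to TG = (40/50) × 50 + 50 = 90.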
When developing a teacher-made test, it is good to have items that are easy,
average, and difficult, with positive discrimination indices. If you are developing a
standardized test, the rule is more stringent, as it aims for average items (neither too easy
nor too difficult) whose discrimination index is at least 0.3.
Lesson 7 – Organization of Test Data Using Tables and Graphs
Test data are better appreciated and communicated if they are arranged, organized,
and presented in a clear and concise manner. Good presentation requires designing a
table that can be read easily and quickly. Tables and graphs are common tools that help
readers better understand the test results that are conveyed to concerned groups like
teachers, students, parents, administrators, or researchers, and which are used as a basis
for developing programs to improve the learning of students.
The Textual Form: It is utilized when data to be presented are purely qualitative or when
very few numbers are involved.
The Tabular Form: A statistical table is a more effective device for presenting data. A
statistical table has four essential components:
1. Table heading – shows the table number and title
2. The Body – main part of the table (quantitative information)
3. The Stubs – classifications or categories which are presented as values of a variable
4. The Boxheads – the captions that appear above the columns
[Figure: a sample line graph plotting percentages (0%-100%) across eight categories]
[Figure: a sample pie chart for the same eight categories]
FREQUENCY DISTRIBUTION:
Ll + Ul
Xm = -----------------
2
Example:
Class interval f Xm Class Boundaries
----------------------------------------------------------------------------
50 – 54 4 52 49.5 – 54.5
45 – 49 7 47 44.5 – 49.5
40 – 44 12 42 39.5 – 44.5
35 – 39 10 37 34.5 – 39.5
30 – 34 9 32 29.5 – 34.5
25 – 29 6 27 24.5 – 29.5
20 – 24 2 22 19.5 – 24.5
--------------------------------------------------------------------------
i= 5 n = 50
Construction of a FD:
1. Determine the Range (R) by taking the difference of the highest score and the lowest
score.
2. Decide on the number of class intervals. Compute the number of intervals, n, by using
the formula: n = 1 + 3.3 logN, where n = No. of class intervals, N = population
3. Divide the range by the desired # of class interval to get the interval size (i). i = R/n
4. Using the lowest score as lower limit, add (i – 1) to it to obtain the upper limit of the
desired class interval.
5. The lower limit of the 2nd interval (and so on) may be obtained by adding the class size
to the lower limit of the first interval.
Example
The following are the scores of the 3rd year BSEd students in a Statistics test.
43 58 21 24 31 49 40 51 55 28
50 33 62 30 25 39 59 29 36 42
38 46 42 16 50 41 37 35 40 52
47 35 57 55 36 45 32 45 42 36
Solution:
R = 62 – 16 = 46
n = 1 + 3.3logN
  = 1 + 3.3log40
  = 6.29 or 7 (round up if there is a decimal; if the result is a whole number,
    add an extra class to accommodate all the data)
i = R/n = 46/7 = 6.57 or 7
X Tally f____
58 – 64 lll 3
51 – 57 llll 5
44 – 50 llll – ll 7
37 – 43 llll – llll 10
30 – 36 llll – llll 9
23 – 29 llll 4
16 – 22 ll 2____
i= 7 n = 40
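A minimal sketch that follows the five construction steps on the scores above; it reproduces the seven classes of width 7 and their frequencies:

```python
import math

# The 40 Statistics-test scores from the example above.
scores = [43, 58, 21, 24, 31, 49, 40, 51, 55, 28,
          50, 33, 62, 30, 25, 39, 59, 29, 36, 42,
          38, 46, 42, 16, 50, 41, 37, 35, 40, 52,
          47, 35, 57, 55, 36, 45, 32, 45, 42, 36]

r = max(scores) - min(scores)                             # step 1: 62 - 16 = 46
n_classes = math.ceil(1 + 3.3 * math.log10(len(scores)))  # step 2: 7
i = math.ceil(r / n_classes)                              # step 3: 7

lower = min(scores)              # step 4: lowest score as first lower limit
while lower <= max(scores):
    upper = lower + i - 1        # upper limit = lower limit + (i - 1)
    f = sum(lower <= s <= upper for s in scores)
    print(f"{lower} - {upper}: {f}")
    lower += i                   # step 5: next lower limit
```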
Derived Frequencies:
1. Relative Frequency Distribution
2. Cumulative Frequency Distribution
3. Cumulative Percentage Frequency Distribution
rf = f / n x 100%
Cpf = cf / n x 100%
Frequency Polygon:
Activity 7:
Consider the following raw data on an arithmetic test:
56 42 68 56 42 78 54 53 56
55 62 44 48 55 57 37 62 47
66 65 54 72 52 42 68 39 38
50 52 47 62 82 41 48 42 60
28 47 48 56 45 58 70 80 67
a. Construct a FD
b. Determine the class marks, class boundaries,
relative frequency, cf< and cf>, cpf < and cpf>
c. Construct a frequency polygon
The word "measures of central tendency" means the central location or point of
convergence of a set of values. Test scores have a tendency to converge at a central
value. This value Is the average of the set of scores. In other words, a measure of central
tendency gives a single value that represents a given set of scores. Three commonly-used
measures of central tendency or measures of central location are the mean, the median,
and the mode.
Mean – the center of gravity of the distribution and the most widely used measure. The
mean is sensitive to extreme scores.
x̅ = Σ X / n sample mean
µ= ΣX/N population mean
Example: The following are the family sizes of a sample of 10 households in a slum area.
2, 3, 4, 2, 5, 3, 7, 4, 3, 7
n = 10, ΣX = 40
x̅ = ΣX / n = 40 / 10 = 4
Midpoint Formula:
X f Xm fXm
--------------------------------------------------------
50 – 54 4 52 208
45 – 49 7 47 329
40 – 44 12 42 504
35 – 39 10 37 370
30 – 34 9 32 288
25 – 29 6 27 162
20 – 24 2 22 44
--------------------------------------------------------
i= 5 n = 50 Σ fXm = 1,905

x̅ = ΣfXm / n = 1,905 / 50 = 38.1
Median – the score-point which divides the distribution into two equal parts; it is the value
below which 50% of the data lie; it is not sensitive to extreme scores.
(n+1) th
mdn = ------ score from the lowest
2
Example 1: Find the median of the following sample data:
6, 15, 8, 42, 18, 24, 23,
n=7
mdn = 18
Example 2: Find the median of the following sample data:
92, 98, 98, 100, 102, 108, 120, 121, 132, 140
n = 10
mdn = (102 + 108) / 2 = 105 (the average of the two middle scores)
n/2 - cfP
mdn = Xlb + ----------- i
fm
Example:
X f cf
--------------------------------------------------------------
50 – 54 4 50
45 – 49 7 46
40 – 44 12 39
35 – 39 10 27
30 – 34 9 17
25 – 29 6 8
20 – 24 2 2
--------------------------------------------------------------
i= 5 n = 50
n/2 = 50/2 = 25
n/2 - cfP
Mdn = Xlb + ----------- i
fm
25 - 17
Mdn = 34.5 + ----------- (5)
10
Mdn = 34.5 + (0.8) 5
Mdn = 34.5 + 4
Mdn = 38.5
For ungrouped data, the mode is obtained by mere inspection. While it is possible
for a set of values to have no mode, it is also possible for other sets to have more than one
mode.
Example:
4, 5, 8, 8, 8, 9, 12, 12, 15, 19, 20
Mo = 8
Formulas:
1. Mo = 3(Mdn) – 2(Mean)
Mo = 3(38.5) – 2 (38.1)
Mo = 115.5 – 76.2
Mo = 39.3
d1
2. Mo = Xlb + ----------- i
d1 + d 2
Example:
X f modal class: 40 - 44
------------------------------ Xlb = 39.5
50 – 54 4 d1 = 2
45 – 49 7 d2 = 5
40 – 44 12
35 – 39 10 Mo = 39.5 + (2/(2+5)) 5
30 – 34 9 Mo = 39.5 + (2/7) 5
25 – 29 6 Mo = 39.5 + (.286) 5
20 – 24 2 Mo = 39.5 + 1.43
------------------------------ Mo = 40.93
i= 5 n = 50
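A minimal sketch that reproduces the grouped-data statistics computed above (x̅ = 38.1, Mdn = 38.5, and Mo = 39.3 via Formula 1) from the same frequency table:

```python
# The frequency table above, listed in ascending order: (ll, ul, f).
classes = [(20, 24, 2), (25, 29, 6), (30, 34, 9), (35, 39, 10),
           (40, 44, 12), (45, 49, 7), (50, 54, 4)]
i = 5
n = sum(f for _, _, f in classes)  # 50

# Mean via class marks: sum of f * Xm over n.
mean = sum(f * (ll + ul) / 2 for ll, ul, f in classes) / n

# Median: find the class containing the (n/2)th score, then interpolate.
cf = 0
for ll, ul, f in classes:
    if cf + f >= n / 2:
        median = (ll - 0.5) + ((n / 2 - cf) / f) * i  # Xlb + ((n/2 - cfP)/fm) i
        break
    cf += f

mode = 3 * median - 2 * mean       # empirical Formula 1 above
print(mean, median, round(mode, 1))  # 38.1, 38.5, 39.3
```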
Scales of Measurement:
1. Nominal Scale – classifies elements into two or more categories or classes (gender,
religion, etc.)
2. Ordinal Scale – ranks the individuals in terms of the degree to which they possess a
characteristic of interest
3. Interval Scale – in addition to ordering scores from highest to lowest, establishes a
uniform unit in the scale so that any distance between two consecutive
scores is of equal magnitude.
4. Ratio Scale – in addition to being an interval scale, also has an absolute zero in the
scale (height, weight, area, volume, speed, etc.).
Skewness:
A distribution takes the form of a bell-shaped or normal curve if the mean, the median,
and the mode are equal. It becomes positively skewed if the mean is greater than the
median and negatively skewed if the mean is less than the median. As a general rule, the
closer the coefficient of skewness is to zero, the less skewed the distribution, and the
farther this coefficient is from zero, the more skewed the distribution.
3(x̅ – Mdn.)
Sk = -------------------
S
The performance of the students is said to be satisfactory or very satisfactory when
the curve is negatively skewed, while it is unsatisfactory when it is positively skewed.
The curve is positively skewed if it tails off to the right and negatively skewed if it tails
off to the left.
Example: Given the following scores, find the 1st and 3rd Quartiles
90, 85, 86, 109, 105, 88, 100, 85, 105, 110, 112, 100

Upper half:
112
110
109    Q3 = (109 + 105) ÷ 2 = 107
105
105
100
------------------------------------------------------------
Lower half:
100
90
88     Q1 = (88 + 86) ÷ 2 = 87
86
85
85
n/4 - cfP
Q1 = Xlb + ------------- i
fq
3n/4 - cfP
Q3 = Xlb + ------------- i
fq
(Standard Dev.) S = √ [ Σ(X – x̅)² / (n – 1) ]
Example:
19,434.60
(Variance) S² = --------------- = 329.4
60 – 1
Coded Formula:
       nΣfX'² - (ΣfX')²
S² = -------------------- (i²)
          n(n – 1)

            nΣfX'² - (ΣfX')²
S = √ [ -------------------- (i²) ]
              n(n – 1)
X f_____X’___fX’ fX’²___
90 – 98 3 4 12 48
81 – 89 8 3 24 72
72 – 80 12 2 24 48
63 – 71 11 1 11 11
54 – 62 10 0 0 0
45 – 53 6 -1 -6 6
36 – 44 5 -2 -10 20
27 – 35 3 -3 -9 27
18 – 26 2 -4 -8 32____
i=9 n = 60 Σ fX’ = 38 Σ fX’² = 264
Variance:
60(264) – (38)²
S² = ---------------------- (9²)
60(60 – 1)
15840 – 1444
S² = ------------------- (81)
3540
S² = (14,396 ÷ 3,540) 81
S² = 329.4
Standard Deviation:
S = √329.4
S = 18.15
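A minimal sketch that checks the coded-formula computation above from the table's totals (n = 60, i = 9, ΣfX' = 38, ΣfX'² = 264):

```python
import math

# Totals from the coded frequency table above.
n, i = 60, 9
sum_fx = 38
sum_fx2 = 264

# Coded formula for the variance, then the square root for the SD.
variance = ((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))) * i ** 2
sd = math.sqrt(variance)
print(round(variance, 1), round(sd, 2))  # 329.4 and 18.15
```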
Standard Scores:
There are many kinds of standard scores. The most useful is the z-score, which
expresses a raw score in relation to the mean and standard deviation. In other words, it
shows how the standard deviation is used: we transform a raw score into a z-score.
Z = (X – x̅) ÷ S, where X = raw score
Example:
On the first day of the final examination week, Larry took the tests for three subjects
– mathematics, economics, and philosophy. Although he felt that his preparation was the
same for these three tests, he believed he did a very good job on the philosophy test. The
test results were released 5 days later and Larry gathered the following information:
Math: Mean = 60, S = 8, and his raw score is 70
Economics: Mean = 72, S = 6, and his raw score is 78
Philosophy: Mean = 85, S = 5, and his raw score is 82
On which test did Larry perform best? Worst? (Assume that the 3 subjects have
the same number of items.)
Best in Math: His score is 1.25 standard deviations above the mean under the normal curve.
Worst in Philosophy: While his Philosophy score of 82 is numerically higher than his Math
score of 70, the z-score is -0.6, which means that his score is more than half a standard
deviation below the average performance of the whole class.
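A minimal sketch of the comparison in Larry's example, expressing each raw score in standard-deviation units from its class mean:

```python
# Class means, standard deviations, and Larry's raw scores from the example.
tests = {
    "Math":       {"mean": 60, "sd": 8, "raw": 70},
    "Economics":  {"mean": 72, "sd": 6, "raw": 78},
    "Philosophy": {"mean": 85, "sd": 5, "raw": 82},
}

for subject, t in tests.items():
    z = (t["raw"] - t["mean"]) / t["sd"]  # z = (X - mean) / S
    print(f"{subject}: z = {z:.2f}")
# Math: 1.25 (best), Economics: 1.00, Philosophy: -0.60 (worst)
```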
You should note that the range, semi-interquartile range or quartile deviation, and
standard deviation discussed earlier are expressed in the units of the original scores. Thus,
they are measures of absolute dispersion. Let us say one distribution of test scores in
mathematics may have a standard deviation of 10, and another distribution of scores in
science may have a standard deviation of 5. If we want to compare the variability of the
two distributions, can we say that the distribution with a standard deviation of 10 has twice
the variability of the one with standard deviation of 5? Consider another example. One
distribution has a standard deviation of 8 meters, while another has a standard deviation of
₱15.00. Can we say that the latter distribution is more spread than the former?
Or can we compare standard deviations in meter and pesos? The answer seems obvious.
We cannot conclude anything by direct comparison of measures of absolute dispersion
because they are of different units or different categories. In the first example, one is the
distribution of mathematics scores while the other is the distribution of science scores. To
make the comparison logical, we need a measure of relative dispersion which is
dimensionless or "unit free." This measure of relative dispersion is also known as the
coefficient of variation. It is simply the ratio of the standard deviation of a distribution
to the mean of the distribution, expressed as a percentage.
cv = (s ÷ x̅ ) 100%
Applying this to Larry's example: cv(Math) = (8 ÷ 60) × 100% ≈ 13.3%; cv(Economics) =
(6 ÷ 72) × 100% ≈ 8.3%; cv(Philosophy) = (5 ÷ 85) × 100% ≈ 5.9%. Scores in Math are
more dispersed than in the other two subjects.
It is also called Gaussian distribution, named after Carl Friedrich Gauss. This
distribution has been used as a standard reference for many statistical decisions in the
field of research and evaluation.
In the discussion about the normal distribution, the standard deviation becomes more
useful because it is used to determine the percentage of scores that fall within a certain
number of standard deviations from the mean. As a result of many experiments, empirical
rules have been established pertaining to the areas under the normal curve. In
assessment, the area in the curve refers to the number of scores that fall within a specific
standard deviation from the mean score. In other words, each portion under the curve
contains a fixed percentage of cases as follows:
68% of the scores fall between one standard deviation below and above the mean
95% of the scores fall between two standard deviations below and above the mean
99.7% of the scores fall between three standard deviations below and above the mean
There are two other standard scores aside from the z-score: the T-score and the
stanine score.
The T-score:
As you see in the computation of the z-score, it can give you a negative number,
which simply means the score is below the mean. However, communicating a negative z-
score as below the mean may not be understandable to others. We will not even say to
students that they got a negative z-score. A z-score may also be a repeating or
nonrepeating decimal, which may not be comfortable for others. One option is to convert a
z-score into a T-score, which is a transformed standard score. To do this, there is a
scaling in which the mean of 0 in a z-score is transformed into a mean of 50, and the
standard deviation in the z-score is multiplied by 10. The corresponding equation is:
T-score = 50 + 10z
For example, for a z-score of -2:
T-score = 50 + 10(Z)
= 50 + 10(-2)
= 50 – 20
= 30
Looking back at the Philosophy score of Larry in our previous example, which
resulted in a z-score of -0.6, T-score equivalent is:
T = 50+ 10(-.6)
= 50 – 6
= 44
T-scores are convenient because scores below 0 and above 100 are virtually
impossible; in fact, 99.7% of the time, a T-score will be between 20 and 80, because these
limits are three standard deviations below and above the mean, respectively.
Stanine Scores.
Another standard score is the stanine, shortened from "standard nine." With nine in its
name, the scores are on a nine-point scale. In a z-score distribution, the mean is 0 and
the standard deviation is 1. In this scale, the mean is 5 and the standard deviation is 2.
Each stanine is one-half standard deviation wide. Like the T-score, the stanine score can
be calculated from the z-score by multiplying the z-score by 2 and adding 5. That is:
Stanine = 2Z + 5
Going back to our example of Larry’s score in Philosophy that is 82 with a Z-score
of -0.6, its stanine equivalent is:
Stanine = 2(-0.6) + 5
= -1.2 + 5
= 3.8 or 4
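A minimal sketch that converts a z-score to its T-score and stanine equivalents using the two formulas above; the clamping to the 1-9 band is an implementation detail, since the raw formula can fall outside the scale for extreme z-scores:

```python
def t_score(z: float) -> float:
    # T-score: mean 50, standard deviation 10.
    return 50 + 10 * z

def stanine(z: float) -> int:
    # Stanine: mean 5, standard deviation 2, rounded to the 9-point scale.
    s = round(2 * z + 5)
    return min(9, max(1, s))  # clamp to the 1-9 band

z = -0.6                      # Larry's Philosophy z-score from the example
print(t_score(z), stanine(z))  # 44.0 and 4
```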
On the assumption that stanine scores are normally distributed, the percentages of cases
in each band or range of scores in the scale are as follows:
With the above percentage distribution of scores in each stanine, you can directly
convert a set of raw scores into stanine scores. Simply arrange the raw scores from lowest
to highest, and with the percentage of scores in each stanine, you can directly assign the
appropriate stanine score to each raw score. On the interpretation of stanine scores, let us
say Kate has a stanine score of 2. We can see that her score is somewhere in the low or
bottom 7 percent of the scores. In the same way, if John's score is in the 6th stanine, it
falls between the 60th and 77th percentile, simply because 60 percent of the scores are
below the 6th stanine and 23 percent of the scores are above the 6th stanine. For
qualitative description, stanine scores of 1, 2, and 3 are considered below average; 4,
5, and 6 are average; and 7, 8, and 9 are above average. Thus, you can say that your
score of 86 in English is above average. Similarly, Kate's score is below average while that
of John is average.
[Figure: the normal distribution and the equivalence of the commonly used standard scores]
What are measures of covariability?
There are situations when we look at examinees' performance measures and ask
ourselves what could explain such scores. Measures of covariability tell us, to a certain
extent, the relationship between two tests or two factors. Admittedly, a score one gets may
be due not only to a single factor but to other factors, directly or indirectly observable,
which are also related to one another. This section will be limited to introducing two scores
that are hypothesized to be related to one another.
When we are interested in finding the degree of relationship between two scores,
we are dealing with the correlation between two variables. The statistical measure is the
correlation coefficient, an index number that ranges from -1.0 to +1.0. The value -1.0
indicates a perfect negative correlation, 0.00 no correlation at all, and +1.0 a perfect
positive correlation. Many correlation studies have been conducted in the field of
assessment and research, but correlation coefficients rarely take the exact values 0.00,
+1.0, or -1.0; instead, the observed values fall somewhere in between.
To measure the relationship of two variables, we use the Pearson Product Moment
Correlation (r)
              N(ΣXY) – (ΣX)(ΣY)
r = ---------------------------------------
    √ [NΣX² - (ΣX)²][NΣY² - (ΣY)²]
This formula was introduced earlier in Lesson 6, when you were taught how
to compute the reliability coefficient of scores. This is the same Pearson r, but this time,
it is used to establish the relationship between two sets of data.
The mathematical process gave a correlation coefficient of 0.705 between
performance scores in reading and problem-solving. This coefficient indicates a strong
or very high relationship between the two variables.
[Figure: overlap between reading and problem-solving scores, showing about 49% shared
variance (r² ≈ 0.705² ≈ 0.497)]
Activity 8:
Class Interval f
60-65 2
55-59 5
50-54 6
45-49 8
40-44 11
35-39 10
30-34 11
25-29 20
20-24 17
15-19 6
10-14 4
i= n=
2. A common exit examination is given to 400 students in a university. The scores are
normally distributed, and the mean is 78 with a standard deviation of 6. Daniel had a score
of 72 and Jane a score of 84. What are the corresponding z-scores of Daniel and Jane?
How many students would be expected to score between the scores of Daniel and Jane?
Explain your answer.
3. James obtained a score of 40 in his Mathematics test and 34 in his Reading test. The
class mean score in Mathematics is 45 with a standard deviation of 4 while in Reading, the
mean score is 50 with a standard deviation of 7. On which test did James do better
compared to the rest of the class? Explain your work.
4. Following are sets of scores on two variables: X for Reading Comprehension and Y for
Reasoning Skills, administered to a sample of students.
X: 11 9 15 7 5 9 8 4 8 11
Y: 13 8 14 9 8 7 7 5 10 12
What are the purposes of grading and reporting learners' test performance?
There are various reasons why we assign grades and report learners' test
performance. Grades are alphabetical or numerical symbols/marks that indicate the
degree to which learners are able to achieve the intended learning outcomes. Grades do
not exist in a vacuum but are part of the instructional process and serve as a feedback
loop between the teacher and learners. They are one of the ways to communicate the level
of learning of the learners in specific course content. They give feedback on what specific
topic/s learners have mastered and what they need to focus on more when they review for
summative assessments or final exams. In a way, grades serve as a motivator for learners
to study and do better in the next tests to maintain or improve their final grade.
Grades also give the parents, who have the greatest stake in learners' education,
information about their children's achievements. They provide teachers some bases for
improving their teaching and learning practices and for identifying learners who need
further educational intervention. They are also useful to school administrators who want to
evaluate the effectiveness of the instructional programs in developing the needed skills
and competencies of the learners.
There are various ways to score and grade results in multiple-choice tests.
Traditionally, the two most commonly used scoring methods are Number Right Scoring
(NR) and Negative Marking (NM).
Number Right Scoring (NR) entails assigning positive values only to correct
answers while giving a score of zero to incorrect answers. The test score is the sum of the
scores for correct responses. One major concern with this scoring method is that learners
may get the correct answer by guessing, thus affecting the test's reliability and validity.
Negative Marking (NM) entails assigning positive values to correct answers while
penalizing the learners for incorrect responses (i.e., the right-minus-wrong correction method).
In this model, a fraction of the number of wrong answers is subtracted from the number of
correct answers. Other models for this type of scoring method include (1) giving a positive
score to a correct answer while assigning no mark for omitted items and (2) rewarding
learners for not guessing by awarding points rather than penalizing learners for incorrect
answers. The recommended penalty for an incorrect answer is 1/(n-1), where n stands for
the number of choices.
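A minimal sketch contrasting NR scoring with NM scoring using the 1/(n-1) penalty described above; the answer key and the learner's responses are hypothetical (None = omitted item):

```python
# Hypothetical answer key and one learner's responses (None = omitted).
key       = ["A", "C", "B", "D", "A", "C"]
responses = ["A", "C", "D", "D", None, "B"]
n_choices = 4                                     # options per item -> penalty 1/3

right = sum(r == k for r, k in zip(responses, key))
wrong = sum(r is not None and r != k for r, k in zip(responses, key))

nr_score = right                                  # NR: zero for wrong or omitted
nm_score = right - wrong / (n_choices - 1)        # NM: right minus wrong/(n-1)
print(nr_score, round(nm_score, 2))               # 3 and 2.33
```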
Liberal Multiple-choice Test - It allows learners to select more than one answer to
a question if they feel uncertain which option or alternative is correct.
Elimination Testing (ET) - It instructs learners to cross out all alternatives they
consider to be incorrect.
Confidence Weighting (CW) - It asks learners to indicate what they believe is the
correct answer and how confident they are about their choice.
For this type of scoring, an item can be assigned different scores, depending on the
learners' response.
The Multiple Answers Scoring Method allows learners to have multiple answers for
each item. In this method, learners are instructed that each item has at least one correct
answer, or they are told how many answers are correct. Items can be scored as solved only
if all the correct response options are marked and none of the incorrect ones. Marked
incorrect options can lead to negative scores. Thayn (2011) found that multiple-answer
and single-answer items have comparable item difficulty and reliability
indices. However, the multiple answers method is more difficult to solve, has lower
discrimination power, and takes more time to answer.
For example, for a final examination in algebra, the Mathematics Department can
set the passing score (e.g., the 75th percentile rank) based on the norms derived from the
scores of learners for the past three years. To do this, the department will need to collect
the previous scores of learners on the same or equivalent final exams and apply the formula
for standard scores to compute the percentile ranks for each range of scores. On the
other hand, passing grades/scores are usually set by the department or the school based
on their standards, e.g., A (90-100 percent), B (80-89 percent), C (70-79 percent), or F
(0-69 percent).
Holistic Scoring involves giving a single, overall assessment score for an essay,
writing composition, or other performance-type assessment as a whole. Although the
scoring rubric for holistic scoring lays out specific criteria for evaluating a task, raters do
not assign a score for each criterion. Instead, as they read a writing task or observe a
performance task, they balance strengths and weaknesses among the various criteria to
arrive at an overall assessment. Holistic scoring is considered efficient in terms of time
and cost. It also does not penalize poor performance based on only one aspect (e.g.,
content, delivery, organization, vocabulary, or coherence for an oral presentation). However,
it is said that holistic scoring does not provide sufficient diagnostic information about the
student's ability, as it does not identify the areas for improvement, and it is difficult to
interpret as it does not detail the basis for evaluation.
Sample of Holistic Rubric for an Oral Presentation
3 – Excellent Speaker
• Includes 10 – 12 changes in hand gestures
• No apparent inappropriate facial expressions
• Utilizes proper voice inflection
• Can create the proper ambiance for the poem
2 – Good Speaker
• Includes 5 – 9 changes in hand gestures
• Few inappropriate facial expressions
• Has some inappropriate voice inflection changes
• Almost creates the proper ambiance
1 – Poor Speaker
• Includes 1 – 4 changes in hand gestures
• Many inappropriate facial expressions
• Uses a monotone voice
• Cannot create the proper ambiance
Analytic Scoring involves assessing each aspect of a performance task (e.g.,
essay writing, oral presentation, class debate, and research paper) and assigning a score
for each criterion. Sometimes, an overall score is given by averaging the scores in all
criteria. One advantage of analytic scoring is its reliability. It also provides diagnostic
information, as it identifies learners' strengths and weaknesses in specific areas, which
can eventually serve as the basis for remedial instruction. However, it is more
time-consuming and, therefore, expensive. It is also prone to the halo effect, wherein
scores in one scale may influence the ratings of others. It is also difficult to create.
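To make analytic scoring concrete, here is a minimal sketch; the criteria, ratings, and weights are illustrative assumptions, not a prescribed rubric.

ratings = {"content": 4, "organization": 3, "vocabulary": 4, "coherence": 2}
weights = {"content": 0.4, "organization": 0.2, "vocabulary": 0.2, "coherence": 0.2}

# Overall score: weighted average of the per-criterion ratings (1-5 scale assumed).
overall = sum(ratings[c] * weights[c] for c in ratings)
print(overall)   # 3.4

# The per-criterion ratings double as diagnostic feedback: here, coherence
# (rated 2) is the area flagged for remedial instruction.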
Primary Trait Scoring focuses on only one aspect or criterion of a task, and a
learner's performance is evaluated on only one trait. This scoring system defines a primary
trait in the task that will then be scored. For example, if a teacher in a political science
class asks his students to write an essay on the advantages and disadvantages of Martial
Law (i.e., the writing task), the basic question addressed in scoring is, "Did the writer
successfully accomplish the purpose of this task?" With this focus, the teacher would
ignore errors in the conventions of written language and instead focus on overall rhetorical
effectiveness. One disadvantage of this scoring scheme is that it is often difficult to focus
exclusively on one trait, such that other traits may be included when scoring. Thus, it is
important that a very detailed scoring guide is used for each specific task.
Test scores can take the form of any of the following: (1) raw scores, (2) percentage
scores, and (3) derived scores. Under the derived scores are grades that are based on
criterion-referenced and norm-referenced grading systems.
1. Raw Score is simply the number of items answered correctly on a test. A raw score
   provides an indication of the variability in the performance of students in the class.
   However, a raw score has no meaning unless you know what the test is measuring and
   how many items it contains. A raw score also does not mean much because it cannot
   be compared with a standard or with the performance of another learner or of the class
   as a whole. Raw scores may be useful if everyone knows the test and what it covers,
   how many possible right answers there are, and how learners typically do in the test.
2. Percentage Score. This refers to the percent of items answered correctly in a test. The
   number of items answered correctly is typically converted to percent based on the total
   possible score. The percentage score is interpreted as the percent of content, skills, or
   knowledge that the learner has a solid grasp of. Just like the raw score, the percentage
   score has limitations because there is no way of comparing the percentage correct
   obtained in one test with the percentage correct in another test with a different difficulty
   level.
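A quick illustration of the conversion and its limitation, with invented numbers:

raw, total = 36, 40                 # invented example
percentage = 100 * raw / total
print(raw, percentage)              # 36 and 90.0; neither means much without
                                    # knowing the test's content and difficulty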
3.1 Pass or Fail Grade. This type of score is most appropriate if the test or assessment
    is used primarily or entirely to make a pass-or-fail decision. In this type of scoring, a
    standard or cut-off score is preset, and a learner is given a score of pass if he or
    she surpasses the expected level of performance or the cut-off score.
    Pass or fail grading has the following advantages: (1) it takes pressure off the
    learners in getting a high letter or numerical grade, allowing them to relax while still
    getting the needed education; (2) it gives learners a clear-cut idea of their strengths
    and weaknesses; and (3) it allows learners to focus on true understanding or
    learning of the course content rather than on specific details that will help them
    receive a high letter or numerical score.
3.2 Letter Grade. This is one of the most commonly used grading systems. Letter
    grades are usually composed of a five-level grading scale labeled from A to E or F,
    with A representing the highest level of achievement or performance, and E or F, the
    lowest grade, representing a failing grade. These are often used for all forms of
    learners' work, such as quizzes, essays, projects, and assignments.
3.3 Plus (+) and Minus (-) Letter Grades. This grading provides a more detailed
    description of the level of learners' achievement or task/test performance by
    dividing each grade category into three levels, such that a grade of A can be
    assigned as A+, A, and A-; B as B+, B, and B-; and so on. Plus (+) and minus (-)
    grades provide a finer discrimination between achievement or performance levels.
    They also increase the accuracy of grades as a reflection of a learner's performance,
    enhance student motivation (i.e., to get a high A rather than an A-), and discriminate
    among performances in a very similar pool of learners, such as those in advanced
    courses or star sections. However, the +/- grading system is viewed as unfair,
    particularly for learners in the highest category; it creates stress for learners; and it is
    more difficult for teachers, as they need to deal with more grade categories when
    grading learners.
Examples of the descriptors for plus (+) and minus (-) letter grades are presented below:
3.4 Categorical Grades. This system of grading is generally more descriptive than
letter grades, especially if coupled with verbal labels. Verbal labels eliminate the
need for a key or legend to explain what each grade category means.
4.1 Developmental Score. This is a score that has been transformed from a raw score
    and reflects the average performance at age and grade levels. There are two kinds of
    developmental scores: (1) grade-equivalent and (2) age-equivalent scores.
4.1.1 Grade-Equivalent Score is described as both a growth score and a status score.
      The grade equivalent of a given raw score on any test indicates the grade level at
      which the typical learner earns this raw score. It describes the test performance of a
      learner in terms of a grade level and the months since the beginning of the
      school year. A decimal point is used between the grade and month in grade
      equivalents. For example, a score of 7.5 means that the learner did as well as a
      typical Grade 7 learner taking the test at the end of the fifth month of the school year.
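A sketch of how a grade-equivalent score might be read off a norm table. The table below is hypothetical, and the linear interpolation (treating the decimal as tenths of a ten-month school year) is a simplifying assumption; real norm tables are published with the test.

norm_table = {6.0: 28, 7.0: 34, 8.0: 40}    # hypothetical: grade level -> median raw score

def grade_equivalent(raw):
    """Interpolate a raw score between adjacent grade-level medians."""
    grades = sorted(norm_table)
    for g1, g2 in zip(grades, grades[1:]):
        s1, s2 = norm_table[g1], norm_table[g2]
        if s1 <= raw <= s2:
            return round(g1 + (g2 - g1) * (raw - s1) / (s2 - s1), 1)
    return None                              # raw score outside the table's range

print(grade_equivalent(37))   # 7.5: like a typical Grade 7 learner in the fifth month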
4.1.2 Age-Equivalent Score indicates the age level at which a typical learner obtains
      such a raw score. It reflects a learner's performance in terms of chronological
      age as compared with those in the norm group. Age-equivalent scores are written
      with a hyphen between years and months. For example, a learner's score of 11-5
      means that his age equivalent is 11 years and 5 months, indicating a test
      performance that is similar to that of 11½-year-olds in the norm group.
4.2 Percentile Ranks. The percentile rank is useful in cases where comparison between
    individual scores relative to their positions in the entire group is a major concern. One
    example is the Licensure Examination for Teachers (LET), where average scores are
    actually percentile ranks. An examinee who surpassed 90% of all the examinees gets a
    score of 90, and an examinee who belongs to the top 2% gets 98. Percentile ranks are
    also valuable tools for the comparison of two or more measurements, each taken from a
    different set of data.
4.3 Stanine Score. This system expresses test results in nine equal steps, which range
    from one (lowest) to nine (highest). Percentile ranks are grouped into stanines, each
    with a corresponding verbal interpretation.
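A sketch of the usual percentile-rank-to-stanine conversion. The cut points (4, 11, 23, 40, 60, 77, 89, 96) are the standard ones that divide the normal curve into nine half-standard-deviation steps, though boundary conventions can vary across test publishers.

STANINE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]   # standard percentile boundaries

def stanine(percentile_rank):
    """Map a percentile rank (0-100) to a stanine from 1 (lowest) to 9 (highest)."""
    return 1 + sum(percentile_rank >= c for c in STANINE_CUTS)

print(stanine(50))   # 5: average
print(stanine(95))   # 8
print(stanine(97))   # 9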
4.4 Standard Scores. These are raw scores that are converted into a common scale of
    measurement that provides a meaningful description of the individual scores within the
    distribution. Two types of standard scores are the z-score and the T-score (please
    see the previous discussions).
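A minimal sketch of the two standard scores, using invented class scores: the z-score expresses a raw score in standard-deviation units from the mean, and the T-score rescales z to a mean of 50 and a standard deviation of 10.

from statistics import mean, stdev

scores = [70, 74, 78, 82, 86, 90]    # invented class scores

def z_score(x, data):
    """Distance of x from the mean in (sample) standard-deviation units."""
    return (x - mean(data)) / stdev(data)

def t_score(x, data):
    """T-score: z rescaled to mean 50, standard deviation 10."""
    return 50 + 10 * z_score(x, data)

print(round(z_score(86, scores), 2))   # 0.8: 86 lies 0.8 SD above the mean
print(round(t_score(86, scores), 1))   # 58.0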
Utmost care should be observed to ensure that grading practices are equitable, fair,
and meaningful to learners and stakeholders. When constructing a test or performance
task, the methods and criteria for grading learners' responses or answers should be set
and specified. The following are the general guidelines in grading tests or performance
tasks:
1. Be guided by the desired learning outcomes. The learners should be informed early
   on of what is expected of them insofar as learning outcomes are concerned, as well as
   how they will be assessed and graded in the test.
2. Inform learners what scoring methods are to be used. Learners should be made
   aware, before the start of testing, whether their test responses are to be scored based
   on the number right, negative marking, or non-conventional scoring methods. As such,
   the learners will be guided on how to mark their responses during the test.
3. Decide on what type of test scores to use. It is important that different types of
   grading schemes be used for different tests, assignments, or performance tasks.
   Learners should also be informed at the start of what grading system is to be used for
   a particular test or task.
Essays require more time to grade than the other types of traditional tests. Grading
essay tests can also be influenced by extraneous factors, such as the legibility of learners'
handwriting and raters' biases. It is therefore important that you devise essay question
prompts and grading procedures that will minimize the threats to validity and reliability.
1. Determine the type of rubric to use. There are two basic types of rubric: the holistic
   and the analytic scoring systems. Holistic rubrics require evaluating the essay taking
   into consideration all the criteria. Only a single score is given based on the overall
   judgment of the learner's writing composition. A holistic rubric is viewed to be more
   convenient for teachers, as it requires fewer areas or aspects of writing to evaluate.
   However, it does not provide specific feedback on what course topics/contents or
   criteria the students are weak at and need to improve on. On the other hand, the
   analytic scoring system requires that the essay be evaluated based on each of the
   criteria. It provides useful feedback on the learner's strengths and weaknesses for
   each course content or criterion.
2. Prepare the rubric. In developing a rubric, the skills and competencies related to essay
   writing should first be identified. These skills and competencies represent the criteria.
   Then, performance benchmarks and point values are determined. Performance
   benchmarks can be numerical categories, but the most frequently used are descriptors
   with a corresponding rating scale.
3. Score one essay question at a time. This is to ensure that the same thinking and
   standards are applied to all learners in the class. The rater should try to avoid any
   distraction or interruption when evaluating the same item.
4. Be conscious of your own biases when evaluating a paper. The rater should not be
   affected by learners' handwriting, writing style, length of responses, and other factors.
   He/she should stick to the criteria included in the rubric when evaluating the essay.
5. Review initial scores and comments before giving the final rating. This is important,
   especially for essays that were initially given a barely passing or failing grade.
6. Get two or more raters for essays that are high-stakes, such as those used for
   admission, placement, or scholarship screening purposes. The final grade will be the
   average of all the ratings given.
7. Write comments next to the learner's responses to provide feedback on how well he
   or she has performed in the essay test.
The Final Grade for each subject is then computed by getting the average of the four
quarterly grades, as seen below:
Final Grade = (Q1 + Q2 + Q3 + Q4) ÷ 4
The General Grade, on the other hand, is computed by getting the average of the
Final Grades for all subject areas. Each subject area has equal weight:
General Grade = (sum of the Final Grades of all subject areas) ÷ (number of subject areas)
All grades reflected in the report card are reported as whole numbers.
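A sketch of this computation, with invented subjects and quarterly grades; rounding halves up reflects the usual practice of reporting whole-number grades.

def half_up(x):
    """Round halves up (Python's built-in round() rounds ties to even)."""
    return int(x + 0.5)

quarterly = {                         # invented subjects and quarterly grades
    "Mathematics": [86, 88, 90, 87],
    "Science":     [82, 85, 84, 86],
    "English":     [90, 92, 89, 93],
}

# Final Grade per subject: average of the four quarterly grades.
final_grades = {s: half_up(sum(q) / 4) for s, q in quarterly.items()}

# General Grade: equally weighted average of the Final Grades, as a whole number.
general_grade = half_up(sum(final_grades.values()) / len(final_grades))

print(final_grades)    # {'Mathematics': 88, 'Science': 84, 'English': 91}
print(general_grade)   # 88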
If test results are to serve as a mechanism to inform learners on what, where and
how they should improve in their work and learning as a whole, then an effective and
efficient reporting system should be in place. Teachers should come up with guidelines
and processes on how grades are to be communicated and presented to make them clear,
understandable, and relevant to the recipients. Unless the test results are communicated
effectively, the purpose of assessment is not likely to be achieved.
First, the rationale or purpose of testing and the nature of the tests administered to
the learners should be clearly explained. This is especially true for high-stakes testing, such
as those used for placement, admission, grade-level promotion, and graduation decisions, as
well as for IQ or psychological testing, which are more likely to be misinterpreted. It is
important to inform the students and their parents that tests are only one of several tools to
assess their performance or achievement and that they are not evaluated on the basis of
one test alone.
References:
1. Balagtas, Marilyn U., David, Adonis P., Golla, Evangeline F., Magno, Carlo P., and
   Valladolid, Violeta C. (2020). "Assessment in Learning 1," 1st Edition, Rex Book
   Store, Inc., Sampaloc, Manila, Philippines.
2. Anderson, L. W., and Krathwohl, D. R. (2001). "A Taxonomy for Learning, Teaching, and
   Assessing: A Revision of Bloom's Taxonomy of Educational Objectives." New
   York: Longman.
3. Navarro, Rosita L., and Santos, Rosita G. (2012). "Assessment of Learning Outcomes 1,"
   2nd Edition, Lorimar Publishing House, Manila, Philippines.
4. Baker, E. L. (1992). "The Role of Domain Specifications in Improving the Technical
   Quality of Performance Assessment" (CSE Tech. Rep.). Los Angeles: University of
   California, Center for Research on Evaluation, Standards, and Student Testing.
5. Hernon, P., and Dugan, R. (2004). "Outcomes Assessment in Higher Education." Westport:
   Libraries Unlimited.
6. Mehrens, W. A. (1992). "Using Performance Assessment for Accountability Purposes,"
   Educational Measurement: Issues and Practice.
7. Tuckman, B. (1993). "The Essay Test: A Look at the Advantages and Disadvantages," NASSP
   Bulletin.
8. Zaremba, S., and Schultz, M. (1993). "An Analysis of Traditional Classroom Assessment
   Techniques and Discussion," ED 365404.