Cohen 1981
This paper is based on the author's doctoral dissertation conducted at The University of
I would like to thank James Kulik, Wilbert McKeachie, Robert Blackburn, and David Starks
for their valuable comments and suggestions.
ratings may relate more to the characteristics of the rater than to the amount learned.
The weight of the evidence, however, suggests that student ratings are not influenced
to an undue extent by external factors such as student characteristics, course
characteristics, or teacher characteristics (cf. McKeachie, 1979).
Many investigators in this area have studied the relationship between ratings and
student learning. There is by no means, though, total agreement on the extent of this
relationship. In fact, Kulik and McKeachie (1975) state that "the most impressive
thing about studies relating class achievement to class ratings of instructors is the
inconsistency of the results" (p. 235). While some investigators have found strong
positive correlations between ratings and student learning (e.g., Centra, 1977; Costin,
1978; Frey, 1973), others have found equally strong negative correlations (e.g.,
Bendig, 1953b; Rodin & Rodin, 1972). Reviewers acknowledge that in general there
seem to be small to moderate correlations between ratings and learning (Kulik &
Kulik, 1974; Kulik & McKeachie, 1975; McKeachie, 1979; Seibert, 1979).
There are numerous unanswered questions concerning this body of literature. Can
overall conclusions be drawn concerning the validity of student ratings as they relate
to student learning? Or does the relationship vary depending on different circum-
stances? Does the relationship between ratings and student learning depend on the
type of rating dimension, instructor experience, subject matter, or the type of
correlation coefficient computed? At this point we do not know what factors
contribute to the diversity of findings in this area. Our ignorance is not due to the
lack of investigations conducted in this area. The relationships between ratings and
achievement have been investigated many times by many investigators. Rather, the
problem is that traditional research reviews have failed to collect studies systemati-
cally and to synthesize their results effectively.
In his presidential address to the American Educational Research Association,
Glass (1976) described an alternative to the conventional review. He referred to his
method as meta-analysis, or the analysis of analyses. He defined this method formally
as the statistical analysis of a large collection of results from individual studies for
the purpose of integrating findings. Reviewers who carry out meta-analyses first
locate studies of an issue by clearly specified procedures. They then characterize the
outcomes and features of these studies in quantitative or quasi-quantitative terms.
Finally, meta-analysts use multivariate techniques to describe findings and relate
characteristics of the studies to outcomes.
In the years since Glass's address, a number of researchers have used this method
to synthesize results of psychological and educational research. For example, recent
reports of the use of meta-analysis examined effects in the following areas: mid-term
student-rating feedback (Cohen, 1980a); class size and achievement (Glass & Smith,
1979); gender differences in nonverbal communication (Hall, 1978); individualized
instruction at the college level (Cohen, Ebeling, & Kulik, 1981; Kulik, Cohen, &
Ebeling, 1980; C-L. Kulik, Kulik, & Cohen, 1980; J. A. Kulik, Kulik, & Cohen,
1979a, 1979b, 1980); open versus traditional education (Peterson, 1979); psychother-
apy and counseling (Smith & Glass, 1977); and experimenter effects (Rosenthal,
1976). A more detailed review of meta-analytic methodology and its application is
presented in Cohen (1980b).
The present study used Glass's method to synthesize research on the relationship
between student ratings of instruction and student achievement. The research in-
cluded in this synthesis came from "field" studies of actual college classes. Although
most studies were carried out in lower-division courses, a wide variety of subject
matter areas was sampled.
This meta-analytic research will enhance our understanding of the student-rating
literature in three ways: First, the synthesis will lead to general conclusions on the
overall relationship between ratings and achievement. Second, the research will show
the conditions under which the relationship is positive or negative, weak or strong.
Finally, the meta-analysis will provide some idea about the representativeness of the
literature in this area: about areas that have been studied thoroughly and areas that
have been studied too little. The results of the synthesis will be of use to administrators
and faculty members who use ratings to improve teaching. They will also be of use
to educational researchers who need a better picture of the state-of-the-art in research
on the evaluation of teaching.
TABLE I
Major Reviews on the Student Rating/Achievement Relationship

Review                    Conclusions                                   Limitations
Centra (1979)             Relationship between ratings and
                          achievement significant, but limited
                          range of both variables may suppress
                          correlations
Costin, Greenough,        No comment on the rating/achievement          Some studies use grades as
& Menges (1971)           relationship                                  achievement criterion
Doyle (1975)              Fairly consistent low-to-moderate positive    No account for study
                          correlation between general ratings and       feature effects
                          student learning
Follman (1974)            Relationship between ratings and              Not all studies use class as
                          achievement about 0.40 across all school      unit of analysis
                          levels, a "low" relationship                  No distinction between
                                                                        rating dimensions
Gage (1974)               Correlations between ratings and              Not all studies use class as
                          achievement are positive and low to           unit of analysis
                          medium in
McKeachie (1979) 7        Validity of ratings reasonably encouraging
                          with respect to achievement on course
Mintzes (1977) 7          Weak positive correlation coefficients,       Not all studies use class as
                          averaging 0.20 to 0.30                        unit of analysis
Seibert (1979) 5          Students rate most highly instructors from    Not all studies use class as
                          whom they learn most                          unit of analysis
instructor experience, type of rating items or instrument used, and the number of
sections used.
where both ratings and achievement are adjusted for initial student ability, although
this was the case in the Rodin and Rodin (1972) study.
Instructor Experience
Sullivan and Skanes (1974) reported different validity coefficients for experienced
instructors (those who had taught more than 1 year) and inexperienced instructors
(those who had taught less than 1 year). For experienced full-time instructors, the
correlation between ratings and student achievement was .68; for inexperienced
teachers the correlation was only .13. For a psychology course, Sullivan and Skanes
were able to compute separate coefficients for full-time psychology faculty and
graduate student (part-time) instructors. Again, they found that for the full-time
faculty, the validity coefficient was quite high (.53); for the graduate student instruc-
tors, the validity coefficient was trivial (r = .01). A number of reviewers (e.g., Kulik
& Kulik, 1974; Seibert, 1979) also suggested that the different degrees of instructor
experience found in different validity studies contributed to the diversity of their
Rating Instrument Bias
Although some multisection validity studies made use of standardized rating
instruments and scales, others used teacher-constructed scales or even single-item
ratings. Marsh and Overall (1980) maintained that the lack of consistent results for
this body of studies may be due to a lack of well-defined factor structures in the
instruments used in many studies.
Number of Sections
A number of reviewers have commented on the small number of sections on which
multisection validity studies are typically based. For instance, Vecchio (1980) said
that he places "little confidence" in the magnitude of the obtained relationships
because of the instability of correlations derived from small sample sizes. Similarly,
Marsh and Overall (1980) concluded that the small number of sections in most
validity studies is not adequate and contributes to the variability in findings. Kulik
and McKeachie (1975) further pointed out that large correlations (positive or
negative) tend to occur when sample sizes are small; more modest correlations appear
when adequate sample sizes are used. Finally, Doyle (1975) suggested that in order
to derive a stable validity coefficient, at least 30 sections should be used in a
multisection study.
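The instability that these reviewers describe can be illustrated with a small simulation. The sketch below is not part of the original analysis: it assumes a hypothetical true rating/achievement correlation of .40, draws section means from a bivariate normal distribution, and compares the spread of the resulting validity coefficients for 5, 10, and 30 sections. The function names and the number of simulated studies are illustrative choices.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def simulate_validity_coefficients(n_sections, true_r, n_studies, rng):
    """Simulate n_studies multisection studies, each with n_sections sections."""
    rs = []
    for _ in range(n_studies):
        ratings, achievement = [], []
        for _ in range(n_sections):
            g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
            ratings.append(g1)
            # Mix g1 and g2 so the population correlation equals true_r.
            achievement.append(true_r * g1 + sqrt(1 - true_r ** 2) * g2)
        rs.append(pearson_r(ratings, achievement))
    return rs

def spread(values):
    """Sample standard deviation."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

rng = random.Random(0)  # fixed seed so the run is repeatable
for n_sections in (5, 10, 30):
    rs = simulate_validity_coefficients(n_sections, 0.4, 2000, rng)
    print(n_sections, round(min(rs), 2), round(max(rs), 2), round(spread(rs), 2))
```

Under these assumptions the coefficients from five-section studies scatter far more widely than those from 30-section studies, including occasional large negative values, even though the underlying relationship is the same in every simulated study.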
features (e.g., source of publication; study year) that have correlated with outcomes
in other meta-analyses will be explored in the present study.
This section describes the procedures used to locate studies, to determine which
studies would be included in the analyses, to describe study characteristics, to
quantify outcomes of these studies, and to analyze the data.
Locating Studies
The first step in the meta-analysis was to locate as many studies as possible that
dealt with the relationship between student ratings of instruction and student
achievement. The primary sources for these studies were the major reviews listed in
Table I and three library data bases computer-searched through Lockheed's
DIALOG Online Information Service. The data bases included: (a) Comprehensive
Dissertation Abstracts; (b) ERIC, a data base on educational materials from the
Educational Resources Information Center, consisting of the two files Research in
Education and Current Index to Journals in Education; and (c) Psychological Abstracts.
The investigator developed a special set of key words for each computer search in
order to take into account the distinctive features of the different data bases. For
example, in the ERIC data base the key words included: "academic achievement" or
"grades," "higher education," and "course evaluation" or "student evaluation of
teacher performance." Branching from the bibliographies in articles located through
the original searches provided a third source of studies for the meta-analysis. In
addition, the investigator monitored recent issues of relevant educational and psy-
chological journals.
In all, the bibliographic searches yielded a total of approximately 450 titles. Most
of the articles, however, failed in one way or another to meet the criteria established
for the analysis. On the basis of information about the articles contained in titles or
abstracts, the initial pool of 450 titles was reduced to 105 potentially useful documents.
The investigator obtained copies of these 105 documents and read them in full. Of
the 105 reports, 41 contained data that could be used in the meta-analysis. These 41
documents reported on 68 separate multisection courses relating student ratings of
instruction to student achievement. The 41 studies are listed in Table II.
TABLE II
Studies Used in the Meta-analysis

                                      Overall Ratings    Specific Ratings
Study                                 Correlated with    Correlated with
                                      Achievement        Achievement
Bendig (1953a)                        OC, OI
Bendig (1953b)                        OI
Benton & Scott (1976)                 OC, OI             SK, R, ST, D, I, E, SP
Bolton, Bonge, & Marr (1979)          OC, OI             SK, R, ST, E
Borg & Hamilton (1956)                OI
Braskamp, Caulley, & Costin (1979)    OC, OI             SK, R, ST, I
Bryson (1974)                         OI                 SK, R, ST, I, F, E
Centra (1977)                         OC, OI             SK, R, ST, D, E
Chase & Keene (1979)                  OI                 SK, ST, D, E, SP
Cohen & Berger (1970)                 OC, OI             SK, ST, D, I
Costin (1978)                         OI                 SK
Crooks & Smock (1974)                 OI
Doyle & Crichton (1978)               OI                 SK, R, I, SP
Doyle & Whitely (1974)                OI                 SK, R
Elliott (1950)                        OC, OI             SK, R, F, E
Ellis & Rickard (1977)                OC, OI             SK
Endo & Della-Piana (1976)             OI                 SK, E
Frey (1973)                           OI                 SK, R, ST, D, E, SP
Frey (1976)                           OI                 SK, R, ST, D, E, SP
Frey, Leonard, & Beatty (1975)        OI                 SK, R, ST, D, E, SP
Greenwood et al. (1976)               OI                 ST, D
Grush & Costin (1975)                 OI                 SK
Hoffman (1978)                        OC, OI             SK, I, E
Marsh, Fleiner, & Thomas (1975)       OC, OI             SK, R, ST, D, SP
Marsh & Overall (1980)                OC, OI             ST, D, I, E, SP
McKeachie, Lin, & Mann (1971)         OI                 SK, R, ST, D, I, F
Mintzes (1977)                        OI                 SK, R, ST, D, I, F, SP
Morsh, Burgess, & Smith (1956)        OI                 SK, R
Murdock (1969)                        OI
Rankin (1965)                         OC                 SP
Remmers, Martin, & Elliott (1949)     OC, OI             SK, F, E
Reynolds & Hansvick (1978)            OI
Rodin & Rodin (1972)                  OI
Rubinstein & Mitchell (1970)          OC, OI             I
Solomon, Rosenberg, & Bezdek          OI                 R, SP
Sorge & Kline (1973)                  OI                 SK, R, ST, I
Spencer & Dick (1965)                 OI
Sullivan & Skanes (1974)              OI
Turner & Thompson (1974)              OI                 D
Wherry (1952)                         OI
Whitely & Doyle (1979)                OI

Note. Rating designations are: OC = Overall Course, OI = Overall Instructor,
SK = Skill, R = Rapport, ST = Structure, D = Difficulty, I = Interaction,
F = Feedback, E = Evaluation, SP = Student Progress.
analysis most studies reported a single effect size. Only 10 of the 41 studies provided
data for more than one multisection course. For these studies, averaging effect sizes
across multisection courses would provide greater independence among studies, but
at a potential loss of conceptual meaning. Therefore, the investigator chose to
calculate an effect size for each multisection course rather than for each paper.
TABLE III
Categories for Describing Studies and Number of Multisection Courses in Each Category

Coding Category                                          Number of Courses
Methodological Features
  Assignment of students to sections
    No evidence of equivalence                           47
    Evidence of equivalence                               7
    Random assignment                                    14
  Control for scoring bias in achievement criterion
    Nonobjective test                                     9
    Objective test                                       38
  Control for author bias in achievement criterion
    Departmental test                                    58
    Standardized test                                     6
  Control for bias in evaluating achievement
    Tests graded by teacher                              10
    Tests graded by external evaluator                   28
  Control for rating instrument bias
    Nonstandardized ratings                              23
    Standardized ratings                                 45
  Statistical control for ability
    No                                                   43
    Yes                                                  25
  Control for prior knowledge of instructor
    No                                                   26
    Yes                                                   6
  Time at which ratings administered
    After final grades                                    4
    Before final grades                                  56
  Length of instruction
    Fraction of a semester                                2
    Whole semester                                       66
  Teacher autonomy
    Responsible for component of instruction             13
    Responsible for all instruction                      55
  Number of sections
    Less than 10                                         24
    10-19                                                19
    More than 19                                         25
  Overall study quality
    Low                                                   7
    Moderate                                             54
    High                                                  7
Ecological conditions
  Content emphasis on "hard" discipline
    "Soft" discipline                                    33
    "Hard" discipline                                    35
  Content emphasis on "pure" knowledge
    Applied                                               7
    Pure                                                 61
The next task in the meta-analysis was to describe quantitatively the outcomes of
the studies in the sample. The major outcomes of interest were: (1) the relationship
between student achievement and overall instructor rating; (2) the relationship
between student achievement and overall course rating; and (3) the relationship
between student achievement and different rating dimensions commonly found in
factor-analytic studies of student ratings.
The student achievement measure most commonly used in the studies was a
common final examination. In some cases a cumulative point total based on a
number of tests throughout the course was used. Final grades were only used as an
indicator of achievement if they were based strictly on objective achievement criteria
(e.g., total points derived from criteria such as exams, papers, lab reports). If the
article presented data for more than one achievement measure, final examination
scores were given preference, followed by total points, and then final grades.
Rating data were collected for both overall ratings and more specific rating
dimensions. The overall ratings were of two types: an overall instructor rating and an
overall course rating. Overall instructor rating data came from either a single rating
item concerning overall teaching effectiveness (e.g., "The instructor is an excellent
teacher") or from an average of all items or dimensions relating to the instructor's
effectiveness in a particular study. Overall course ratings were derived similarly.
Most commonly, a single rating item was used (e.g., "This is an excellent course").
Data were also collected for six dimensions of teaching. Kulik and McKeachie
(1975) identified four of these dimensions as "common" factors in their review of
factor-analytic studies of student ratings. These four dimensions are Skill, Rapport,
Structure, and Difficulty. The other two dimensions, Interaction and Feedback, were
described and interpreted by Isaacson et al. (1964). The six dimensions were defined
as follows:
1. Skill. The Skill dimension represents the overriding quality to which students
respond when rating instructors. Typical items are: "The instructor has a good
command of the subject matter." "The instructor gives clear explanations." "The
instructor teaches near the class level."
2. Rapport. The Rapport dimension includes items dealing with a teacher's em-
pathy, friendliness, approachability, and accessibility. Sample items are: "The instruc-
tor is friendly." "The instructor is permissive and flexible." "The instructor is
available to talk with students outside of class."
3. Structure. The Structure dimension describes how well the instructor planned
and organized the course. Typical items are: "The instructor has everything going
according to schedule." "The instructor uses class time well." "The instructor explains
course requirements."
4. Difficulty. The Difficulty dimension deals with the amount and difficulty of the
work the teacher expects of students. Typical items are: "The instructor assigned
difficult reading." "The instructor asked for more than students could get done."
"This course required more work than others of comparable credit hours."
5. Interaction. The Interaction dimension measures the degree to which students
are encouraged to share their ideas and become actively involved in class sessions.
Typical Interaction items are: "The instructor encourages students to express various
points of view." "The instructor encourages students to volunteer their own opinions."
"The instructor facilitates classroom discussion."
6. Feedback. The Feedback dimension measures the instructor's concern with the
quality of students' work. Standard items for this dimension are: "The instructor tells
students when they have done a particularly good job." "The instructor checks to see
if students have learned well before going on to new material." "The instructor keeps
students informed of their progress."
In addition to these six dimensions, data were collected on students' self-ratings of
their learning and student attitudes toward the subject being studied. If a study
presented results with other rating dimensions, these additional results were also
Data Analysis
The basic measure of effect size was Pearson's product-moment correlation. For
each rating dimension, mean class achievement was correlated with mean class
rating. Procedures outlined by Glass (1978) were used to convert various summary
statistics (e.g., t values, F values, chi-squared values) into product-moment correla-
tions. The use of these algebraic transformations resulted in a greater number of
usable studies in the final sample. Before conducting statistical analyses, Fisher's z-
transformation was applied to all correlation coefficients based on procedures
suggested by Glass and Stanley (1970). After performing the appropriate analysis,
Fisher Z scores were transformed back into the more interpretable correlation
Two sets of analyses were performed on the data. The first set of analyses described
the overall size and significance of the rating/achievement correlations for the
different rating dimensions. The second set of analyses determined the effect of study
characteristics on the magnitude of the rating/achievement correlations using corre-
lational and multiple regression techniques.
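The conversion and averaging steps described above can be sketched in a few lines. This is a minimal illustration rather than the procedure actually coded for the study: the conversion formulas are the standard textbook ones for t, for F with one numerator degree of freedom, and for chi-squared with one degree of freedom, and the confidence interval assumes a standard error of 1/sqrt(n - 3) in the Fisher z metric. All function names are hypothetical.

```python
from math import atanh, tanh, sqrt

def r_from_t(t, df):
    """Correlation implied by a t value with df degrees of freedom."""
    return t / sqrt(t * t + df)

def r_from_f(f, df_error):
    """Magnitude of r implied by F with 1 numerator df (sign comes from the study)."""
    return sqrt(f / (f + df_error))

def r_from_chi2(chi2, n):
    """Magnitude of r (phi) implied by a 1-df chi-squared on n cases."""
    return sqrt(chi2 / n)

def mean_r(correlations):
    """Average correlations in the Fisher z metric, then transform back to r."""
    zs = [atanh(r) for r in correlations]
    return tanh(sum(zs) / len(zs))

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% interval for r based on n units, assuming SE = 1/sqrt(n - 3)."""
    z = atanh(r)
    half_width = z_crit / sqrt(n - 3)
    return tanh(z - half_width), tanh(z + half_width)

# Averaging the two coefficients reported by Sullivan and Skanes (.68 and .13)
print(round(mean_r([0.68, 0.13]), 2))
# Interval for r = .43 based on 67 courses
print(tuple(round(v, 2) for v in fisher_ci(0.43, 67)))
```

Averaging in the z metric gives slightly more weight to large correlations than a simple arithmetic mean of r values would. With r = .43 and 67 courses the interval sketch returns roughly .21 to .61, although the exact formula used for the intervals in this study is not stated here.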
This section reports results of statistical analyses concerning the rating/achieve-
ment correlations. Findings are described in two areas: (a) overall effects and (b)
study characteristics and effect sizes.
Overall Effects
One of the major goals in meta-analysis is to reach overall conclusions about the
magnitude of effects. In this first set of analyses, descriptive statistics were used to
determine the overall size and significance of rating/achievement correlations for the
two general dimensions, seven specific teaching dimensions, and students' self-ratings
of their learning. The overall mean correlations, the number of multisection courses
on which the means are based, and the 95 percent confidence interval on the mean
population correlations are presented in Table IV.
Overall course rating. Correlations between an overall course rating and student
achievement were available for 22 of the 68 multisection courses located for this
meta-analysis. For 20 of the 22 courses, overall course rating was positively correlated
with student achievement; the correlation between overall course rating and student
achievement was negative in two courses. A total of 11 of the 22 correlations were
TABLE IV
Mean Rating/Achievement Correlational Effect Sizes

Rating Dimension      N     Mean Correlation    95% Confidence Interval
Overall Course        22     0.47                0.09, 0.73
Overall Instructor    67     0.43                0.21, 0.61
Skill                 40     0.50                0.23, 0.70
Rapport               28     0.31               -0.07, 0.61
Structure             27     0.47                0.11, 0.72
Difficulty            24    -0.02               -0.42, 0.39
Interaction           14     0.22               -0.36, 0.67
Feedback               5     0.31               -0.79, 0.94
Evaluation            25     0.23               -0.18, 0.58
Student Progress      14     0.47               -0.08, 0.80
FIGURE 1. Distribution of Course/Achievement correlations for 22 courses.
student achievement; for eight courses the correlation was negative. For 31 courses
the correlation coefficient reached statistical significance, and in 30 of those courses
it was significantly positive. Under the null hypothesis of no relationship between
overall instructor rating and student achievement, these results are very unlikely.
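One way to see how unlikely such counts are under the null hypothesis is a simple binomial calculation. The sketch below is illustrative only and is not the test reported in the study. It assumes that courses are independent, that positive and negative correlations are equally likely under the null hypothesis, and that each significance test was run at the .05 level. It uses two counts given in this section: 20 of 22 overall course correlations were positive, and 31 of 67 overall instructor correlations reached significance.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Sign test: chance of 20 or more positive correlations out of 22
# when each direction is equally likely.
p_direction = binom_upper_tail(20, 22, 0.5)

# Chance of 31 or more significant results out of 67 when each test
# has a 5 percent false-positive rate (assumed alpha).
p_significant = binom_upper_tail(31, 67, 0.05)

print(f"{p_direction:.2e}", f"{p_significant:.2e}")
```

Both probabilities are far below conventional significance levels under these assumptions, which is consistent with the conclusion that the pattern of results is very unlikely to arise by chance.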
The average correlation between overall instructor rating and student achievement
for the 67 multisection courses was .43, a moderately large effect. The 95 percent
confidence interval on the true population correlation ranged from .21 to .61. The
distribution of these 67 correlations is presented in Figure 2. Over half of the courses
had large positive correlations. Instructors whose students achieved the most were
also the ones who tended to receive the highest instructor ratings.
Skill. Correlations between Skill ratings and student achievement were generated
for 40 courses. Skill was positively correlated with student achievement in 37 courses;
it was negatively correlated with achievement in three courses. For 20 of the 40
courses the correlation coefficient was statistically significant, and in all of these
courses Skill ratings and achievement were positively related. For the 40 courses, the
average correlation equalled .50, a large effect.
FIGURE 2. Distribution of Instructor/Achievement correlations for 67 courses.
Difficulty and achievement significant, and that was in a negative direction. The null
hypothesis of no relationship between Difficulty and student achievement could not
be rejected. The mean correlation between Difficulty and student achievement was
-.02 for the 24 courses.
Interaction. For 14 courses, correlations between Interaction and student achieve-
ment were available. In 12 courses Interaction was positively related to achievement;
in one course it was negatively related to achievement; and in one course the
correlation between Interaction and achievement was zero. The correlation was
significant in four courses, and in each case it was significantly positive. The mean
correlation between Interaction and student achievement was .22 for the 14 courses.
Feedback. Only five courses provided correlations between Feedback and student
achievement. The correlations were positive for all five courses. However, only one
course showed a statistically significant positive correlation. The mean correlation
between Feedback and student achievement was .31 for the five courses.
Evaluation. One other rating dimension, Evaluation, was correlated with student
achievement in a number of studies. The Evaluation dimension measures the extent
to which students feel the evaluation instruments (e.g., papers, examinations) fairly
assess their ability. Twenty-five studies reported correlations between Evaluation and
student achievement. In 20 courses Evaluation was positively correlated with achieve-
ment; in three courses it was negatively correlated with achievement; and in two
courses the correlation between Evaluation and achievement was zero. Only four
courses showed statistically significant correlations, and in each case Evaluation was
positively correlated with student achievement. The mean correlation between Eval-
uation and student achievement was .23 for the 25 studies.
Student progress. It was also of interest to determine how well students' self-ratings
of their learning corresponded with their achievement. Correlations between Student
Progress and achievement were available for 14 courses. In 10 courses Student
Progress was positively correlated with student achievement; in two courses it was
negatively correlated; and in two courses the correlation between Student Progress
and achievement was zero. The correlations were statistically significant in four
courses, and in each case Student Progress was positively correlated with achieve-
ment. The mean correlation between Student Progress and student achievement was
.47 for the 14 courses.
Summary of overall effects. We can be relatively certain that the general course
and instructor dimensions relate quite strongly to student achievement. For both of
these dimensions, the mean rating/achievement correlational effect size is moderately
large, and the 95 percent confidence intervals around the true population means do
not span zero. This magnitude of effect size does not hold up for all teaching
dimensions, however. While large effect sizes are found for the Skill and Structure
dimensions, other dimensions such as Rapport, Interaction, Feedback, and Evalua-
tion show more modest effects. The Course Difficulty dimension shows no relation-
ship with student achievement. Finally, students' self-ratings of their learning corre-
late quite highly with student achievement.
analyses were conducted to determine whether studies that reported large effect sizes
differed systematically from those which produced small effect sizes. As a first step
in this set of analyses, zero-order correlations were computed between the 20 study
characteristic variables and the rating/achievement effect sizes. Then, to investigate
the possibility that a combination of variables might predict effect sizes more
accurately than a single predictor, a hierarchical multiple regression analysis (Cohen
& Cohen, 1975) was conducted. The hierarchical model requires the analyst to
specify in advance the order in which the independent variables enter the regression
equation. The model determines the partial correlation coefficients of each inde-
pendent variable at the point where the variable enters the equation, while also
indicating the cumulative R². Thus, the hierarchical procedure shows the unique
contribution of a specific independent variable to the total variance of the dependent
variable, when previously entered independent variables have been partialled.
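The logic of the hierarchical procedure can be sketched with a small example. The code below is an illustration under stated assumptions, not the analysis run for this study: the eight effect sizes and the three dummy-coded predictors are hypothetical values invented for the example, and the function name is an arbitrary choice. Each predictor is residualized on the predictors entered before it, and the increment in R² at each step is the squared correlation between the dependent variable and that residual.

```python
def center(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hierarchical_r2(y, predictors):
    """Return (name, R^2 increment, cumulative R^2) for predictors in entry order."""
    yc = center(y)
    ss_y = dot(yc, yc)
    basis = []        # residualized versions of predictors already entered
    cumulative = 0.0
    steps = []
    for name, x in predictors:
        resid = center(x)
        # Partial out every previously entered predictor.
        for q in basis:
            coef = dot(resid, q) / dot(q, q)
            resid = [r - coef * qi for r, qi in zip(resid, q)]
        ss_resid = dot(resid, resid)
        if ss_resid > 1e-12:
            increment = dot(yc, resid) ** 2 / (ss_resid * ss_y)
            basis.append(resid)
        else:
            increment = 0.0   # predictor is redundant with those already entered
        cumulative += increment
        steps.append((name, increment, cumulative))
    return steps

# Hypothetical data: Fisher z effect sizes for eight courses and three 0/1 codes.
effect_z = [0.20, 0.35, 0.55, 0.60, 0.15, 0.45, 0.70, 0.50]
ordered_predictors = [
    ("instructor experience", [0, 0, 1, 1, 0, 1, 1, 0]),
    ("timing of ratings",     [0, 0, 0, 1, 0, 0, 1, 0]),
    ("external grader",       [0, 1, 1, 1, 0, 0, 1, 1]),
]
for name, increment, cumulative in hierarchical_r2(effect_z, ordered_predictors):
    print(f"{name}: increment = {increment:.3f}, cumulative R^2 = {cumulative:.3f}")
```

Changing the order of entry changes the increment credited to each variable but not the final cumulative R², which is why the order must be fixed in advance on substantive grounds.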
The investigator selected the hierarchical multiple regression strategy for two
reasons. First, an examination of the correlation matrix showed that many of the
study characteristic variables were substantially intercorrelated. This problem of
multicollinearity is best dealt with by an ordered variance partitioning procedure
(Cohen & Cohen, 1975). Second, the hierarchical model is most appropriate when
independent variables can be ordered with regard to their causal priority. For the
present meta-analysis, the independent variables in the regression came from the set
of 19 variables used to describe characteristics in the sample of multisection courses.
The dependent variable in the regression analysis was the rating/achievement
correlation for the overall instructor rating dimension. Study characteristics that have
been hypothesized by other researchers to influence the magnitude of the rating/
achievement correlation were initially entered into the regression model. Following
this set of variables, the remaining study characteristics were entered. This resulted
in the following hierarchical ordering: Set A (knowledge of instructor, student
assignment, instructor experience, instructor autonomy, number of sections, control
for ability, timing of ratings, rating instrument bias); Set B (hard science, pure
knowledge, life studies, author bias, scoring bias, evaluation bias); and Set C
(institution, study year, source of study). The overall study quality variable was not
used in the regression analysis because it was based on other entered variables. In
addition, two study characteristic variables—course level and length of instruction—
could not explain the variation in rating/achievement correlations because there was
little variation on these study characteristics, and therefore, they were not used in the
regression analysis.
The hierarchical multiple regression procedure identified which independent vari-
ables significantly contributed to the variance of the overall instructor rating/achieve-
ment correlations. These significant variables were then entered into a separate
multiple regression equation. From this regression analysis, a prediction equation
and percent of total variance accounted for were computed.
The correlations between study characteristics and overall instructor rating/
achievement effect sizes for 67 courses are presented in Table V. Three variables
correlated significantly with effect size: control for bias in evaluating achievement;
time at which the ratings were administered; and instructor experience. The results
of the hierarchical regression analysis showed that only these three variables contrib-
uted significantly to the variance in effect sizes. Together, the three study character-
istic variables accounted for 31 percent of the variance in overall instructor rating/
achievement correlations. The regression model including these variables produced
the following equation (t values given in parentheses):
    1.183 + .088 (instructor experience) - .853 (timing of ratings) + .419 (evaluation bias).
    (3.96) (2.72)
This model shows that for graduate student instructors, the correlations between
overall instructor rating and student achievement averaged .34, while for full-time
faculty the correlation was .48. In terms of the time at which the ratings were
administered, the average correlation was much higher when students knew their
final grades (.85) than when they did not know their final grades (.38). When
achievement tests were graded by students' own instructors, the correlation between
overall instructor ratings and achievement was .15. The correlation averaged .52
when an external grader was used.
The present meta-analysis provides strong support for the validity of student
ratings as measures of teaching effectiveness. Teachers whose students do well on
achievement measures receive higher instructional ratings than teachers whose
TABLE V
Correlations of Study Characteristics With Overall Instructor/Achievement Effect Sizes
(N = 67)

Study Characteristic                                  Correlation with Effect Size
Assignment of students to sections                    -0.04
Control for scoring bias in achievement criterion      0.12
Control for author bias in achievement criterion       0.12
Control for bias in evaluating achievement             0.29*
Control for rating instrument bias                     0.15
Statistical control for ability                        0.05
Control for prior knowledge of instructor              0.03
Time at which ratings administered                    -0.43**
Length of instruction                                 -0.16
Teacher autonomy                                       0.12
Number of sections                                    -0.14
Overall study quality                                 -0.04
Content emphasis on hard discipline                   -0.01
Content emphasis on pure knowledge                     0.06
Content emphasis on life studies                      -0.06
Course level                                           0.13
Institutional setting                                 -0.06
Instructor experience                                  0.25*
Source of study                                       -0.04
Publication year                                       0.10

* p < 0.05
** p < 0.001
Overall Effects
The use of meta-analytic techniques also makes it possible to reach more exact
conclusions about the rating/achievement relationship. The first set of analyses in
the present investigation reported on the overall size and significance of rating/
achievement effects. Of prime importance was the correlation between the overall
instructor rating and student achievement. For 67 multisection courses the correlation
averaged .43, a moderately large effect. The magnitude of this correlation is probably
about as high as can be expected considering the restricted range of both mean
achievement scores and mean instructor ratings among different sections of a course.
Thus, in the typical study there was a strong tendency for students to rate most highly
teachers from whom they learned most.
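As an illustration of how an average correlation across studies can be computed, the sketch below pools correlations through Fisher's Z transformation, a common approach in meta-analysis. It is a minimal sketch, not a reproduction of the present analysis: the function name `mean_correlation` and the three input correlations are hypothetical, and equal weighting of studies is an assumption.

```python
import math

def mean_correlation(rs):
    """Average correlations by way of Fisher's Z transformation.

    Each r is converted to z = atanh(r), the z values are averaged
    with equal weight (an assumption), and the mean z is converted
    back to the correlation metric with tanh.
    """
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))

# Hypothetical rating/achievement correlations from three studies.
sample_rs = [0.10, 0.43, 0.70]
pooled = mean_correlation(sample_rs)
print(round(pooled, 3))
```

Averaging in the Z metric gives slightly more weight to large correlations than a plain arithmetic mean of the r values would, which is why the two summaries can differ when study results are spread widely.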
In addition to determining the overall rating/achievement relationship, the meta-
analysis also focused on the degree to which the more specific instructional rating
dimensions related to student achievement. The obtained results suggest that certain
aspects of teaching, as measured by student ratings, are more related to learning than
are others. Correlational effect sizes for both the Skill and Structure dimensions were
large, .50 and .47, respectively. It is not surprising that Skill ratings, which measure
teacher's instructional competence, correspond well with student achievement. We
would expect that the more skilled instructors facilitate greater learning in their
students than instructors who are less adept. Perhaps not as evident is the strong
Of particular interest in the present study were methodological features that other
investigators have hypothesized affect the rating/achievement relationship. For
instance, Leventhal (1975) maintained that random assignment of students is neces-
sary to be able to attribute differences in student achievement to different teachers.
When students self-select into sections with knowledge of teachers' reputations, the
relationship between ratings and achievement may become confounded. The meta-
analysis showed that studies in which students were randomly assigned to sections
produced findings no different from those of studies that did not control enrollment
procedures. Furthermore, whether or not students knew prior to enrollment which
instructors were teaching the course did not relate to effect size. Although random
assignment is preferable in multisection validity designs, it is often difficult to achieve
under the constraints of student registration procedures. Therefore, most studies in
this area have not randomly assigned students to sections. The present findings
suggest, though, that student section selection factors do not contribute to any
systematic bias in generalizing an overall rating/achievement effect.
Whether or not researchers statistically controlled for initial differences in student
ability did not affect the magnitude of the rating/achievement correlation. This result
supports the preliminary findings of Kulik and Kulik (1974). Based on nine inde-
pendent studies they calculated a median for both adjusted (part or partial) correla-
tions and unadjusted (raw) correlations. The median adjusted correlation found in
these studies was .27; the median unadjusted correlation was .23. Although the
present meta-analysis found a larger overall effect size than did the Kuliks, there still
was little difference between adjusted or unadjusted rating/achievement correlations.
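The distinction between adjusted and unadjusted correlations can be made concrete with the standard first-order partial correlation formula, which removes the linear influence of a third variable (here, initial student ability) from both ratings and achievement. The sketch below is illustrative only: the function name `partial_correlation` and the input values are hypothetical and are not taken from any study in the sample.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z.

    r_xy: raw correlation between ratings (x) and achievement (y)
    r_xz: correlation between ratings and ability (z)
    r_yz: correlation between achievement and ability
    """
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical values: ability relates weakly to ratings and
# moderately to achievement.
raw = 0.40
adjusted = partial_correlation(raw, r_xz=0.10, r_yz=0.30)
print(round(raw, 2), round(adjusted, 2))
```

When ability is only weakly related to section-mean ratings, as in this hypothetical case, the adjusted and raw coefficients stay close together, which is the pattern described above.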
The studies in the sample employed a variety of rating instruments, scales, and
individual items. Marsh and Overall (1980) have maintained that differences in
rating/achievement correlations may be due to the lack of well-defined factor
structures in most of the rating scales used. The present findings do not support this
speculation. First of all, nearly three-quarters of the studies used some sort of
standardized ratings. Only 11 of 41 studies used single-item or experimenter-con-
structed ratings. More importantly, there was no difference in the size of correlational
effects between studies using standardized ratings and those using unstandardized
Some reviewers have been concerned that rating/achievement correlations vary
according to the number of sections used in the study. In the present meta-analysis,
the number of sections on which correlations were based ranged from five to 121.
The relationship between number of sections and effect size was nonlinear. Actually,
number of sections correlated significantly with the absolute value of effect size;
studies using small numbers of sections tended to report either large positive or large
negative correlational effects. This supports the conclusions of other reviewers (e.g.,
Kulik & McKeachie, 1975; Marsh & Overall, 1980; Vecchio, 1980) that results from
studies using small numbers of sections are quite variable and difficult to interpret.
In the present instance, when including only studies that used 20 or more sections,
the average correlation between overall instructor rating and achievement was .37.
This compares quite favorably to the average effect computed over all studies.
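The instability of results based on few sections follows from the sampling error of a correlation coefficient. The sketch below computes an approximate 95 percent confidence interval for a correlation of .43 at the smallest and largest numbers of sections reported above (5 and 121). It assumes the usual Fisher Z approximation with standard error 1/sqrt(n - 3) and treats sections as independent observations; the function name `correlation_ci` is hypothetical.

```python
import math

def correlation_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation r based on
    n sections, using the Fisher Z approximation (assumed here)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

low_5, high_5 = correlation_ci(0.43, 5)
low_121, high_121 = correlation_ci(0.43, 121)
print(round(low_5, 2), round(high_5, 2))
print(round(low_121, 2), round(high_121, 2))
```

With five sections the interval spans both large negative and large positive values, while with 121 sections it stays well above zero. Under these assumptions, a single small study can plausibly yield almost any correlation.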
The other methodological variables had no effect on the size of the rating/
achievement correlation. Studies using objective achievement examinations produced
results similar to those of studies using essay tests. Whether a departmental or a
standardized examination measured student achievement did not make a difference
in effect sizes. Nor did the degree of teacher autonomy affect study outcomes. It did
not matter whether teachers were responsible for all instruction or only a component
of instruction. Finally, overall study quality, which was based on a composite of all
methodological variables, did not significantly relate to the magnitude of correlational
effect sizes.
The methodological features discussed above have implications for the design and
interpretation of multisection validity studies. The meta-analysis showed that it is
important to set controls for certain extraneous influences. For instance, we should
be cautious in generalizing from results of studies that have not controlled for factors
such as the timing of ratings and evaluation bias. We should also place more
confidence in results of studies that use an adequate number of sections. Not
controlling for other potential extraneous factors such as student section selection,
rating instrument bias, and differences in initial student ability did not seem to
threaten the external validity of the sample of studies included in the present meta-analysis.
Although year of publication ranged over a 30-year period, most of the studies had
been conducted within the last decade. More recent studies did not produce effects
greatly different from those of earlier studies. Nor was publication source associated
with effect size. Studies published in journals and unpublished studies reported
similar findings.
Bendig, A. W. The relation of level of course achievement to students' instructor course ratings
in introductory psychology. Educational and Psychological Measurement, 1953, 13, 437-448.
Costin, F., Greenough, W. T., & Menges, R. J. Student ratings of college teaching: Reliability,
validity, and usefulness. Review of Educational Research, 1971, 41, 511-535.
Crooks, T. J., & Smock, H. R. Student ratings of instructors related to student achievement.
Urbana, Ill.: Office of Instructional Resources, University of Illinois, 1974.
Doyle, K. O. Student evaluation of instruction. Lexington, Mass.: D. C. Heath, 1975.
Doyle, K. O., & Crichton, L. I. Student, peer, and self-evaluations of college instructors. Journal
of Educational Psychology, 1978, 70, 815-826.
Doyle, K. O., & Whitely, S. E. Student ratings as criteria for effective teaching. American
Educational Research Journal, 1974, 11, 259-274.
Elliott, D. N. Characteristics and relationships of various criteria of college and university
teaching. Purdue University Studies in Higher Education, 1950, 70, 5-61.
Ellis, N. R., & Rickard, H. C. Evaluating the teaching of introductory psychology. Teaching of
Psychology, 1977, 4, 128-132.
Endo, G. T., & Della-Piana, G. A validation study of course evaluation ratings. Improving
College and University Teaching, 1976, 24, 84-86.
Feldman, K. A. Grades and college students' evaluations of their courses and teachers. Research
in Higher Education, 1976, 4, 69-111.
Follman, J. Student ratings and student achievement. JSAS Catalog of Selected Documents in
Psychology, 1974, 4, 136. (Ms. No. 791)
Frey, P. W. Student ratings of teaching: Validity of several rating factors. Science, 1973, 182,
Frey, P. W. Validity of student instructional ratings: Does timing matter? Journal of Higher
Education, 1976, 47, 327-336.
Frey, P. W., Leonard, D. W., & Beatty, W. W. Student ratings of instruction: Validation
research. American Educational Research Journal, 1975, 12, 435-447.
Gage, N. L. Students' ratings of college teaching: Their justification and proper use. In N. S.
Glasman & B. R. Killait (Eds.), Second UCSB Conference of Effective Teaching. Santa
Barbara, Calif.: Graduate School of Education and Office of Instructional Development,
University of California, Santa Barbara, 1974.
Gessner, P. K. Evaluation of instruction. Science, 1973, 180, 566-570.
Glass, G. V Primary, secondary, and meta-analysis of research. Educational Researcher, 1976,
5, 3-8.
Glass, G. V Integrating findings: The meta-analysis of research. In L. S. Shulman (Ed.), Review
of research in education (Vol. 5). Itasca, Ill.: F. E. Peacock, 1978.
Glass, G. V, & Smith, M. L. Meta-analysis of research on class size and achievement.
Educational Evaluation and Policy Analysis, 1979, 1, 2-16.
Glass, G. V, & Stanley, J. C. Statistical methods in education and psychology. Englewood Cliffs,
N. J.: Prentice-Hall, 1970.
Greenwood, G. E. et al. A study of the validity of four types of student ratings of college
teaching assessed on a criterion of student achievement gains. Research in Higher Education,
1976, 5, 171-178.
Grush, J. E., & Costin, F. The student as consumer of the teaching process. American
Educational Research Journal, 1975, 12, 55-66.
Guide to DIALOG searching. Palo Alto, Calif.: Lockheed DIALOG Information Retrieval
Service, Lockheed Missiles & Space Company, 1979.
Hall, J. A. Gender effects in decoding non-verbal cues. Psychological Bulletin, 1978, 85, 845-
Hoffman, R. G. Variables affecting university student ratings of instructor behavior. American
Educational Research Journal, 1978, 15, 287-299.
Isaacson, R. L. et al. Dimensions of student evaluations of teaching. Journal of Educational
Psychology, 1964, 55, 344-351.
Kulik, C-L., Kulik, J. A., & Cohen, P. A. Instructional technology and college teaching.
Teaching of Psychology, 1980, 7, 199-205.
Kulik, J. A., Cohen, P. A., & Ebeling, B. J. Effectiveness of programmed instruction in higher
education: A meta-analysis of findings. Educational Evaluation and Policy Analysis, 1980,
2(6), 51-64.
Kulik, J. A., & Kulik, C-L. C. Student ratings of instruction. Teaching of Psychology, 1974, 1,
Kulik, J. A., Kulik, C-L. C., & Cohen, P. A. A meta-analysis of outcome studies of Keller's
personalized system of instruction. American Psychologist, 1979, 34, 307-318. (a)
Kulik, J. A., Kulik, C-L. C., & Cohen, P. A. Research on audio-tutorial instruction: A meta-
analysis of comparative studies. Research in Higher Education, 1979, 11, 321-341. (b)
Kulik, J. A., Kulik, C-L. C., & Cohen, P. A. Effectiveness of computer-based college teaching:
A meta-analysis of findings. Review of Educational Research, 1980, 50, 525-544.
Kulik, J. A., & McKeachie, W. J. The evaluation of teachers in higher education. In F. N.
Kerlinger (Ed.), Review of research in education (Vol. 3). Itasca, Ill.: Peacock, 1975.
Leventhal, L. Teacher rating forms: Critique and reformulation of previous validation designs.
Canadian Psychological Review, 1975, 16, 269-276.
Leventhal, L., Abrami, P., & Perry, R. Bogus evidence for the validity of student ratings. Paper
presented at the annual meeting of the American Psychological Association, San Francisco,
August 1977. (ERIC Document Reproduction Service No. ED 150 510)
Leventhal, L. et al. Section selection in multi-section courses: Implications for the validation
and use of teacher rating forms. Educational and Psychological Measurement, 1975, 35, 885-
Marsh, H. W. Research on students' evaluations of teaching effectiveness: A reply to Vecchio.
Instructional Evaluation, 1980, 4(2), 5-13.
Marsh, H. W., Fleiner, J., & Thomas, C. S. Validity and usefulness of student evaluations of
instructional quality. Journal of Educational Psychology, 1975, 67, 833-839.
Marsh, H. W., & Overall, J. U. Validity of students' evaluations of teaching effectiveness:
Cognitive and affective criteria. Journal of Educational Psychology, 1980, 72, 468-475.
Mauger, P. A., & Kolmodin, C. A. Long-term predictive validity of the Scholastic Aptitude
Test. Journal of Educational Psychology, 1975, 67, 847-851.
McKeachie, W. J. Student ratings of faculty: A reprise. Academe, 1979, 65, 384-397.
McKeachie, W. J., Lin, Y-G., & Mann, W. Student ratings of teacher effectiveness: Validity
studies. American Educational Research Journal, 1971, 8, 435-445.
Mintzes, J. J. Field test and validation of a teaching evaluation instrument: The Student Opinion
of Teaching. Windsor, Ontario: University of Windsor, 1977. (ERIC Document Reproduction
Service No. ED 146 185)
Morsh, J. E., Burgess, G. G., & Smith, P. N. Student achievement as a measure of instructor
effectiveness. Journal of Educational Psychology, 1956, 47, 79-88.
Murdock, R. P. The effect of student ratings of their instructor on the student's achievement and
rating. Salt Lake City, Ut.: University of Utah, 1969. (ERIC Document Reproduction Service
No. ED 034 715)
Peterson, P. L. Direct instruction reconsidered. In P. L. Peterson & H. J. Walberg (Eds.),
Research on teaching. Berkeley, Calif.: McCutchan, 1979.
Rankin, E. F., Greenmum, R., & Tracy, R. J. Factors related to student evaluations of a college
reading course. Journal of Reading, 1965, 9, 10-15.
Remmers, H. H., Martin, F. D., & Elliott, D. N. Are students' ratings of instructors related to
their grades? Purdue University Studies in Higher Education, 1949, 66, 17-26.
Reynolds, D. V., & Hansvick, C. Graduate instructors who grade higher receive lower evaluations
by students. Paper presented at the annual meeting of the American Psychological Association,
Toronto, Ontario, September 1978.
Rodin, M., & Rodin, B. Student evaluations of teachers. Science, 1972, 177, 1164-1166.
PETER A. COHEN, Assistant Director OISER, Adjunct Assistant Professor of
Psychology, Dartmouth College, Webster Hall, Hanover, NH 03755. Specializa-
tion: Instructional evaluation; research on college teaching; research synthesis.