STATISTICAL PROCEDURES TO EVALUATE QUALITY OF SCALE ANCHORING

Shelby J. Haberman; Sandip Sinharay; Yi‐Hsuan Lee

Research Report ETS RR–11-02 Statistical Procedures to Evaluate Quality of Scale Anchoring Shelby J. Haberman Sandip Sinharay Yi-Hsuan Lee January 2011 Statistical Procedures to Evaluate Quality of Scale Anchoring Shelby J. Haberman, Sandip Sinharay, and Yi-Hsuan Lee ETS, Princeton, New Jersey January 2011 As part of its nonprofit mission, ETS conducts and disseminates the results of research to advance quality and equity in education and assessment for the benefit of ETS’s constituents and the field. To obtain a PDF or a print copy of a report, please visit: http://www.ets.org/research/contact.html Technical Review Editor: Matthias von Davier Technical Reviewers: Isaac Bejar and Andreas Oranje Copyright © 2011 by Educational Testing Service. All rights reserved. ETS, the ETS logo, GRE, LISTENING. LEARNING. LEADING., and PRAXIS I are registered trademarks of Educational Testing Service (ETS). PRAXIS and TOEFL IBT are trademarks of ETS. Abstract Providing information to test takers and test score users about the abilities of test takers at different score levels has been a persistent problem in educational and psychological measurement (Carroll, 1993). Scale anchoring (Beaton & Allen, 1992), a technique that describes what students at different points on a score scale know and can do, is a tool to provide such information. Scale anchoring for a test involves substantial amount of work, both by the statistical analysts and test developers involved with the test. In addition, scale anchoring involves considerable use of subjective judgment, so its conclusions may be questionable. This paper describes statistical procedures that can be used to determine if scale anchoring is likely to be successful for a test. If these procedures indicate that scale anchoring is unlikely to be successful, then there is little reason to perform a detailed scale anchoring study. The procedures are applied to several data sets from a teacher licensing test. Key words: augmented subscore, distinctness, mean-squared error, proportional reduction in mean-squared error, reliability i Acknowledgments We are grateful to Daniel Eignor, Matthias von Davier, Isaac Bejar, and Andreas Oranje for their helpful advice, to Kevin Larkin and Titus Teodorescu for their help with the data, to Susan Embretson for sending us a computer program, and to Ruth Greenwood and Kim Fryer for editorial help. ii Testing companies are under constant pressure to produce information in addition to the overall test score. Subscores (e.g., Sinharay, Haberman, & Puhan, 2007) are one potential source of such additional information. Another source is information for test takers and test score users concerning the type of tasks examinees at specified score levels are typically able to perform. Although such information might appear to be readily supplied, in practice the task has been a persistent problem in educational and psychological measurement (Carroll, 1993). Testing companies have been investigating solutions to this problem through the development of proficiency scaling procedures and question-difficulty research. Scale anchoring (Beaton & Allen, 1992), which results in descriptions of what students at different points on a score scale know and can do, is a tool to provide such information concerning the relationship between tasks examinee can perform and observed TM test scores. For example, a scale anchoring study for the TOEFL iBT Reading section (Garcia Gomez, Noah, Schedl, Wright, & Yolkut, 2007) found, among other things, that the test-takers who obtain a high score (22-30) in TOEFL iBT Reading typically have a very good command of academic vocabulary and grammatical structure. Scale anchoring has been used with a variety of assessments, including the National Assessment of Educational Progress (NAEP; Beaton & Allen, 1992) and the Trends in International Mathematics and Science Study (TIMSS; Kelley, 2002). The procedure of scale anchoring produces performance-level descriptors or PLDs (Perie, 2008), which describe the level of knowledge and skills required of different performance levels. The U. S. Government’s No Child Left Behind (NCLB) Act of 2001 demands, among other things, that students should receive diagnostic reports that allow teachers to address their specific academic needs; scale anchoring could be used in such a diagnostic report. Some researchers (e.g., Sinharay & Haberman, 2008) recommended consideration of scale anchoring for tests that are under pressure to report additional information, but do not have high-quality subscores. Nonetheless, scale anchoring is not without problems. Linn and Dunbar (1992) described the confusion of the general public about the meaning of NAEP data related to score anchors. They concluded that the reasons for the discrepancy between the percentage 1 of examinees who answer an anchor item correctly and the percentage who score above the corresponding anchor point may be too subtle for mass communication. Phillips et al. (1993) described the potential danger of overinterpreting examinee performance at anchor points so that all examinees at a particular level are assumed to be proficient at all abilities measured at that level. The steps required in scale anchoring are the following: 1. Select a few carefully dispersed points on the score scale (anchor points) that will be anchored. 2. Find examinees who score near each anchor point. 3. Examine each item to see if it discriminates between successive anchor points, that is, if most (greater than 50%) of the students at the higher score levels can answer it correctly and most (less than 50%) of the students at the lower level cannot. 4. Review the items that discriminate between adjacent anchor points to find out if specific tasks or attributes that they include can be generalized to describe the level of proficiency at the anchor point. What students at various scale points know and can do can be summarized this way. The above description shows that scale anchoring involves a statistical component (the first three steps) that identifies items that discriminate between successive points on the proficiency scale using specific item attributes (Beaton & Allen, 1992). These steps are closely related to the common process of item mapping. The fourth step involves generalizations not required in item mapping. Scale anchoring involves a consensus component in which identified items are used by subject-area and educational experts to provide an interpretation of what groups of students at or close to the selected scale points know and can do. This consensus component can be costly (because of the involvement of subject-area and educational experts) and can be quite time-consuming. In addition, the subjective judgment involved may not be reliable. 2 As Beaton and Allen (1992) noted, the scale anchoring process is not guaranteed to result in useful descriptions of the anchor points. A test that is well-designed for its intended purpose may not have sufficient information available to differentiate between performance of examinees at given score levels on items with different attributes. In some cases, this failure may simply reflect lack of a sufficient number of items anchoring at given score levels. It may also be true that the items at an anchor level are too dissimilar to interpret. Therefore, before performing an exhaustive scale anchoring study, it may be beneficial if a set of simple statistical analyses can be performed to find out if a scale anchoring will provide useful results. This paper suggests such a set of analyses—they include simple regression analysis and fitting of several popular item response theory (IRT) models. The next section discusses our suggested set of techniques and describes why they are appropriate. The techniques are applied to several data sets from a teacher licensing test in the application section. Conclusions and recommendations are provided in the last section. 1 Methods to Predict Success of Scale Anchoring The description of scale anchoring given in the previou section indicates that scale anchoring can only succeed (which means that it can provide useful information to the examinees) if, for each pair of successive anchor points (which correspond to a small range of ability or difficulty level of items, for example, a range of proportion correct of 0.60 to 0.75), there are items with specific attributes that most students at the lower point cannot answer but most students at the higher point can, that is, the items are highly discriminating at specific levels of difficulty. Thus scale anchoring can only succeed if item attributes can predict item difficulties to an adequate degree and if item discriminations associated with these item attributes are high. If item attributes do not predict item difficulties well, then the items discriminating between adjacent anchor points will not be readily interpreted in terms of item attributes. Unless item discriminations are consistently high, it is also necessary for item attributes to predict item discrimination. In Step 4 of 3 the description of scale anchoring, the required generalizations will not be feasible. Hence, the key to our suggested techniques is an examination of how well item attributes predict item difficulties and item discriminations. A mathematical proof of why it is necessary and sufficient for the item attributes to predict item difficulties and item discriminations for the success of scale anchoring is given towards the end of this section. The techniques that will be suggested here assume availability of test data concerning item attributes along with variables used in test development to characterize items in terms of features such as domain covered or type of tasks covered. Such variables are usually available because test developers use them to create test forms that conform to specifications. A common problem will be that the test design is likely not to be optimal for the purpose of inferences concerning item attributes. This issue will receive further attention in the concluding section. The first technique that can be used is simple linear regression of item statistics (item difficulty or item discrimination) on indicators of appropriate item attributes (see Sheehan & Mislevy, 1994, for examples of such analyses). The squared multiple correlations from these regressions will provide an idea of how well the item statistics can be predicted by the item attributes. The second set of techniques involves fitting of several item-response theory (IRT) models to the data. In the operational data examples considered later, all items are right-scored and n examinees respond to m items. Associated with Item i are item attributes qik , 1 ≤ k ≤ K, for some integer K ≥ 1. The qik are indicator variables, with qik = 1 if Attribute k is present for Item i and qik = 0 otherwise. The response of Examinee s, 1 ≤ s ≤ n, to Item i is Xis , and the latent proficiency parameter of Examinee s is a random variable θs with a standard normal distribution. Conditional on θs , the Xis are mutually independent and the probability that Xis = 1 is pis . The logit of pis is h i pis , so that λis = log 1−p is exp(λis ) . pis = 1 + exp(λis ) All models considered are special cases of the two-parameter logistic (2PL) model M2 in 4 which λis = ai θs − βi (1) for the discrimination ai and the intercept βi of Item i. If ai > 0, then bi = βi /ai is the difficulty of Item i. In the one-parameter logistic (1PL) model M1 , also known as the Rasch model, it is assumed that the discrimination parameter ai is the same for all items. In the zero-parameter logistic (0PL) model M0 , it is assumed that both the discrimination parameter ai and the intercept parameter βi are the same for all items. In the independence model MI (Haberman, 2006), it is assumed that the item discrimination parameter ai is 0 for all items, so that the Xis are mutually independent and pis does not depend on θs . Models M0 , M1 , M2 , and MI do not use the indicator variables qik . In several models, these indicators are employed to predict item parameters. In the linear logistic test model (LLTM; Fischer, 1973) ML , the Rasch model M1 is assumed, and it is assumed that the item intercept satisfies a linear model βi = η1 qi1 + η1 qi2 + · · · + ηK qiK = K X ηk qik (2) k=1 in which ηk represents the effect of Attribute k on the intercept βi of Item i. Model ML reduces to the 0PL model M0 if K = 1 and qi1 = 1 for all Items i. Model ML is the same as the Rasch model M1 if K = m and the m by K matrix Q of qik has rank m. The linear logistic test model has two generalizations to 2PL models. In the constrained 2PL model (Embretson, 1993) MC , it is assumed that the difficulty bi = X γk qik (3) k of Item i satisfies a linear model in which γk represents the effect of Attribute k and the item discrimination satisfies a linear model ai = X τk qik (4) k in which τk represents the effect of Attribute k. If qi1 = 1 for all i and K = 1, then the constrained 2PL model reduces to the 0PL model M0 . In the alternative constrained 2PL model MA , (2) is assumed to hold for some ηk and (4) is assumed to hold for some τk . 5 To compare models, the information-theoretic measure minimum estimated expected log penalty per item (MEELPPI; see, e.g., Gilula & Haberman, 2001; Haberman, 2006) may be employed. For Model Mx , the MEELPPI is obtained as Ĥx = −ℓx /(2nm), where ℓx is the maximum log-likelihood under the model. For example, Ĥ1 is the MEELPPI for Model M1 and ĤA is the MEELPPI under Model MA . Among the models under study, Ĥ0 ≥ ĤL ≥ Ĥ1 ≥ Ĥ2 , Ĥ0 ≥ ĤL ≥ ĤC ≥ Ĥ2 , Ĥ0 ≥ ĤL ≥ ĤA ≥ Ĥ2 , and ĤI ≥ Ĥ1 ≥ Ĥ2 because, for example, M0 is a special case of ML and M1 is a special case of M2 . An LLTM model is most attractive if ĤL is close to ĤA , ĤC , Ĥ1 , and Ĥ2 . The constrained 2PL model is most attractive if ĤC is close to Ĥ2 , and the alternate constrained 2PL model is most attractive if ĤA is close to Ĥ2 . Evaluation of closeness can be considered in terms of relative reduction of MEELPPI and in terms of reductions of MEELPPI per independent parameter. Let Mx have dx independent parameters, so that d0 = 2, dI = m, d1 = m + 1, d2 = 2m, dL = 1 + K, and dC = dA = 2K. If Model Mx implies Model My but the models are not equivalent, then the improvement in MEELPPI per independent parameter is νxy = Ĥx − Ĥy . dy − dx Larger values of νxy are favorable for Model My . If Model Mx implies Model My , and Model My implies Model Mz and the models are not equivalent, then one may examine the relative improvement 2 Rxyz = Ĥx − Ĥy Ĥx − Ĥz to know how My compares to Model Mz , where Mx provides a baseline for comparison 2 2 of My and Mz . Values of Rxyz near 1 are desirable. It is certainly desired that Rxyz be somewhat larger than (dy − dx )/(dz − dx ), so that the gain per independent parameter from Model Mx to Model My is somewhat larger than is the gain per independent parameter from Model My to Model Mz . For example, consider an evaluation of the linear logistic test model (Model ML ). Consider a comparison to the Rasch model (Model M1 ) where the 0PL model (Model M0 ) provides a baseline for comparison. Assume that 0 < K < m. It is desirable that ν0L = (K − 1)−1 (Ĥ0 − ĤL ) 6 be somewhat larger than is νL1 = (m − K)−1 (ĤL − Ĥ1 ). It is also desirable that 2 R0L1 = Ĥ0 − ĤL Ĥ0 − Ĥ1 be somewhat larger than (K − 1)/(m − 1). Favorable results suggest some ability to predict item intercept by use of item attributes. In the Rasch case, the ability to predict item intercept is equivalent to the ability to predict item difficulty. Similar arguments can be applied to the constrained 2PL model MC or the alternate constrained 2PL model MA . In the case of MC , it is important to examine ν0C = [2(K − 1)]−1 (Ĥ0 − ĤC ) be somewhat larger than νC2 = [2(m − K)]−1 (ĤC − Ĥ2 ). It is also desirable that 2 R0C2 = Ĥ0 − ĤC Ĥ0 − Ĥ2 be somewhat larger than (K − 1)/(m − 1). Favorable results suggest some ability to predict item difficulty and item discrimination from item attributes. In principle, it is possible to apply chi-square tests to compare models. Let Model Mx imply Model My , and let Models Mx and My not be equivalent. If Model Mx holds, then the likelihood-ratio chi-square statistic L2xy = 2nm(Ĥx − Ĥy ) has an approximate chi-square distribution on dy − dx degrees of freedom. In large samples, L2xy will be quite large even if the deviation of Model x from the data is small, so that this approach is not very helpful in practice. In all cases in this report, L2xy is highly significant. To discuss the relationship of model parameters in the 2PL model, let us consider an item that anchors at θs = ω. Then, from the earlier description of scale anchoring, the probability of a correct response is at least p for θs = ω and no more than q < p for θs = υ < ω. That means ai ω − βi ≥ log[p/(1 − p)], 7 and ai υ − βi ≤ log[q/(1 − q)]. The above inequalities imply that the discrimination parameter ai must be at least log[p/(1 − p)] − log[q/(1 − q)] , ω−υ (5) which indicates that an item with a very low discrimination parameter may not anchor at all. Given ai > 0, the intercept parameter βi must be between ai υ − log[q/(1 − q)] and ai ω − log[p/(1 − p)], so that the difficulty parameter bi must be between υ − a−1 i log[q/(1 − q)] and ω − a−1 i log[p/(1 − p)]. Suppose that the 2PL model is a reasonable approximation to the data and the item discrimination parameter ai is sufficiently large that (5) holds. For example, if ω = 0.5, υ = 0, p = 0.6, and q = 0.4, then the discrimination must be at least 1.6. In addition, unless ai is somewhat larger than 1.6, the interval for the item difficulty will be very narrow. If scale anchoring is informative for this data set, that means that this item and a few other items that anchor at θs = ω possess a few specific item attributes. That in turn implies that these item attributes determine the above mentioned bounds on ai and bi , or, in other words, that the item attributes predict item discrimination and item difficulty. On the other hand, if the item attributes predict item discrimination and item difficulty adequately, the above mentioned bounds will be associated with a few specific item attributes; these item attributes are then associated with θs = ω, which means that scale anchoring is informative for this data set. Thus, a necessary and sufficient condition 8 for scale anchoring to be informative is that item attributes predict item discrimination and item difficulty adequately. In typical cases, adequacy involves item difficulty more than item discrimination, for the requirement on item discrimination involves sufficiently high discrimination while the requirement on difficulty involves falling within the proper range. When scores are obtained in terms of number of items correct, one may still compare distribution functions of θs and the number of items correct in order to obtain the appropriate analysis in terms of the θs parameter. There has been a substantial research on prediction of item discrimination and item difficulty from item attributes. Simple regression models and tree-based regression models have been applied to examine prediction of item difficulty from item attributes for TM several tests such as Praxis , GRE R , and NAEP reading (e.g., Sheehan & Mislevy, 1994; Sheehan, Kostin, & Persky, 2006; Wainer, Sheehan, & Wang, 1998). These studies show low to moderate amount of success in predicting item difficulty from item attributes. For example, Sheehan and Mislevy (1994) reported that item attributes explained between 20 and 40% of the variance in item difficulty and between 4 and 14% of the variance in item discrimination for 510 pretest items from a Praxis I R test that measures mathematics, reading, and writing; Sheehan et al. (2006) reported that item attributes explained between 14 and 50% of the variance in item difficulty for NAEP Reading. Nonetheless, it should be emphasized that the reported values were not examined by either cross-validation or by rigorous statistical analysis designed to adjust for the effects of selection bias. In this report, in addition to the IRT analysis, conventional regression analysis is performed to predict basic item statistics. 2 Application Data From a Scale Anchoring Study A scale anchoring study was recently performed using four forms of a teacher licensing test in mathematics. The least and largest possible scaled scores for the test are 150 and 190. The four anchor levels considered were 150 to 168, 169 to 173, 174 to 178, and 179 to 190. The score 169 is the least passing score among the states that use the test, 178 9 is the largest passing score, and 173 and 174 lie approximately midway between the least and largest passing scores. For the anchor level i, i = 2, 3, 4, an item anchored if: • At least 65% of examinees scoring in the range defined by the anchor level i answered the item correctly. • At most, 50% of examinees scoring in the range defined by the anchor level i − 1 answered the item correctly. Because the above criteria led to few items being anchored, items that meet a less stringent set of criteria were also identified. The criteria to identify items that almost anchored were the following: • At most, 60% of examinees scoring in the range defined by the anchor level i − 1 answered the item correctly. • The difference between the percentage of examinees in the range defined by anchor level i that answered the item correctly and the percentage of examinees in the range defined by anchor level i − 1 that answered the item correctly is at least 15%. To further supplement the pool of items, those that met only the criterion of at least 65% of the students answered correctly (regardless of the performance of examinees at the next lower level) were identified. The three categories of items, shown in Table 1, ensure that there were enough items available to inform the descriptions of examinee achievement at the anchor levels. The next step was the consensus component where the subject-area experts (that is, the test developers) reviewed the items that anchored and tried to interpret the results. The outcome of the scale anchoring procedure were statements such as that the examinees in Group 2 can (a) order positive integers, (b) follow simple directions (two steps or fewer), and so on. The participants of the consensus component of the study found the component to be quite tedious and they often struggled to come up with a meaningful list of skills at any anchor level. 10 Table 1 Number of Items That Anchored Anchor Anchored Almost level anchored 2 7 8 3 2 6 4 25 22 Total 34 36 Met the Total 65% criterion 14 29 13 21 10 57 37 107 Note. The total number of items in the four forms is 160. Results Test developers classify each item in the test into one of two classifications (referred to as IT) based on item type (pure or real) and one of five classifications (IC) based on item content (algebra, data analysis and probability, geometry, measurement, numbers and operations). These classifications, along with several other classifications, are used by the test developers to assemble test forms that conform to specifications. We had the IT and IC classifications available for all items in Forms 1 to 4. In addition, for only one of the four test forms (referred to as Form 1), we obtained a table that shows a list of 63 attributes (for example, one attribute is whether the item has a stimulus such as a table/figure or not) and the attributes (out of these 63) that apply to each item—the content experts created this table during the scale anchoring procedure. Results from the fitting of simple regression models. We fitted the 2PL model to data from Forms 1 to 4. Then, for each form, we used a simple linear regression model to predict the 40 estimated item difficulty parameters b̂i and the estimated item discrimination parameters âi from indicators of the IT and IC classifications. To avoid linear dependence of indicator variables, only five of the seven indicator variables plus a constant predictor can be employed. The regression model performed quite poorly. The F statistics provided no indication that any relationship between the dependent variables and the indicator variables existed. The squared multiple correlation coefficient R2 ranged between 0.05 and 0.16 for the model predicting estimated item difficulty, and between 11 0.03 and 0.30 for the model predicting estimated item discrimination. Similar results are obtained if, instead of the estimated difficulty and discrimination parameters, item proportions correct and item R-biserial correlations are used as the response variables in the regressions. Note that, if IT and IC classification have no effect on the dependent variable and if the dependent variable is normally distributed, then the R2 statistic has a mean of 5/39 = 0.13 and a standard deviation of (5/2)[(39 − 5)/2] (39/2)2 (1 + 39/2) 1/2 = 0.07, the probability is 0.95 that R2 is no greater than 0.27, and the probability is 0.99 that R2 is no greater than 0.35 (Rao, 1973, chapter 3). Thus no evidence exists that the IT and IC classifications are useful in predicting the four item statistics estimated item difficulty, estimated item discrimination, proportion correct, and R-biserial correlation. This conclusion reflects two considerations. An R2 of 0.3 or less does not indicate much ability to predict an item attribute. In addition, in view of the eight R2 statistics examined, the fact that the largest is about 0.30 provides no clear evidence that any relationship at all exists between item difficulty and item discrimination on the one hand and the IC and IT attributes on the other hand. For Form 1, we performed a stepwise linear regression (Draper & Smith, 1998, chapter 15) to predict the estimated item difficulty parameters and the estimated item discrimination parameters from the indicators of the 63 item attributes. The trivial indicator function with value 1 for all items was always included. Variables were added one by one to the model only if the F statistic for a variable was significant at the 0.15 level (the default value in SAS version 9.2 for stepwise linear regression). The same criterion was used for removal of variables. At first glance, the results might appear more promising than for the regressions on IT and IC classification. The algorithm picked six nontrivial attributes out of the possible 63 in predicting the estimated item difficulty parameters, and the resulting R2 statistic was 0.43. In the case of item slope parameters, eight nontrivial item attributes were chosen, and the resulting R2 was 0.64. Only one nontrivial item attribute was included in both the final model for item discrimination and the final model 12 for item difficulty. Nevertheless, cross-validation shows that the apparently high R2 values are a deceptive artifact of the fact that the stepwise regression procedure, when applied with a level of α, has an actual level that is much larger than α and tends to admit more predictors than is appropriate; see, for example, Draper and Smith (1998, pp. 342-343). To examine this issue, a cross-validation procedure was employed in which a series of stepwise regressions were employed in which one item i was removed. The regression without Item i was then used to obtain a prediction Ỹi of the value Yi of the dependent variable for Item i, where Yi is either âi or b̂i . The estimated mean-squared error was given by σ̃e2 =m −1 m X (Yi − Ỹi )2 . i=1 This mean-squared error was then compared to the estimated mean-squared error obtained from the same cross-validation procedure by prediction of Yi by the arithmetic mean Ȳi of the observations Yj , j 6= i. This mean-squared error is σ̃t2 =m −1 m X (Yi − Ȳi )2 = [m/(m − 1)]s2 , i=1 where s is the sample standard deviation of Yi , 1 ≤ i ≤ m (Haberman & Sinharay, 2008). The proportional reduction of mean-squared error from use of the stepwise regression rather than a constant predictor is then R̃2 = 1 − σ̃e2 /σ̃t2 . The observed values of R̃2 were −3.70 for item difficulty and −2.16 for item discrimination, so that the results of the stepwise regression could reasonably be regarded as much worse than useless. An alternative approach to stepwise regression can be adopted with a much stricter criterion for entry and removal of variables based on the Bonferroni inequality. To ensure that the probability is no greater than 0.15 that a variable will be entered at all if the dependent variable is independent of the independence variables and the dependent variable has a normal distribution, one requires a significance level of 0.15/63 = 0.00238 (Draper & Smith, 1998, p. 142). When a level of 0.00238 was used, no indicators of item attributes were entered at all for either item discrimination or item difficulty. Note that 13 were tree regression applied in this example and were the Bonferroni approach used, it is also true that no variables would be entered, so that no tree construction would occur. Criteria for tree branching comparable to those for stepwise regression would encounter the same problems of cross-validation found with stepwise regression. Results from the fitting of IRT models. We fitted Models MI , M0 , M1 , M2 , ML , MC , and MA to Forms 1 to 4. Results for MA are essentially the same as for MC , so that they are not reported. In the case of the LLTM (ML ) and the constrained 2PL model (MC ), a model based on the six linearly independent IC and IT indicators was employed for all four forms. In addition, for Form 1, models ML and MC were applied with 14 indicator functions. One indicator was 1 for all items, and the other indicator functions were those used in the final model from either the stepwise regression for item difficulty or the stepwise regression for item discrimination. Table 2 shows the values of MEELPPI for Form 1. Each row corresponds to a model. The table shows, for each model, the following quantities: • The number of parameters • MEELPPI • The correlation between the proportion correct p+ and the estimated difficulty from the model (denoted as Cor(b̂, p+) in the table) • (For only the LLTM and constrained 2PL model.) The correlation between the estimated difficulty from the model and the estimated difficulty from the corresponding unrestricted model (which is the Rasch model for the LLTM and the 2PL model for the constrained 2PL model). The correlation is denoted as Cor(b̂, b̂). • The correlation between the item R-biserial coefficient Rbis and the estimated discrimination from the model (Cor(â, Rbis )) • (For only the constrained 2PL model.) The correlation between the estimated discrimination from the model and the estimated discrimination from the 2PL model (Cor(â, â2 )) 14 Table 2 Minimum Estimated Expected Log Penalty per Item (MEELPPI) for the Different Models for Form 1 Model 0PL Independence Rasch LLTM-IT&IC LLTM-Stepwise 2PL 2PL-C-IT&IC 2PL-C-Stepwise Number of MEELPPI Cor(b̂, p+) Cor(b̂, b̂) parameters 2 0.6328 40 0.5927 41 0.5366 -0.98 7 0.6267 -0.26 0.26 15 0.5851 -0.73 0.70 80 0.5313 -0.99 12 0.6256 -0.25 0.26 28 0.5746 -0.50 0.49 Cor(â, Rbis ) Cor(â, â2 ) 0.92 0.17 0.35 0.04 0.33 It should be noted that regression results on item intercepts and item difficulties are comparable, so that the results for item intercepts and for the alternative constrained 2PL model are not reported. Interpretation of Table 2 is straightforward, except for the models based on item attributes from stepwise regression. The 2PL model is a bit more successful than is 2 the Rasch model, but the difference is small. The R012 statistic is 0.95, so that the preponderance of the improvement in MEELPPI from the 0PL to the 2PL model is obtained from the transition from the 0PL to the Rasch model. This result and the observed differences in MEELPPI are relatively common in educational tests (Haberman, 2005, 2007). Note that d2 − d1 = d1 − d0 = 39, so that the improvement in MEELPPI per independent parameter is ν01 = 0.0025 for the comparison of the 0PL and Rasch models and ν12 = 0.0001 for the comparison of the Rasch and 2PL models. The LLTM based on the IC 2 and IT attributes is relatively unsuccessful. The R0L1 statistic is only 0.06, so that relatively little of the improvement from the 0PL to the Rasch model is explained by the LLTM. In addition, ν0L = 0.0012 is a somewhat smaller improvement of MEELPPI per independent parameter for the comparison of the 0PL model and LLTM than the corresponding value νL1 = 0.0027 from comparison of the LLTM to the Rasch model. Similar comments apply to the constrained 2PL model based on the IC and IT classifications. A notable feature is 15 that the LLTM and constrained 2PL model are both less successful than the independence model. The LLTM based on the item attributes from stepwise regression is not very successful, but it appears more successful than the LLTM based on the IC and IT 2 attributes. The R0L1 statistic is 0.50, ν0L = 0.0037, and νL1 = 0.0019, so that the LLTM is substantially less effective than the full Rasch model, but it does reduce MEELPPI per independent parameter compared to the 0PL model somewhat better than in the case of the LLTM based on IC and IT. Results for the constrained 2PL case are somewhat similar. Nonetheless, a substantial selection bias is involved due to the choice of item attributes by stepwise regression. To check this issue, 20 additional LLTMs were considered in which one item attribute indicator was 1 for each item and 13 item attribute indicators were selected at random from the 63 available indicators for item attributes. The additional restriction was imposed that the number of independent parameters be 14. For each combination of 13 nontrivial indicators, the MEELPPI was computed along with the R2 statistics for prediction of item difficulty from the indicator variables. The sample mean of the MEELPPI statistics was 0.6101, and the sample standard deviation was 0.0063. The smallest MEELPPI observed from the 20 additional models was 0.5966, and the 2 was 0.38, so that the results of stepwise regression were a bit corresponding value of R0L1 better than those typically derived by a random use of a comparable number of indicator variables for item attributes. On the other hand, some reason still exists for concern about the reality of even the modest result achieved from the stepwise regression. A regression of MEELPPI on R2 for the 20 models yields an estimated regression line of 0.6305 − 0.0820R2 for estimation of MEELPPI. The corresponding coefficient of determination is 0.88. The R2 for the 13 nontrivial item attributes from stepwise regression is 0.52, so that the regression predicts an MEELPPI of 0.5880, a close approximation to the observed 0.5851. Thus it is quite plausible that the results based on stepwise regression merely reflect the tendency of the stepwise regression procedure to admit more predictors than is appropriate. Similar remarks also apply to the constrained 2PL case. Table 3 provides an analysis for Form 2 that is quite comparable to the analysis for 16 Form 1, except that the 63 item attributes were not available. The results for Forms 3 and 4 are similar to those for Form 2 and are not shown here. The MEELPPI for the LLTM and the constrained 2PL model are much larger than that for the Rasch and 2PL models, respectively, and are even larger than that for the independence model. These results, like the regression results above, show that IT and IC classifications are poor predictors of either item difficulty or item discrimination. Table 3 Minimum Estimated Expected Log Penalty per Item (MEELPPI) for the Different Models for Form 2 Model 0PL Independence Rasch LLTM-IT&IC 2PL 2PL-C-IT&IC Number of MEELPPI Cor(b̂, p+) Cor(b̂,b̂) parameters 2 0.6282 40 0.5741 41 0.5175 -0.98 7 0.6161 -0.34 0.33 80 0.5137 -0.98 12 0.6159 -0.34 0.36 Cor(â, Rbis ) Cor(â,â2 ) 0.86 0.03 0.00 It is reasonable to conclude that the available item attributes for the four forms provide no basis for scale anchoring. It is no wonder then that the consensus component of the scale anchoring process was found tedious by the participants. Conclusions This paper describes a set of simple statistical and psychometrics techniques that can be used to examine if a scale anchoring study will come up with useful information. The techniques involve fitting of simple linear regression and IRT models to examine whether appropriate item attributes can predict the item difficulty and item discrimination. The application of the techniques to four forms of a teacher licensing examination show that the item attributes do not predict the item difficulty and item discrimination adequately for these data. So scale anchoring is not expected to provide much useful information to the 17 examinees for this test. The discouraging results for the example considered do not necessarily imply that the same results will always be observed, but they certainly indicate that success in scale anchoring is far from guaranteed. Presumably the adequacy of the list of item attributes possessed by the items is a key to the set of the techniques suggested. Such a list can be found in the test blueprint used by the test developers to build test forms, or such a list can be produced from scale anchoring another form of the same test or a similar test. It is possible that our suggested techniques performed with a set of available attributes show that a scale anchoring study will fail to elicit useful information, but, later, in a scale anchoring study, the content experts come up with a different list of item attributes to describe the anchor levels. However, in our opinion, this situation will mostly occur for tests in which the test construction process is not very rigorous, so that test forms are created without careful attention to item attributes. Note that if a testing program intends to report PLDs, several researchers such as Bejar, Braun, and Tannenbaum (2007) have argued that the descriptors should be written early in the test development process and be used in developing test blueprints and item specifications. If that is done, the methodology suggested in this paper can be used in the initial stages of a test construction, probably after a trial administration and before an operational administration. Attempts to report PLDs from a test which was not built to do so usually will not result in much useful information. A further issue is the importance of sample size. Statistical procedures are far more likely to lead to satisfactory results with larger collections of items. Longer tests are thus more attractive targets. In addition, it is reasonable to consider multiple forms, although such a study has to ensure that the item difficulty and item discrimination parameters of the different forms are comparable to each other. 18 References Beaton, A. E., & Allen, N. L. (1992). Interpreting scales through scale anchoring. Journal of Educational Statistics, 17, 191–204. Bejar, I. I., Braun, H., & Tannenbaum, R. (2007). A prospective, predictive and progressive approach to standard setting. In R. Lissitz (Ed.), Assessing and modeling cognitive development in school: Intellectual growth and standard setting (pp. 1–30). Maple Grove, MN: Jam Press. Carroll, J. B. (1993). Test theory and the behavioral scaling of test performance. In N. Fredericksen, R. J. Mislevy, & I. Bejar (Eds.), Test theory for a new generation of tests (pp. 297–322). Hillsdale, NJ: Lawrence Erlbaum. Draper, N. R., & Smith, H. (1998). Applied regression analysis (3rd ed.). New York, NY: John Wiley. Embretson, S. E. (1993). Generating items during testing: Psychometric issues and models. Psychometrika, 64, 407–433. Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359–374. Garcia Gomez, P., Noah, A., Schedl, M., Wright, C., & Yolkut, A. (2007). Proficiency descriptors based on a scale-anchoring study of the new TOEFL iBT reading test. Language Testing, 24, 417–444. Gilula, Z., & Haberman, S. J. (2001). Analysis of categorical response profiles by informative summaries. Sociological Methodology, 31, 129–187. Haberman, S. J. (2005). Latent-class item response models (ETS Research Rep. No. RR-0528). Princeton, NJ: ETS. Haberman, S. J. (2006). An elementary test of the normal 2PL model against the normal 3PL alternative (ETS Research Rep. No. RR-06-14). Princeton, NJ: ETS. Haberman, S. J. (2007). The information a test provides on an ability parameter (ETS Research Rep. No. RR-07-18). Princeton, NJ: ETS. Haberman, S. J., & Sinharay, S. (2008). Sample size requirements for automated essay scoring (ETS Research Rep. No. RR-08-32). Princeton, NJ: ETS. 19 Kelley, D. L. (2002). Application of the scale anchoring method to interpret the TIMSS scales. In D. F. Robitalle & A. E. Beaton (Eds.), Secondary analysis of the TIMSS data (pp. 375–390). Amsterdam, the Netherlands: Springer-Verlag. Linn, R. L., & Dunbar, S. (1992). Issues in the design and reporting of the national assessment of educational progress. Journal of Educational Measurement, 29, 177–194. Perie, M. (2008). A guide to understanding and developing performance-level descriptors. Educational Measurement: Issues and Practice, 27, 15–29. Phillips, G. W., Mullis, I. V. S., Bourque, M. L., Williams, P. L., Hambleton, R. K., Owen, E. H., & Barton, P. E. (1993). Interpreting NAEP scales (NCES 93421). Washington, DC: National Center for Education Statistics, US Department of Education. Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York, NY: John Wiley. Sheehan, K., Kostin, I., & Persky, H. (2006, April). Predicting item difficulty as a function of inferential processing requirements: An examination of the reading skills underlying performances on the NAEP Grade 8 reading assessment. Paper presented at the annual meeting of the National Council of Measurement in Education, San Francisco, CA. Sheehan, K., & Mislevy, R. J. (1994). A tree-based analysis of items from an assessment of basic mathematics skills (ETS Research Rep. No. RR-94-14). Princeton, NJ: ETS. Sinharay, S., & Haberman, S. J. (2008). How much can we reliably know about what examinees know? Measurement, 6, 46–49. Sinharay, S., Haberman, S. J., & Puhan, G. (2007). Subscores based on classical test theory: To report or not to report. Educational Measurement: Issues and Practice, 26, 21–28. Wainer, H., Sheehan, K., & Wang, X. (1998). Some paths toward making praxis scores more useful (ETS Research Rep. No. RR-98-44). Princeton, NJ: ETS. 20

RELATED PAPERS

RELATED TOPICS

Log In

Statistical Procedures to Evaluate Quality of Scale Anchoring

Statistical Procedures to Evaluate Quality of Scale Anchoring

Related Papers

RELATED PAPERS

RELATED TOPICS