Hongli Li
  • CEHD Rm 456
    Georgia State University
    Atlanta, GA., USA.
The purpose of this study is to review the status of differential item functioning (DIF) research in language testing, particularly as it relates to the investigation of sources (or causes) of DIF, which is a defining characteristic of the third generation DIF. This review included 110 DIF studies of language tests dated from 1985 to 2019. We found that DIF researchers did not address sources of DIF more frequently in recent years than in earlier years. Nevertheless, DIF research in language testing has expanded with new DIF analysis procedures, more grouping variables, and more diversified methods for investigating sources of DIF. In addition, in the early years of DIF research, methods to identify sources of DIF relied heavily on content analysis. This review showed that while more sophisticated statistical procedures have been adopted in recent years to address sources of DIF, understanding sources of DIF still remains a challenging task. We also discuss the pros and cons of existing methods to detect sources of DIF and implications for future investigations.
Peer assessment is increasingly being used as a pedagogical tool in classrooms. Participating in peer assessment may enhance student learning in both cognitive and non-cognitive aspects. In this study, we focused on non-cognitive aspects by performing a meta-analysis to synthesize the effect of peer assessment on students’ non-cognitive learning outcomes. After a systematic search, we included 43 effect sizes from 19 studies, which mostly involved learning strategies and academic mind-sets as non-cognitive outcomes. Using a random effects model, we found that students who had participated in peer assessment showed a 0.289 standard deviation unit improvement in non-cognitive outcomes as compared to students who had not participated in peer assessment. Further, we found that the effect of peer assessment on non-cognitive outcomes was significantly larger when both scores and comments were provided to students or when assessors and assessees were matched at random. Our findings can be used as a basis for further investigation into how best to use peer assessment as a learning tool, especially to promote non-cognitive development.
Background: The aim of this study was to investigate the relationship between executive function (EF), stuttering, and comorbidity by examining children who stutter (CWS) and children who do not stutter (CWNS) with and without comorbid conditions. Data from the National Health Interview Survey were used to examine behavioral manifestations of EF, such as inattention and self-regulation, in CWS and CWNS. Methods: The sample included 2258 CWS (girls = 638; boys = 1620) and 117,725 CWNS (girls = 57,512; boys = 60,213). EF and the presence of stuttering and comorbid conditions were based on parent report. Descriptive statistics were used to describe the distribution of stuttering and comorbidity across group and sex. Regression analyses were used to determine the effects of stuttering and comorbidity on EF, and the relationship between EF and socioemotional competence. Results: Results point to weaker EF in CWS compared to CWNS. Having comorbid conditions was also associated with weaker EF. CWS with comorbidity showed the weakest EF compared to CWNS with and without comorbidity, and CWS without comorbidity. Children with stronger EF showed higher socioemotional competence. A majority (60.32%) of CWS had at least one other comorbid condition in addition to stuttering. Boys who stutter were more likely to have comorbid conditions compared to girls who stutter. Conclusion: Present findings suggest that comorbidity is a common feature in CWS. Stuttering and comorbid conditions negatively impact EF.
With online course delivery on the rise, it is essential to understand the preparedness of students attending traditional universities. Prior research has found that some students struggle in online courses, which leads to a quest to better understand the reason why. Studies of self-regulated learning (SRL) in online and blended courses have added to our understanding. However, few studies have used a person-centered approach to study profiles of SRL in fully online courses, and none with a population of students attending a traditional university. This is of importance, especially at a time when traditional universities are increasingly providing online courses. To address the gaps in previous SRL profile research, the current study examined individual differences in SRL profiles of 477 students attending online courses at a traditional university setting, using the Online Self-regulated Learning Questionnaire (OSLQ). Using latent profile analysis, we found four different profiles, with a majority of the students falling in groups representing lower levels of SRL skills. We also explored the possible relationship of experience in online learning, online comfort, age, and gender with the identified self-regulated learning profiles. Relationships were found between the profiles and comfort level as well as with gender.
Despite increasing pressure for children to learn to write at younger ages, there are many unanswered questions about composition skills in early elementary school. The goal of this research was to examine the dimensionality of composition skills in kindergarten children, thereby adding to current knowledge about the measurement of young children’s writing and its component skills. The writing of 282 kindergarten children was assessed using three different scoring methods. Confirmatory factor analyses were used to investigate the dimensionality of various methods of scoring. Results indicated that a qualitative scoring system and a productivity scoring system capture distinct dimensions of kindergartners’ compositions. A scoring system for curriculum-based measurement (CBM) could not attain acceptable fit, which may suggest that CBM is ill-suited for capturing the important components of composition for kindergartners. This study indicated that the measurement and components of composition in kindergarten may be qualitatively different from the compositions of older children.
A major challenge in research with struggling adult readers is their heterogeneity in reading-related competencies and demographic characteristics. The purpose of this investigation was to identify unique profiles of skill sets among struggling adult readers and explore informative demographic differences between profiles. Using latent class analysis with a sample of 542 struggling adult readers, we uncovered four empirically distinct classes of readers based on their performance on ten assessments of lower-level and higher-level competencies. On all measured competencies, globally impaired readers (n = 123) demonstrated the largest deficits and globally better readers (n = 86) outperformed all other classes. Two intermediate profiles, weak decoders (n = 144) and weak language comprehenders (n = 189), exhibited complementary patterns of strengths and weaknesses on lower-level and higher-level competencies. One-way ANOVA and chi-square tests of difference indicated that the classes differed significantly in terms of reading comprehension performance, age, and language background but not high school completion. Implications for instruction and future research are discussed.
Researchers have been interested in classifying massive open online course (MOOC) students based on their learning behaviors. However, less attention has been paid to the cognitive attributes associated with various learning behaviors. In this study, we propose a conceptual model that links MOOC students’ observable learning behaviors to their latent attributes (i.e., individual learning versus interactive learning). Using students’ behavior data from a MOOC, we performed a cognitive diagnostic analysis to identify the students’ learning profiles and to determine how these profiles related to their course achievement. We found that a large portion of the students performed individual learning whereas only a very small portion of them overtly performed interactive learning. In addition, the students who performed interactive learning were more likely to pass the course with distinction than the students who did not show this attribute. The results of this study have important implications for improving students’ learning in MOOCs. Further, the study provides a good demonstration of how to use clickstream process data for psychometric analysis.
In recent years, there has been an increasing use of peer assessment in classrooms and other learning settings. Despite the prevailing view that peer assessment has a positive effect on learning, across empirical studies the results reported are mixed. In this meta-analysis, we synthesized findings based on 134 effect sizes from 58 studies. Compared to students who do not participate in peer assessment, those who participate in peer assessment show a .291 standard deviation unit increase in their performance. Further, we performed a meta-regression analysis to examine the factors that are likely to influence the peer assessment effect. The most critical factor is rater training. When students receive rater training, the effect size of peer assessment is substantially larger than when students do not receive such training. Computer-mediated peer assessment is also associated with greater learning gains than is paper-based peer assessment. A few other variables also show noticeable, although not statistically significant, effects. The results of the meta-analysis can be considered by researchers and teachers as a basis for determining how to make effective use of peer assessment as a learning tool.
The Classroom Assessment Scoring System (CLASS) has been used extensively to measure teacher-student interactions and classroom quality. With a theoretical foundation rooted in the developmental theory of learning, CLASS has three primary domains—Emotional Support, Classroom Organization, and Instructional Support. In this study, we performed a meta-analysis of the factor structure of CLASS using Cheung’s two-stage structural equation modeling (TSSEM) approach. After searching the literature, we obtained 26 correlation matrices of the 10 dimensions shared by multiple versions of CLASS. This meta-analysis supports the three-factor model initially proposed by CLASS developers. The finding of this meta-analysis provides important evidence pertinent to the CLASS factor structure and has significant implications regarding the interpretation and use of CLASS scores.
This study explored the relations between reading comprehension and two memory capacities, short‐term memory (STM) and working memory (WM), for adults who read between the third and eighth grade levels. With a sample of 407 adults from two countries, we computed correlations among measures and conducted hierarchical regression and commonality analyses for reading comprehension. Reading comprehension had moderate positive correlations with STM and WM. Additionally, STM and WM jointly accounted for approximately 19% of the reading comprehension variance and uniquely contributed approximately 4% and 7% of the variance, respectively. The predictive utility of memory to reading comprehension was greatly reduced after controlling for age, word reading, fluency and oral vocabulary. WM appears to be a slightly stronger predictor of reading comprehension than STM for struggling adult readers. However, the overall contributions of memory capacities to reading comprehension are much smaller than those of reading‐related skills.
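The commonality figures in this abstract imply a shared component that is not stated directly. As a rough sketch, assuming the standard two-predictor commonality decomposition (joint R² = unique STM + unique WM + common) and using the approximate percentages reported above, the variance shared by STM and WM can be recovered as follows. The function name is illustrative, not taken from the study.

```python
def common_variance(joint_r2, unique_a, unique_b):
    """Two-predictor commonality decomposition: R2 = U_a + U_b + C.

    Returns C, the variance explained jointly by both predictors.
    """
    return joint_r2 - unique_a - unique_b


# Approximate proportions of reading comprehension variance from the abstract.
joint = 0.19       # STM and WM together
unique_stm = 0.04  # unique to short-term memory
unique_wm = 0.07   # unique to working memory

common = common_variance(joint, unique_stm, unique_wm)
print(f"Variance shared by STM and WM: about {common:.0%}")
```

Because the inputs are rounded ("approximately"), the result should be read as a ballpark figure rather than an estimate reported by the authors.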
The role of measuring functional impairment holds an important place in research, clinical practice, and service provision for children and adolescents. Responding to the growing need to measure serious emotional disturbances at the local, state, and national level, the Columbia Impairment Scale (CIS) was developed in the early 1990s and has remained one of the several popular scales for assessing functional impairment. However, despite the growing popularity of the instrument in research and practice, only a few studies to date have specifically examined the psychometric properties of the CIS. In this article, we describe the results of the first item response theory analysis of the CIS utilizing nationally representative data from the Medical Expenditure Panel Survey (N = 69,966). The results of our analysis lend support to the essential unidimensionality of the CIS and demonstrate that the scale is most reliable for those who exhibit high levels of functional impairment. Given the psychometric properties of the scale identified by our analysis, we contend that the CIS is a viable measure in the ongoing efforts to establish a national epidemiologic surveillance system to track the prevalence and impact of serious emotional disturbances in children and adolescents.
Classroom observations have been increasingly used for teacher evaluations, and it is important to examine the measurement quality and the use of observation ratings. When a teacher is observed in multiple classrooms, his or her observation ratings may vary across classrooms. In that case, using ratings from one classroom per teacher may not be adequate to represent a teacher’s instructional quality. Drawing on the Measures of Effective Teaching (MET) dataset, this study examined the variation of a teacher’s classroom observation ratings across his or her multiple classrooms. The results indicate that the math classrooms accounted for 4.9% to 14.7% of the variance in the classroom observation ratings and English language arts (ELA) classrooms accounted for 6.7% to 15.5% of the variance in the ratings. The results of this study suggest that teachers’ multiple classrooms should be taken into consideration when classroom observation ratings are used to evaluate teachers in high-stakes settings.
In recent years, students’ test scores have been used to evaluate teachers’ performance. The assumption underlying this practice is that students’ test performance reflects teachers’ instruction. However, this assumption is generally not empirically tested. In this study, we attempted to examine the effect of teachers’ instruction on test performance at the item level. Using the U.S. TIMSS 2011 4th-grade math assessment data, we examined the instructional sensitivity of the items using a hierarchical differential item functioning (DIF) approach. Specifically, we tested whether students who had received instruction on a given item showed significantly better performance on the item than students who had not received such instruction when their overall math ability was controlled for, both with and without controlling for student-level and class-level covariates. The study provided preliminary findings regarding why some items showed instructional sensitivity and shed some light on how to develop instructionally sensitive items. Implications and directions for further research were also discussed.
In applications of cognitive diagnostic models (CDMs), practitioners usually face the difficulty of choosing appropriate CDMs and building accurate Q-matrices. However, functions of model-fit indices that are supposed to inform model and Q-matrix choices are not well understood. This study examines the performance of several promising model fit indices in selecting model and Q-matrix under different sample size conditions. Relative performance between AIC and BIC in model and Q-matrix selection appears to depend on the complexity of data generating models, Q-matrices, and sample sizes. Among the absolute fit indices, MX2 is least sensitive to sample size under correct model and Q-matrix specifications and performs the best in power. Sample size is found to be the most influential factor on model fit index values. Consequences of selecting inaccurate model and Q-matrix in classification accuracy of attribute mastery are also evaluated.
Drawing on the PISA 2009 US dataset, this study examines the relationship between formative assessment and students’ reading achievement using a structural equation modeling approach. We find that formative assessment is positively related to students’ reading achievement directly and indirectly (through teacher–student relationship and attitude toward reading) for all students. The direct relationship between formative assessment and reading achievement is significantly stronger for Black students than for White students, whether or not student SES, gender, and school mean SES are controlled for. The total relationship (the direct plus the indirect relationship) between formative assessment and reading achievement also appears to be stronger for Black students than for White students; however, the difference is not statistically significant whether or not we control for covariates. No significant difference is found between White and Hispanic students in terms of the direct and the total relationship between formative assessment and reading achievement. Using a nationally representative dataset, this study provides empirical evidence that formative assessment is positively related to students’ reading achievement in general. In addition, this study provides preliminary evidence to show the potential of formative assessment to help reduce achievement gaps between Black and White students. The implications and limitations of the study are also discussed.
Cognitive diagnostic models (CDMs) have great promise for providing diagnostic information to aid learning and instruction, and a large number of CDMs have been proposed. However, the assumptions and performances of different CDMs and their applications in regard to reading comprehension tests are not fully understood. In the present study, we compared the performance of a saturated model (G-DINA), two compensatory models (DINO, ACDM), and two non-compensatory models (DINA, RRUM) with the Michigan English Language Assessment Battery (MELAB) reading test. Compared to the saturated G-DINA model, the ACDM showed comparable model fit and similar skill classification results. The RRUM was slightly worse than the ACDM and G-DINA in terms of model fit and classification results, whereas the more restrictive DINA and DINO performed much worse than the other three models. The findings of this study highlighted the process and considerations pertinent to model selection in applications of CDMs with reading tests.
Given the wide use of peer assessment, especially in higher education, the relative accuracy of peer ratings compared to teacher ratings is a major concern for both educators and researchers. This concern has grown with the increase of peer assessment in digital platforms. In this meta-analysis, using a variance-known hierarchical linear modeling approach, we synthesize findings from studies on peer assessment since 1999 when computer-assisted peer assessment started to proliferate. The estimated average Pearson correlation between peer and teacher ratings is found to be .63, which is moderately strong. This correlation is significantly higher when (a) the peer assessment is paper-based rather than computer-assisted; (b) the subject area is not medical/clinical; (c) the course is graduate level rather than undergraduate or K-12; (d) individual work instead of group work is assessed; (e) the assessors and assessees are matched at random; (f) the peer assessment is voluntary instead of compulsory; (g) the peer assessment is non-anonymous; (h) peer raters provide both scores and qualitative comments instead of only scores; and (i) peer raters are involved in developing the rating criteria. The findings are expected to inform practitioners regarding peer assessment practices that are more likely to exhibit better agreement with teacher assessment.
In this study, we examined relationships between the use of test results and U.S. students’ math, reading, and science performance in Programme for International Student Assessment (PISA) 2009. Based on a literature review, we hypothesized that the 16 items in the PISA school questionnaire, which are related to the use of test results, can be categorized according to four factors. We validated this hypothesized factor structure using a confirmatory factor analysis and then obtained composite scores for each factor. As revealed by a multilevel analysis, when student and school demographic variables were controlled for, using test results to hold schools accountable to authority and the public was significantly positively related to students’ performance across all three subjects. No statistically significant relationship, however, was detected between students’ performance and the following uses of test scores: informing parents of their children’s performance, providing information for instructional purposes, and evaluating teachers and principals.
Read-aloud accommodations have been proposed as a way to help remove barriers faced by students with disabilities in reading comprehension. Many empirical studies have examined the effects of read-aloud accommodations; however, the results are mixed. With a variance-known hierarchical linear modeling approach, based on 114 effect sizes from 23 studies, a meta-analysis was conducted to examine the effects of read-aloud accommodations for students with and without disabilities. In general, both students with disabilities and students without disabilities benefited from the read-aloud accommodations, and the accommodation effect size for students with disabilities was significantly larger than the effect size for students without disabilities. Further, this meta-analysis reveals important factors that influence the effects of read-aloud accommodations. For instance, the accommodation effect was significantly stronger when the subject area was reading than when the subject area was math. The effect of read-aloud accommodations was also significantly stronger when the test was read by human proctors than when it was read by video/audio players or computers. Finally, the implications, limitations, and directions for future research are discussed.
Test accommodations have been proposed to help overcome the unfair challenges faced by English Language Learners (ELLs) due to their relatively low English proficiency. A test accommodation is regarded as effective when it improves the test performance of ELLs. However, this improvement raises the question of whether such accommodations give ELLs an unfair advantage. One criterion used in determining a test accommodation’s fairness is that it should only remove the disadvantage that ELLs face in regard to their low language proficiency, without giving ELLs any additional advantages. This criterion is met when the test accommodation does not improve the test performance of the non-ELLs when the same accommodation is applied to them. To determine the fairness and, thus, the validity of test accommodations for ELLs, a meta-analysis using hierarchical linear modeling was conducted to compare the effects of test accommodations on the test performance of ELLs and on that of non-ELLs. The results indicated that test accommodations improved ELLs’ test performance by about 0.156 standard deviation units but did not discernibly influence the test performance of non-ELLs. This meta-analysis, therefore, constitutes evidence to support the fairness and validity of providing test accommodations for ELLs.
A meta-analysis using Hierarchical Linear Modeling (HLM) was conducted to examine the effects of test accommodations on the test performance of English language learners (ELLs). The results indicated that test accommodations improve ELLs' test performance by about 0.157 standard deviations—a relatively small but statistically significant increase. Once the potential predictors that may have contributed to the variance of the effect sizes across studies had been accounted for, only English proficiency was found to be significant. Further, the results indicated that ELLs with a low level of English proficiency benefited much more from test accommodations than did those with a high level of English proficiency. Little difference was observed in regard to other factors such as students' ethnicity, students' grade level, or the subject for which they were being examined. Although previous studies have suggested that linguistic simplification may be more effective than other methods, results from this meta-analysis offered no support for that suggestion.
As engagement with science, technology, engineering, and mathematics (STEM) increases in after-school programs (ASPs), it is important to examine the impact of this engagement on students' academic achievement, STEM participation, and affinity toward STEM. Results of these examinations can offer insights into both best practices that could be replicated and possible poor practices that could be avoided in ASP sites. This study describes the validation process that was undertaken on an instrument developed to measure science-related attitudes, and education and career trajectories of students participating in a STEM-focused ASP. We then use the validated instrument to draw certain conclusions about the impact of the ASP program on the participants. We propose a model for predicting students' notions about the importance of science for their future and a model for predicting students' enactment of science agency. The study and the derived instrument may be useful for those interested in examining the impact of STEM-focused ASPs on students' attitudes and proclivities toward science.
Differential skill functioning (DSF) exists when examinees from different groups have different probabilities of successful performance in a certain subskill underlying the measured construct, given that they have the same ability on the overall construct. Using a DSF approach, this study examined the differences between two native language groups – a group with an East Asian language background and one with a Romance language background – in regard to reading subskills as represented in the Michigan English Language Assessment Battery (MELAB) reading test. Based on a combination of literature review and think-aloud reports from a sample of ESL students, hypotheses on reading subskill differences between the two groups were generated. These hypotheses were tested by first identifying the subskill profile of each examinee in a large MELAB database via the application of a previously determined item-skill Q-matrix to a Fusion Model of cognitive diagnostic modeling. The subskill profiles of the East Asian examinees were then compared against those of examinees with a Romance language background through logistic regression techniques. Some important DSFs were found between the two groups. Based on results of this study, instructional strategies were suggested to address some specific weaknesses in ESL learners’ reading subskills.
With cognitive diagnostic analysis, each examinee receives a multidimensional skill profile expressing whether he/she is a master or nonmaster of each skill measured by the test. Fine-grained diagnostic feedback that facilitates teaching and learning can thus be provided to teachers and students. This study investigated cognitive diagnostic analysis as applied to the Michigan English Language Assessment Battery (MELAB) reading test. The Fusion Model (Hartz, 2002) was used to estimate examinee profiles on each reading subskill underlying the MELAB reading test. With data collected from multiple sources, such as the think-aloud protocol and expert rating, a tentative Q-matrix was initially developed to indicate the subskills required by each item. This Q-matrix was then validated via an application of the Fusion Model using data from the MELAB reading test. Four subskills were found to underlie the test: vocabulary, syntax, extracting explicit information, and understanding implicit information. Examinee skill mastery profiles were produced as the result of the cognitive diagnostic analysis. Finally, issues involved in the cognitive diagnostic analysis of reading tests were discussed, and areas for future research were suggested.
Cognitive diagnostic analyses have been advocated as methods that allow an assessment to function as a formative assessment to inform instruction. To use this approach, it is necessary to first identify the skills required for each item in the test, known as a Q-matrix. However, because the construct being tested and the underlying cognitive processes associated with it are usually not fully understood, establishing a Q-matrix, especially for an existing test, is a challenging task. This study reports the process of constructing and validating a Q-matrix for the reading comprehension section of the Michigan English Language Assessment Battery (MELAB). An initial Q-matrix was first generated based on evidence gathered from related literature, students’ think-aloud protocols, and expert ratings. This initial Q-matrix was then validated empirically by applying the fusion model to a large MELAB data set. A well-supported Q-matrix was produced for potential future diagnostic applications.
Based on different language systems and educational practices of their respective countries, hypotheses were made regarding how 15-year-old students from Shanghai-China and the US might differ in the 5 reading subskills designated in the Programme for International Student Assessment (PISA) when they have the same overall reading ability (i.e., when their overall reading ability is controlled for). A multilevel analysis was conducted to test the hypotheses using the PISA 2009 reading dataset. When we controlled for students' overall reading ability, individual socioeconomic status (SES), and school mean SES, Shanghai-Chinese students performed significantly better in integrating and interpreting than US students. Further, when we controlled for students' overall reading ability and school mean SES, US students showed significantly higher performance in reading non-continuous texts than Shanghai-Chinese students, whereas US students showed significantly lower performance in reading continuous texts. The results of this study can inform reading instruction and learning in the 2 countries.
Minimum sample sizes of about 200 to 250 per group are often recommended for differential item functioning (DIF) analyses. However, there are times when sample sizes for one or both groups of interest are smaller than 200 due to practical constraints. This study attempts to examine the performance of Simultaneous Item Bias Test (SIBTEST), Cochran’s Z test, and log-linear smoothing with these methods in DIF detection accuracy at a number of small-sample and ability distribution combinations. Effects of item parameters and DIF magnitudes are also investigated. Results show that when ability distributions between groups are identical, Type I error for these DIF methods can be adequately controlled at all sample sizes, and their power to detect a large amount of unidirectional DIF can be tolerably high (power > .6) when sample size is not too small (at least 100 per group). When ability distributions are different, Type I error inflation is higher for easier items and larger sample sizes, and power depends on DIF direction. Log-linear smoothing with SIBTEST tends to lower both Type I error rate and power. The effect of smoothing with Cochran’s Z test is not as consistent. Implications of the findings are discussed.
A short version of the Counseling Center Assessment of Psychological Symptoms–62 (CCAPS-62) was created via three studies. The final short version (CCAPS-34), which contains 34 items and 7 subscales, demonstrated good discrimination power, support for the proposed factor structure, strong initial convergent validity, and adequate test–retest stability over 1-week and 2-week intervals.
This study examines how Chinese ESL learners recognize English words while responding to a multiple-choice reading test as compared to Romance language-speaking ESL learners. Four adult Chinese ESL learners and three adult Romance language-speaking ESL learners participated in a think-aloud study with the Michigan English Language Assessment Battery (MELAB) reading test. As indicated by the think-aloud verbal reports, the Chinese ESL learners generally had more difficulty with English vocabulary, probably due to the vast difference between the writing system of Chinese and that of English. They were found to compensate for their deficiencies in vocabulary knowledge by relying extensively on test-taking strategies. The findings of this study are well supported by the cross-linguistic transfer theory and the compensatory nature of reading comprehension. The implications for teaching English vocabulary skills to Chinese ESL learners are also discussed.
Designed to assess college students’ English ability, the College English Test (CET) is regarded as the most influential English test in China. This study investigates students’ perceptions of the impact of the CET on their English-learning practices and their affective conditions. A survey was administered to 150 undergraduate students at a university in Beijing. It was found that students perceived the impact of the CET to be pervasive. In particular, the majority of the respondents indicated that the CET had a greater impact on what they studied than on how they studied. Most of the students surveyed felt the CET had motivated them to make a greater effort to learn English. Many students seemed to be willing to put more effort into the language skills most heavily weighted in the CET. About half of the students reported a higher level of self-efficacy in regard to their overall English ability and some specific English skills as a result of taking or preparing for the CET. However, many students also reported experiencing increased pressure and anxiety in relation to learning English. This study provides important evidence about how the CET influences college students’ English learning in China, and directions for further research are also suggested.
In the present study we examined the ability of American and Chinese undergraduate students to calibrate their understanding of textbook passages translated into their native languages. Students read a series of texts and made predictions of their understanding of each text as well as the number of questions they would be able to answer correctly. Students also made postdictions of their test performance. Chinese students were significantly better than American students in calibrating their understanding of passages and predicting how many comprehension items they would answer correctly. Chinese students also outperformed American students on comprehension tests. All students were able to make more accurate postdictions of comprehension test scores than predictions. Results are related to possible instructional differences between American and Chinese students. Several possible directions for future research are discussed.
This article discusses a case study on an item writing process that reflects on our practical experience in an item development project. The purpose of the article is to share our lessons from the experience, aiming to demystify the item writing process. The study investigated three issues that naturally emerged during the project: how item writers use test specifications in their item writing processes, how group dynamics affect the process, and what factors affect individual item writers’ item writing processes. The article provides practical implications regarding the design of test specifications, training item writers, and profiling item writers’ characteristics.
The College English Test (CET) in China is a high-stakes standardized test to assess college students' English ability. One frequent claim against this test is that teachers may teach to the test, which could narrow the curriculum and turn regular English classes into CET coaching. This study aims to find out whether teachers are truly teaching to the test and the potential reasons involved. In order to gain deeper and more focused insight into the influence of the CET on classroom teaching, only its writing section was examined. Based on data collected from some students and teachers at a university in Beijing, China, it was found that the overall influence of the CET writing section was not as substantial as has been claimed. Due to different stakeholders' perceptions of the CET, the influence on teachers was weak and indirect compared to a stronger and more direct influence on students. Also, teachers did not teach to the test due to the lower priority of writing among the four language skills. The relatively low requirements of the CET writing section and its restrictive testing format also prevented the teachers from teaching to the test. Finally, the teachers' lack of professional training and some logistic factors outweighed the influence of the CET writing section. It is pointed out that teacher factors may outweigh the influence of the CET, and thus rigorous teacher training should be provided to improve the efficiency of classroom teaching.