This study examines the differential item functioning (DIF) of the English version and the Japanese-translated version of the Multiple Affect Adjective Check List—Revised (MAACL-R) using the logistic regression (LR) procedure. The results of the LR are supplemented by multiple group confirmatory factor analysis (MGCFA). A total of five items are consistently flagged as showing DIF by both methods and based on the effect size criteria set forth by Jodoin and Gierl. One item shows nonuniform DIF, whereas four items show uniform DIF. Elimination or modification of the items with DIF can facilitate the appropriate use of the Japanese MAACL-R, especially in cross-cultural studies.
Lawmakers at the state level require good estimates of those without health insurance in the areas they serve to inform policy decisions. These estimates are often built on inadequate data from smaller geographic areas, such as counties. The Small Area Estimates Branch of the U.S. Census Bureau developed a method to generate stable estimates at the county level using data from the Annual Social and Economic Supplement to the Current Population Survey and several other sources. Using data collected in the state of Tennessee, this article presents a less complicated and arguably less expensive alternative to that method, while providing comparable results. Limitations of both methods and suggestions for future research are discussed.
Traditional differential item functioning (DIF) analyses typically contrast the performances of two groups that differ in race, ethnicity, gender, family income, first language spoken at home, or other similar demographic variables. Opportunity to learn (OTL) is an important variable in test bias research. The current research has focused on how to define the OTL and how to measure it. The process of data collection and analyses to evaluate OTL in a given situation is time and energy intensive. Given this, this study proposes that the Mantel-Haenszel (M-H) method, a common DIF method, can be used to identify sources of item bias and provide an efficient method to evaluate the adequacy of OTL in a targeted school. This study examines the feasibility of this method using the data from China Biology Olympiads Open Exam. DIF results are analyzed and compared with the content coverage review process of further OTL study.
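As a rough illustration of the Mantel-Haenszel procedure named in the abstract above, the sketch below computes the M-H common odds ratio for a single item after matching examinees on total score. It is not the authors' analysis: the function name `mantel_haenszel_dif`, the `"ref"`/`"focal"` group labels, and the input layout are assumptions made for this example, and the conversion to the ETS delta scale (-2.35 times the natural log of the odds ratio) is a common convention that the abstract itself does not mention.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(responses):
    """Return (common odds ratio, ETS delta) for one item.

    `responses` is an iterable of (group, total_score, correct) tuples,
    where group is "ref" or "focal" and correct is 0 or 1.
    This layout is a hypothetical choice for the sketch.
    """
    # One 2x2 table per matched score level, stored as
    # [ref correct, ref incorrect, focal correct, focal incorrect].
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in responses:
        idx = (0 if group == "ref" else 2) + (0 if correct else 1)
        tables[score][idx] += 1

    num = den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * d / n  # reference correct x focal incorrect
        den += b * c / n  # reference incorrect x focal correct

    if num == 0 or den == 0:
        raise ValueError("odds ratio undefined: too few discordant cells")

    alpha = num / den                # 1.0 means no DIF at matched ability
    delta = -2.35 * math.log(alpha)  # ETS delta scale (assumed convention)
    return alpha, delta
```

An odds ratio near 1 (delta near 0) suggests the item behaves similarly for matched examinees in both groups. Under this sign convention, a negative delta means the item favors the reference group.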
This study examines how Chinese ESL learners recognize English words while responding to a multiple-choice reading test as compared to Romance language-speaking ESL learners. Four adult Chinese ESL learners and three adult Romance language-speaking ESL learners participated in a think-aloud study with the Michigan English Language Assessment Battery (MELAB) reading test. As indicated by the think-aloud verbal reports, the Chinese ESL learners generally had more difficulty with English vocabulary probably due to the vast difference between the writing system of Chinese and that of English. Rather, they were found to compensate for their deficiencies in vocabulary knowledge by extensively relying on test-taking strategies. The findings of this study are well supported by the cross-linguistic transfer theory and the compensatory nature of reading comprehension. The implications for teaching English vocabulary skills to Chinese ESL learners are also discussed.
Evidence of education fever in China can be found from the 7th century through today. Both historical and contemporary education fever have been created, promoted and maintained by state-orchestrated systems of high-stakes and extremely competitive exams. This comparative study examines the educational and social consequences of both the historical Civil Service (Keju) exam system and the current National College Entrance Exam (NCEE) system in China. Although the two systems focus on different content domains and serve different explicit purposes – employment versus college entrance – both have led to profound effects on society. Similarities and differences in design features between the two systems were compared. A number of common positive and negative social consequences were identified. It was observed that exam-driven education fever in China has been gained at the expense of many unintended and often serious negative consequences. Education fever and these consequences are al...
The advent of online platforms such as Amazon’s Mechanical Turk (MTurk) has expanded considerably researchers’ options for collecting research data. Many researchers, however, express understandable skepticism of the viability of using platforms such as MTurk. In this article, we provide a background on the use of MTurk as a mechanism for collecting research data. We then review what is currently known about the advantages and issues associated with using MTurk and highlight important areas for future research. We conclude by discussing implications of the use of crowdsourcing platforms such as MTurk for education research.
Statistical theories of measurement scores have evolved over the last 100 years. In 1904,
Differential skill functioning (DSF) exists when examinees from different groups have different probabilities of successful performance in a certain subskill underlying the measured construct, given that they have the same ability on the overall construct. Using a DSF approach, this study examined the differences between two native language groups – a group with an East Asian language background and one with a Romance language background – in regard to reading subskills as represented in the Michigan English Language Assessment Battery (MELAB) reading test. Based on a combination of literature review and think-aloud reports from a sample of ESL students, hypotheses on reading subskill differences between the two groups were generated. These hypotheses were tested by first identifying the subskill profile of each examinee in a large MELAB database via the application of a previously determined item-skill Q-matrix to a Fusion Model of cognitive diagnostic modeling. The subskill profiles of the East Asian examinees were then compared against those of examinees with a Romance language background through logistic regression techniques. Some important DSFs were found between the two groups. Based on results of this study, instructional strategies were suggested to address some specific weaknesses in ESL learners’ reading subskills.
Test accommodations have been proposed to help overcome the unfair challenges faced by English Language Learners (ELLs) due to their relatively low English proficiency. A test accommodation is regarded as effective when it improves the test performance of ELLs. However, this improvement raises the question of whether such accommodations give ELLs an unfair advantage. One criterion used in determining a
The outcomes on multiple-choice tests and performance-based assessments for field-independent and field-dependent students were examined. A substantial interaction between cognitive style and assessment approach was found. Results suggested that performance-based assessment tended to favor field-independent subjects. Depending on the purpose and intended use of assessment, this finding may raise concerns for validity based on either fairness or curriculum relevance.
Cognitive diagnostic analyses have been advocated as methods that allow an assessment to function as a formative assessment to inform instruction. To use this approach, it is necessary to first identify the skills required for each item in the test, known as a Q-matrix. However, because the construct being tested and the underlying cognitive processes associated with it are usually not fully understood, establishing a Q-matrix, especially for an existing test, is a challenging task. This study reports the process of constructing and validating a Q-matrix for the reading comprehension section of the Michigan English Language Assessment Battery (MELAB). An initial Q-matrix was first generated based on evidence gathered from related literature, students’ think-aloud protocols, and expert ratings. This initial Q-matrix was then validated empirically by applying the fusion model to a large MELAB data set. A well-supported Q-matrix was produced for potential future diagnostic applications.
A meta-analysis using Hierarchical Linear Modeling (HLM) was conducted to examine the effects of test accommodations on the test performance of English language learners (ELLs). The results indicated that test accommodations improve ELLs' test performance by about 0.157 standard deviations—a relatively small but statistically significant increase. Once the potential predictors that may have contributed to the variance of the effect sizes across studies had been accounted for, only English proficiency was found to be significant. Further, the results indicated that ELLs with a low level of English proficiency benefited much more from test accommodations than did those with a high level of English proficiency. Little difference was observed in regard to other factors such as students' ethnicity, students' grade level, or the subject for which they were being examined. Although previous studies have suggested that linguistic simplification may be more effective than other methods, results from this meta-analysis offered no support for that suggestion.
... Hoi K. Suen and Robert J. Stevens. Abstract ... Successful refutation of this hypothesis, i.e., a finding of statistical significance, means that chance (the "luck of the draw" in sampling) is unlikely to be the cause of the observed relationship or difference in the data. ...
Given the wide use of peer assessment, especially in higher education, the relative accuracy of peer ratings compared to teacher ratings is a major concern for both educators and researchers. This concern has grown with the increase of peer assessment in digital platforms. In this meta-analysis, using a variance-known hierarchical linear modelling approach, we synthesise findings from studies on peer assessment since 1999 when computer-assisted peer assessment started to proliferate. The estimated average Pearson correlation between peer and teacher ratings is found to be .63, which is moderately strong. This correlation is significantly higher when: (a) the peer assessment is paper-based rather than computer-assisted; (b) the subject area is not medical/clinical; (c) the course is graduate level rather than undergraduate or K-12; (d) individual work instead of group work is assessed; (e) the assessors and assessees are matched at random; (f) the peer assessment is voluntary instead of compulsory; (g) the peer assessment is non-anonymous; (h) peer raters provide both scores and qualitative comments instead of only scores; and (i) peer raters are involved in developing the rating criteria. The findings are expected to inform practitioners regarding peer assessment practices that are more likely to exhibit better agreement with teacher assessment.
AUTHOR: Suen, Hoi K.; And Others. TITLE: Generalizability Assessment of Autocorrelated Direct Observation Data: The Applicability of the Tiao-Tan Method and Alternative. PUB DATE: Feb 88. NOTE: 25p.; Paper presented at the Annual Meeting of the Eastern Educational Research Association (Miami Beach, FL, February 24-27, 1988). PUB TYPE: Reports - Evaluative/Feasibility (142); Speeches/Conference Papers (150).
This study demonstrates the advantages of using a constrained optimization algorithm to explore the optimal number of prompts, modes of discourse, and raters for achieving an acceptable level of reliability during a direct writing assessment. Writing samples elicited from 50 college students were rated by 3 graduate students and the scores submitted to a generalizability analysis (G-study). The variance components estimated in the G-study were then used in a branch-and-bound integer programming procedure to determine the optimal number of raters, modes, and prompts to produce a reliable writing assessment. Four different scenarios were examined to show how the optimal answer changes based on the priorities of the measurement situation. (Contains 2 tables, 3 figures, and 24 references.) (Author/SLD)
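The optimization step in the abstract above can be pictured with a small sketch. It makes several simplifying assumptions that are not from the study: a fully crossed person x rater x prompt design with modes of discourse left out, a cost equal to the number of ratings per examinee, and a plain exhaustive search over a small grid in place of the branch-and-bound integer programming procedure the authors used. The names `g_coefficient` and `cheapest_design` and the variance-component keys are hypothetical.

```python
from itertools import product


def g_coefficient(var, n_raters, n_prompts):
    """Relative G coefficient for a crossed person x rater x prompt design.

    `var` maps hypothetical keys to variance components:
    "p" (persons), "pr" (person x rater), "pt" (person x prompt),
    "prt_e" (three-way interaction confounded with error).
    """
    rel_error = (
        var["pr"] / n_raters
        + var["pt"] / n_prompts
        + var["prt_e"] / (n_raters * n_prompts)
    )
    return var["p"] / (var["p"] + rel_error)


def cheapest_design(var, target, max_raters=10, max_prompts=10):
    """Return (cost, n_raters, n_prompts) for the cheapest design that
    reaches `target` reliability, or None if no design on the grid does.

    Cost is assumed to be n_raters * n_prompts (ratings per examinee).
    """
    best = None
    grid = product(range(1, max_raters + 1), range(1, max_prompts + 1))
    for n_r, n_t in grid:
        if g_coefficient(var, n_r, n_t) >= target:
            cost = n_r * n_t
            # Keep the first design found at the lowest cost.
            if best is None or cost < best[0]:
                best = (cost, n_r, n_t)
    return best
```

Changing the cost function or adding constraints (for example, a cap on the number of raters) corresponds to the different measurement priorities the abstract describes. Any variance components passed in would have to come from an actual G-study.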
As cross-cultural comparisons are becoming more prevalent, there is an increasing need for linguistic and cultural equivalence of original and translated tests. With the increasing popularity of Kuder® assessments in career guidance counseling, it is imperative that great effort and attention be placed on obtaining culturally and linguistically equivalent tests. Recently, Kuder® assessments have been translated from English to Korean. Based on the Holland theory of Career Choice, test developers constructed equivalent tests using their understanding of existing cultural and linguistic differences and the back-translation method. However, test translation can lead to construct, method, and item bias. In order to ensure valid assessments, both statistical and judgmental methods are utilized to address these issues. In addition to context and cultural sensitivity review by experts, statistical methods for item-level analysis are critical in detecting potentially biased items for the Kuder® Skills Conf...
With the advent of computer technology, students can now be informed about their progress and provided feedback limited only to the level of efficiency. For the 1999-2000 school year, a computer reporting program, TigerNet, was instituted in a junior high school in Pennsylvania. Through a year-long investigation, data were collected to determine the effects of TigerNet on academic performance. Achievement data were available for 394 students, and parent responses to a questionnaire were received from 460 parents. A path analysis reveals a direct, positive relationship between student use of the system and academic performance. There is also a positive relationship, in an indirect sense, between teacher use and academic performance. This paper discusses the ramifications of these two positive relationships, and how these results are shown in the light of a qualitative analysis of parental responses. The Academic Achievement Motivation Survey adapted from I. Russell (1969) and the par...
Students who are enrolled in MOOCs tend to have different motivational patterns than fee-paying college students. A majority of MOOC students demonstrate characteristics akin more to "tourists" than formal learners. As a consequence, MOOC students' completion rate is usually very low. The current study examines the relations among student motivation, engagement, and retention using structural equation modeling and data from a Penn State University MOOC. Three distinct types of motivation are examined: intrinsic motivation, extrinsic motivation, and social motivation. Two main hypotheses are tested: (a) motivation predicts student course engagement; and (b) student engagement predicts their retention in the course. The results show that motivation is significantly predictive of student course engagement. Furthermore, engagement is a strong predictor of retention. The findings suggest that promoting student motivation and monitoring individual students' online activities...
Education Fever is not a new phenomenon, particularly in Asian countries such as Korea, China, and Vietnam. We see evidence of parents’ concerns and enthusiasm about their children’s education and government’s strong promotion of education throughout history in these countries. As one of the numerous pieces of evidence of such promotion in ancient China, consider the popular “Urge to Study Poem 勸學詩” written by Emperor Zhenzong 宋真宗 (986-1022) of the Song dynasty in China about 1,000 years ago (see Guo, 1994):
The development of massive open online courses (MOOCs) has launched an era of large-scale interactive participation in education. While massive open enrolment and the advances of learning technology are creating exciting potentials for lifelong learning in formal and informal ways, the implementation of efficient and effective assessment is still problematic. To ensure that genuine learning occurs, both assessments for learning (formative assessments), which evaluate students’ current progress, and assessments of learning (summative assessments), which record students’ cumulative progress, are needed. Providers’ more recent shift towards the granting of certificates and digital badges for course accomplishments also indicates the need for proper, secure and accurate assessment results to ensure accountability. This article examines possible assessment approaches that fit open online education from formative and summative assessment perspectives. The authors discuss the importance of, and challenges to, implementing assessments of MOOC learners’ progress for both purposes. Various formative and summative assessment approaches are then identified. The authors examine and analyse their respective advantages and disadvantages. They conclude that peer assessment is quite possibly the only universally applicable approach in massive open online education. They discuss the promises, practical and technical challenges, current developments in and recommendations for implementing peer assessment. They also suggest some possible future research directions.
Sampling fluctuation is an inherent characteristic of data. We cannot pretend that it does not exist, nor can we abandon significance testing without a viable replacement. It remains a concern as long as we wish to draw research conclusions within a positivistic paradigm. It is there regardless of whether we conduct a large-group or a single-subject study. It is also
Considerable attention has been given to the issue of parent-professional congruence, specifically in connection with reliability of assessments. Concerns regarding the trustworthiness of parental assessments have guided research to focus on the conventional issues of interrater reliability and rater interchangeability. However, this conventional perspective may be misdirected and counterproductive. It is argued that the focus should be the reliability of the pooled assessment information from parents and professionals. A more appropriate generalizability theoretic model, which would maximize social, ecological, and construct validity, is proposed. An illustration with data from an early childhood assessment system is provided.
This information was obtained by inspecting the enrollment rosters for each semester. Students who were in the sample but were not in the third semester roster and had not graduated were operationally defined as having dropped out from the university. The degree of ...
This paper addresses the consequences of high stakes testing on the feasibility of psychometric analyses of the test, focusing on the extremely high stakes college entrance exams in China and Korea as cases in point. There is a fundamental conflict between the serious concern for test security in high stakes testing and the need to conduct field tests with a large sample of potential examinees for quality assurance prior to test administration. We propose several judgmental or qualitative approaches as the preferred methods to optimize psychometric qualities and suggest that they take the central role in the development of very high stakes tests.
Many college professors create multiple-choice and other types of tests for use in the classroom or for research purposes. However, most have not been exposed to the principles involved with test development. The purpose of this workshop is to provide an introduction to test development including writing test questions, more commonly known as test "items." Through a combination of both hands-on activities and lecture, participants will be led through the steps of test development starting with an introduction to test specifications, practice with item writing and analysis, and finally ending with a discussion on reliability and validity. This information could prove helpful for those participants who wish to better their own classroom tests as well as for those individuals evaluating or researching educational programs.
[Comment/Reply] Agreement, reliability, accuracy, and validity: Toward a clarification. Suen, Hoi K. Behavioral Assessment, Vol 10(4), Win 1988, 343-366. Database: PsycINFO.