Vahid Aryadoust
  • Vahid ARYADOUST (Dr) | Assistant Professor | English Language and Literature |
    National Institute of Education
    NIE3-03-97, 1 Nanyang Walk, Singapore 637616
    Tel: (65) 6790-3475 GMT+8h | Fax: (65) 6896-9149 | Email: vahid.aryadoust@nie.edu.sg |
    Web: http://www.nie.edu.sg/profile/aryadoust-vahid
  • Vahid Aryadoust, PhD, is Assistant Professor in the English Language and Literature Academic Group.
A fundamental requirement of language assessments that is under-researched in computerized assessments is impartiality (fairness): the equal treatment of test takers regardless of background. The present study aimed to evaluate fairness in the Pearson Test of English (PTE) Academic Reading test, a computerized reading assessment, by investigating differential item functioning (DIF) across Indo-European (IE) and non-Indo-European (NIE) language families. Previous research has shown that similarities between readers’ mother tongue and the second language being learned can advantage some test takers. To test this hypothesis, we analyzed data from 783 international test takers who took the PTE Academic test, using the partial credit model in Rasch measurement. We examined two main types of DIF: uniform DIF (UDIF), which occurs when an item consistently gives a particular group of test takers an advantage across all levels of ability, and non-uniform DIF (NUDIF), which occurs when the relative performance of the groups varies across the ability continuum. The results showed no statistically significant UDIF (p > 0.05) but identified three NUDIF items among the 10 items across the language families. A mother tongue advantage was not observed. Similarity in test takers’ levels of computer and Internet skills, test preparation, and language policies could contribute to the finding of no UDIF. Post-hoc content analysis of the items suggested that the diminished mother tongue advantage for the IE group at high proficiency levels and lucky guesses by low-ability groups may have contributed to the emergence of the NUDIF items. Lastly, recommendations for investigating social and contextual factors are proposed.
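The UDIF/NUDIF distinction above can be illustrated with a toy Rasch-style item. The parameter values below are purely illustrative and are not estimates from the PTE Academic data:

```python
import math

def p_correct(theta, b, group_shift=0.0, interaction=0.0):
    """Rasch-style probability of a correct response.

    group_shift models uniform DIF (a constant advantage for one group);
    interaction models non-uniform DIF (an advantage that changes with
    ability theta). Both values are illustrative, not estimates.
    """
    logit = theta - b + group_shift + interaction * theta
    return 1.0 / (1.0 + math.exp(-logit))

# Uniform DIF: one group's advantage has the same sign at every ability level.
udif_low  = p_correct(-2.0, 0.0, group_shift=0.5) - p_correct(-2.0, 0.0)
udif_high = p_correct( 2.0, 0.0, group_shift=0.5) - p_correct( 2.0, 0.0)

# Non-uniform DIF: the advantage flips sign across the ability continuum.
nudif_low  = p_correct(-2.0, 0.0, interaction=0.4) - p_correct(-2.0, 0.0)
nudif_high = p_correct( 2.0, 0.0, interaction=0.4) - p_correct( 2.0, 0.0)
```

Under the uniform shift, the focal group is advantaged at both low and high ability; under the interaction, low-ability members are disadvantaged while high-ability members are advantaged, which is the crossing pattern NUDIF analyses look for.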
The computerization of reading assessments has presented a set of new challenges to test designers. From the vantage point of measurement invariance, test designers must investigate whether the traditionally recognized causes of violating invariance are still a concern in computer-mediated assessments. In addition, it is necessary to understand the technology-related causes of measurement non-invariance among test-taking populations. In this study, we used available data (n = 800) from previous administrations of the Pearson Test of English Academic (PTE Academic) reading section, an international test of English comprising 10 test items, to investigate measurement invariance across gender and the Information and Communication Technology Development Index (IDI). We conducted a multi-group confirmatory factor analysis (CFA) to assess invariance at four levels: configural, metric, scalar, and structural. Overall, we were able to confirm structural invariance for the PTE Academic, a necessary condition for conducting fair assessments. Implications for computer-based education and the assessment of reading are discussed.
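The four invariance levels form a nested hierarchy of cross-group equality constraints, each level adding constraints on top of the previous one. A minimal sketch of that hierarchy, using generic parameter names rather than the PTE Academic model's actual parameters:

```python
# Each invariance level adds equality constraints across groups on top of
# the previous one. Parameter names are generic placeholders.
INVARIANCE_LEVELS = {
    "configural": set(),                          # same factor structure only
    "metric":     {"loadings"},                   # + equal factor loadings
    "scalar":     {"loadings", "intercepts"},     # + equal item intercepts
    "structural": {"loadings", "intercepts",
                   "factor_variances", "factor_covariances"},
}

ORDER = ["configural", "metric", "scalar", "structural"]

def is_nested(levels, order):
    """Check that each level's constraint set contains the previous one's."""
    return all(levels[a] <= levels[b] for a, b in zip(order, order[1:]))
```

Because the models are nested, each level can be tested against the previous one (e.g., via a chi-square difference test in a multi-group CFA); confirming structural invariance entails that all lower levels also hold.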

https://www.sciencedirect.com/science/article/pii/S0191491X19301452
Volume I of Quantitative Data Analysis for Language Assessment is a resource book that presents the most fundamental techniques of quantitative data analysis in the field of language assessment. Each chapter provides an accessible explanation of the selected technique, a review of language assessment studies that have used the technique, and finally, an example of an authentic study that uses the technique. Readers also get a taste of how to apply each technique with the help of supplementary online resources that include sample data sets and guided instructions. Language assessment students, test designers, and researchers should find this a unique reference, as it consolidates the theory and application of quantitative data analysis in language assessment.
The purpose of the present study was twofold: (a) it examined the relationship between peer-rated likeability and peer-rated oral presentation skills of 96 student presenters enrolled in a science communication course, and (b) it investigated the relationship between student raters’ severity in rating presenters’ likeability and their severity in evaluating presenters’ skills. Students delivered an academic presentation and then changed roles to rate their peers’ performance and likeability, using an 18-item oral presentation scale and a 10-item likeability questionnaire, respectively. Many-facet Rasch measurement was used to validate the data, and structural equation modeling (SEM) was used to examine the research questions. At the aggregate level, likeability explained 19.5% of the variance in the oral presentation ratings and 8.4% of rater severity. At the item level, multiple cause-effect relationships were detected, with the likeability items explaining 6–30% of the variance in the oral presentation items. Implications of the study are discussed.
This chapter describes the listening section of the Internet-Based Test of English as a Foreign Language (TOEFL iBT) which was designed by Educational Testing Service (ETS). The TOEFL iBT is administered in many testing centers around the world and is used to measure academic English language proficiency of test candidates who are applying to universities whose primary language of instruction and research is English.
This chapter aims to demonstrate how peer assessment can be used to generate information in support of teaching and learning in Singapore and other educational settings. The chapter reports on the development of the tertiary-level English oral presentation scale (TEOPS) which is used in a science communication module in a major Singaporean university. A survey of peer assessment and oral presentations is conducted and the multicomponential model of TEOPS is presented. In addition, the importance of the assessment of oracy and presentation skills in Singapore is discussed and a narration of the validation studies of TEOPS, which use many-facet Rasch measurement (MFRM) and students' perception, is presented. The author elaborates on how this scale can be used for peer assessment and provides directions for future research on the peer assessment of oral presentations.
This entry seeks to examine second language (L2) listening comprehension from a subskill-based approach. It provides an overview of two models of listening comprehension, namely the default listening construct and the listening-response model, and delineates listening subskills. It also proposes a list of the subskills that have been identified and validated through empirical research. The entry concludes by discussing the potential relationships between the subskills and the limitations of listening comprehension research. APA citation: Aryadoust, V. (2017). Taxonomies of listening skills. In J. I. Liontas and M. DelliCarpini (Eds.), The TESOL encyclopedia of English language teaching. John Wiley in partnership with TESOL International.
Two models of listening comprehension are presented: a cognitive model for non-assessment settings and a language proficiency model which has been applied extensively to the assessment of listening. The similarities of the models are then discussed, and a general framework for communicative assessment of listening is proposed. The framework considers socio-cognitive aspects of listening assessment and lends itself to both in-class and beyond-class assessment situations.
The Academic Listening Self-rating Questionnaire (ALSA) is a 47-item self-appraisal tool which helps language learners evaluate their own academic listening skills (Aryadoust, Goh, & Lee, 2012). The underlying dimensions of ALSA consist of linguistic components and prosody, cognitive processing skills, relating input to other materials, note-taking, memory and concentration, and lecture structure. The psychometric quality of ALSA has been studied using the Rating Scale Rasch model, structural equation modeling, and correlation analyses. The ALSA can be used to raise tertiary-level students' awareness of their academic listening ability, and of the elements of academic discourse such as lectures and seminars that may affect their academic achievement. Further research is being undertaken to provide validity evidence for two versions of the instrument in Chinese and Turkish, respectively.
Coh-Metrix has emerged as a promising psycholinguistic tool in writing and reading research. Researchers have used Coh-Metrix to predict English proficiency of first and second language learners. The common statistical method used in predictive modeling research is a multiple linear regression model, which has achieved varying degrees of success. This chapter examines the relative merits of the learning/validation method applied in previous Coh-Metrix studies and then proposes genetic algorithm-based symbolic regression as an alternative and efficient approach which provides robust evidence for the predictive power of some of the Coh-Metrix indices. Using a sample of papers written by university students (n = 450), the author demonstrates that genetic algorithm-based symbolic regression is capable of significantly minimizing the error of measurement and providing a much clearer understanding of the data.
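As a rough illustration of the evolutionary idea behind the approach, the sketch below evolves regression coefficients with a simple genetic algorithm (Gaussian mutation plus truncation selection) on synthetic data. Full symbolic regression would also evolve the structure of the expression itself, and the data here are invented, not the study's university essays:

```python
import random

random.seed(0)

# Synthetic predictors standing in for Coh-Metrix indices, and a target
# score generated from known coefficients plus noise (illustrative only).
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [2.0 * a - 1.0 * b + 0.5 * c + random.gauss(0, 0.1) for a, b, c in X]

def mse(w):
    """Mean squared error of a linear model with coefficient vector w."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - t) ** 2
               for x, t in zip(X, y)) / len(X)

def evolve(pop_size=40, generations=60, sigma=0.3):
    """Evolve coefficient vectors by mutation and truncation selection."""
    pop = [[random.uniform(-3, 3) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=mse)
        parents = pop[: pop_size // 4]          # keep the fittest quarter
        pop = parents + [
            [w + random.gauss(0, sigma) for w in random.choice(parents)]
            for _ in range(pop_size - len(parents))
        ]
    return min(pop, key=mse)

best = evolve()
```

The selection pressure steadily drives the population toward coefficient vectors with low prediction error, mirroring (in miniature) how a genetic algorithm can minimize measurement error in predictive modeling.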
In this chapter, we aim to justify a neurobiological approach to language assessment. We argue that neuroscience and genetics offer great potential for language assessment research, specifically in defining and operationalizing language constructs and validation processes, and also for better understanding of second language acquisition and the changes within the brain that relate directly to changes in proficiency levels. Further, we note that converging evidence from the fields of language assessment, cognitive neuroscience, and genetics is enabling the reconceptualization of test takers' competence and performance (see Fox & Hirotani, this volume). Integrating current data analysis methods used in language testing, neuroscience, and genetics would lend a multi-dimensional perspective to assessment and take into consideration the advancements in language assessment, psychometrics, and neuroscience.
Data from 230 test takers who answered 60 reading test items on an Iranian reading test were subjected to Rasch measurement analysis to yield item difficulty parameters. Seven Coh-Metrix attributes (left embeddedness, CELEX, preposition phrase density, verb overlap, imageability of content words, text easability, and lexical diversity) were used as variables to sort test items into two difficulty categories: high-difficulty and low-difficulty. An artificial neural network (ANN) model was applied, with 47 items (82%) used to train the network, 10 items (17.5%) used for testing, and three excluded. The model correctly categorized test items in 89.4% and 100% of cases in the training and testing samples, respectively. The most important variable in classifying items was left embeddedness, an index of syntactic complexity, and the least important was lexical diversity. Overall, the study shows that neural networks can classify low- and high-difficulty reading test items with high precision.
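The train/test classification workflow can be sketched with a toy perceptron on synthetic item features. The features, labels, and split below are illustrative stand-ins, not the study's Coh-Metrix indices or ANN architecture:

```python
import random

random.seed(1)

# Synthetic items: two text features (say, syntactic complexity and word
# frequency) plus a high/low difficulty label. Values are invented.
def make_item():
    complexity = random.uniform(0, 1)
    frequency = random.uniform(0, 1)
    label = 1 if complexity - frequency > 0 else 0   # 1 = high difficulty
    return (complexity, frequency), label

items = [make_item() for _ in range(60)]
train, test = items[:47], items[47:]   # loosely mirrors the 47/10 split

def train_perceptron(data, epochs=50, lr=0.1):
    """Fit a single-layer perceptron by the classic error-driven rule."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), label in data:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = label - pred
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

w, b = train_perceptron(train)
accuracy = sum(
    (1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == label
    for (x1, x2), label in test
) / len(test)
```

Because the synthetic labels are linearly separable, the perceptron converges and classifies held-out items well; the study's ANN plays the same role on real item features, with the test-sample accuracy serving as the evidence of generalization.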
Over the past few decades, the field of language assessment has grown in importance, sophistication, and scope. The increasing internationalization of educational and work contexts, heightened global understanding of the role of assessment in learning (e.g., Black & Wiliam, 2001; Fox, 2014; Rea-Dickens, 2001), greater emphasis on the assessment of educational outcomes (e.g., Biggs & Tang, 2007), and the concomitant expansion of the language testing industry (e.g., Alderson, 2009) have led to unprecedented changes in assessment practices and approaches. These advancements, spurred on by technological innovation and a burgeoning array of new data analysis techniques, have prompted some to suggest (e.g., McNamara, 2014) that language assessment is on the verge of a revolution....
http://www.cambridgescholars.com/trends-in-language-assessment-research-and-practice
Despite prodigious developments in the field of language assessment in the Middle East and the Pacific Rim, research and practice in these areas have been underrepresented in mainstream literature. This volume takes a fresh look at language assessment in these regions, and provides a unique overview of contemporary language assessment research. In compiling this book, the editors have tapped into the knowledge of language and educational assessment experts whose diversity of perspectives and experience has enriched the focus and scope of language and educational assessment in general, and the present volume in particular. The six ‘trends’ addressed in the 26 chapters that comprise this title consider such contemporary topics as data mining, in-class assessment, and washback. The contributors explore new approaches and techniques in language assessment including advances resulting from multidisciplinary collaboration with researchers in computer science, genetics, and neuroscience. The current trends and promising new directions identified in this volume and the research reported here suggest that researchers across the Middle East and the Pacific Rim are playing—and will continue to play—an important role in advancing the quality, utility, and fairness of language testing and assessment practices.
Our interest in putting together the present volume grew out of a burgeoning stream of research into language assessment in the Middle East and the Pacific Rim. As the focus on education and the role of English language teaching continues to intensify across these regions at an unprecedented rate, assessing communication skills becomes an increasingly significant field. Some of the major universities in these regions have had a long history in teaching and assessing English and other languages, and researchers, practitioners, and scholars alike have attempted to develop innovative assessment approaches and techniques to address the pressing needs of language test developers and test takers. At the same time, multiple annual conferences, such as Pacific Rim Objective Measurement Symposium (PROMS) and the Asian Association for Language Assessment (AALA) conference, have been launched to bring scholars together and keep them updated about the latest developments in language and educational assessment in these regions.
Bagheri, M.S., Nikpoor, S., & Aryadoust, S.V. (2007). Crack IELTS in a flash. Shiraz: Sandbad Publication.
Aryadoust, V., Akbarzadeh, S., Afarinesh, A.  (2008). A guidebook to passages 2. Shiraz: Sandbad Publication.
Aryadoust, V., Akbarzadeh S., & Nasiri, E. (2007). IELTS writing tutor, writing task2, general and academic. Tehran: Jungle Publication.

Aryadoust, V., Akbarzadeh S., & Nasiri, E. (2007). IELTS writing tutor, writing task1, academic module. Tehran: Jungle Publication.         

Aryadoust, V., Akbarzadeh S., & Nasiri, E. (2007). IELTS writing tutor, writing task1, general module. Tehran: Jungle Publication.
Aryadoust, V. (2007). A dictionary of sociolinguistics, plus pragmatics and languages. Shiraz: Faramatn Publication.
Aryadoust, V. (2006). A guidebook to passages 1. Shiraz: Sandbad Publication.
A number of scaling models—developed originally for psychological studies—have been adapted for language assessment. Although their application has been promising, they have not yet been validated in language assessment contexts. This study discusses the relative merits of two such models in the context of second language (L2) listening comprehension tests: confirmatory factor analysis (CFA) and cognitive diagnostic models (CDMs). Both CFA and CDMs model multidimensionality in assessment tools, whereas other models force the data to be statistically unidimensional. The two models were applied to the listening test of the Michigan English Language Assessment Battery (MELAB). CFA was found to impose more restrictions on the data than CDM. It is suggested that CFA might not be suitable for modelling dichotomously scored data from L2 listening tests, whereas the CDM used in the study (the Fusion Model) appeared to successfully portray the listening sub-skills tapped by the MELAB listening test. The paper concludes with recommendations on how to use each of these models in modelling L2 listening.
Although second language (L2) listening assessment has been the subject of much research interest in the past few decades, there remain a multitude of challenges facing the definition and operationalization of the L2 listening construct(s). Notably, the majority of L2 listening assessment studies are based upon the (implicit) assumption that listening is reducible to cognition and metacognition. This approach ignores emotional, neurophysiological, and sociocultural mechanisms underlying L2 listening. In this paper, the role of these mechanisms in L2 listening assessment is discussed and four gaps in understanding are explored: the nature of L2 listening, the interaction between listeners and the stimuli, the role of visuals, and authenticity in L2 listening assessments. Finally, a review of the papers published in the special issue is presented and recommendations for further research on L2 listening assessments are provided.
Mobile-assisted language learning (MALL) is a novel approach to language learning and teaching. The present study aims to review the methodological quality of quantitative MALL research by focusing on the application of statistical techniques and on instrument reliability and validity. A total of 174 papers in 41 journals identified through the Scopus database were screened and coded. Of these, 77 quantitative MALL studies that investigated English as a foreign or second language using mobile devices met the inclusion criteria. In the full-text screening, each study was coded for the statistical techniques applied, the assumptions reported, the reliability and validity investigation of the instruments, and the coding practices used. The results show the ubiquity of the general linear model (GLM) (i.e., mean-based data analysis such as the t-test, univariate analysis of variance (ANOVA), and multivariate analysis of variance (MANOVA)), with 61.40% of the analyzed studies using this statistical method. Notably, the majority of studies that used the GLM did not report confirmation of the fundamental assumptions (i.e., normality, homogeneity of variance, and linearity) of such analyses. In addition, a reliance on null hypothesis significance testing was observed without reporting of the practical significance of the investigated effects or relations (effect size). Lastly, less than half of the MALL studies reported reliability, and even fewer reported validity evidence, indicating a lack of evidence for the precision, meaningfulness, and accuracy of many of the measurement instruments used. Implications of these findings for MALL research are discussed, with several suggestions for future research.
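The effect-size reporting the review calls for is straightforward to add alongside a significance test. A minimal sketch of Cohen's d with invented scores (the group labels and numbers below are illustrative, not data from any reviewed study):

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)   # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b)
                          / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Illustrative post-test scores for a MALL group and a control group.
mall    = [72, 75, 71, 78, 74, 76, 73, 77]
control = [70, 72, 69, 74, 71, 73, 70, 74]

d = cohens_d(mall, control)   # standardized effect, independent of sample size
```

Reporting d alongside the p-value tells readers how large the effect is in standard-deviation units, which a significance test alone cannot convey.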
The aim of this study was to investigate how test methods affect listening test takers' performance and cognitive load. Test methods were defined and operationalized as while-listening performance (WLP) and post-listening performance (PLP) formats. To achieve the goal of the study, we examined test takers' (N = 80) brain activity patterns (measured by functional near-infrared spectroscopy (fNIRS)), gaze behaviors (measured by eye-tracking), and listening performance (measured by test scores) across the two test methods. We found that the test takers displayed lower activity levels across brain regions supporting comprehension during the WLP tests relative to the PLP tests. Additionally, the gaze behavioral patterns exhibited during the WLP tests suggested that the test takers adopted keyword matching and "shallow listening." Together, the neuroimaging and gaze behavioral data indicated that the WLP tests imposed a lower cognitive load on the test takers than the PLP tests. However, the test takers achieved higher scores on one of the two WLP tests than on the PLP tests. By incorporating eye-tracking and neuroimaging, this study has advanced current knowledge of cognitive load and the impact of different listening test methods. To advance our knowledge of test validity, other researchers could adopt our research protocol and extend the test method framework used in this study.
This study sought to examine research trends in computer-assisted language learning (CALL) using a retrospective scientometric approach. Scopus was used to search for relevant publications on the topic and generate a dataset consisting of 3,697 studies published in 11 journals between 1977 and 2020. A document co-citation analysis method was adopted to identify the main research clusters in the dataset. The impact of each publication on the field was measured using the burst index and betweenness centrality, and the content of influential publications was closely analysed to determine the focus of each cluster and the key themes of the studies. Overall, we identified seven major clusters. We further found that leveraging synchronous computer-mediated communication and negotiated interaction, multimedia, telecollaboration or e-mail exchanges, blogs, digital games, Wikis, and podcasts to support language learning was probably beneficial for language learning. Varying degrees of support were found across studies for each of these technologies: stronger support for synchronous computer-mediated communication and negotiated interaction, multimedia, telecollaboration or e-mail exchanges, and digital games, and weaker support for blogs, Wikis, and podcasts. The limitations reported in the supporting studies were also considered inconsequential. On the other hand, although some studies reported strong support for blogs, Wikis, and podcasts, some major drawbacks were observed. The findings of the study should be helpful for teachers and instructors deciding whether to use technology in the classroom for instructional purposes. Additionally, researchers and graduate students who need to identify a research topic for their thesis or dissertation may also find the results useful.
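Document co-citation analysis, at its core, counts how often pairs of documents appear together in reference lists, and those counts define the network in which clusters and centrality are computed. A minimal sketch with placeholder document IDs (not real papers from the dataset):

```python
from collections import Counter
from itertools import combinations

# Each citing paper's reference list; IDs are placeholders.
reference_lists = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "C", "D"],
    ["A", "B", "D"],
]

def cocitation_counts(ref_lists):
    """Count how often each pair of documents is cited together."""
    counts = Counter()
    for refs in ref_lists:
        # Sort so each pair has one canonical key, e.g. ("A", "B").
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts

counts = cocitation_counts(reference_lists)
```

Pairs with high co-citation counts become strong edges in the co-citation network; clustering that network is what yields the research clusters described above.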
This study aimed to investigate the test-taking strategies needed for successful completion of a lecture-based listening test by employing self-reported test-taking strategy use, actual strategy use measured via eye-tracking, and test scores. In this study, participants’ gaze behavior (measured by fixation and visit duration and frequency) was recorded while they completed two listening tests of three stages each: pre-listening, in which participants (n = 66) previewed question stems; while-listening, in which participants simultaneously listened to the recording and filled in their answers; and post-listening, in which they had time to review their answers and make necessary amendments. Following the listening tests, participants filled out a post-test questionnaire about their strategy use in each of the three stages. Rasch measurement, t-tests, and path analysis were performed on test scores, questionnaire results, and gaze patterns. Results suggest that gaze measures (visit duration and fixation frequency) predicted participants’ final test performance, while self-reports had moderate predictive power. The findings of this study have implications for the cognitive validity of listening tests, listening test design, and pedagogical approaches to building listening competence.
The present study conducted a systematic review of the item response theory (IRT) literature in language assessment to investigate the conceptualization and operationalization of the dimensionality of language ability. Sixty-two IRT-based studies published between 1985 and 2020 in language assessment and educational measurement journals were first classified into two categories based on a unidimensional or multidimensional research framework, and then reviewed to examine language dimensionality from technical and substantive perspectives. It was found that 12 quantitative techniques were adopted to assess language dimensionality. Exploratory factor analysis was the primary method of dimensionality analysis in papers that had applied unidimensional IRT models, whereas the comparison modeling approach was dominant in the multidimensional framework. In addition, there was converging evidence within the two streams of research supporting the role of factors such as testlets, language skills, subskills, and linguistic elements as sources of multidimensionality, while mixed findings were reported for the role of item formats across research streams. The assessment of reading, listening, speaking, and writing skills was grounded within both unidimensional and multidimensional frameworks. By contrast, vocabulary and grammar knowledge was mainly conceptualized as unidimensional. Directions for continued inquiry and application of IRT in language assessment are provided.
This is the second neurocognitive study of language assessments produced in our lab. In addition to the experiment, we have proposed the concept of neurocognitive validity in language assessment. We are working towards expanding on this framework. We believe neurocognitive approaches to learning and assessment will be the future of education, and it is best that pertinent frameworks be proposed and tested now.
With the advent of new technologies, assessment research has adopted technology-based methods to investigate test validity. This study investigated the neurocognitive processes involved in an academic listening comprehension test, using a biometric technique called functional near-infrared spectroscopy (fNIRS). Sixteen right-handed university students completed two tasks: (1) a linguistic task that involved listening to a mini-lecture (i.e., the Listening condition) and answering questions (i.e., the Questions condition) and (2) a nonlinguistic task that involved listening to a variety of natural sounds and animal vocalizations (i.e., the Sounds condition). The hemodynamic activity in three left-hemisphere brain regions was measured: the inferior frontal gyrus (IFG), dorsomedial prefrontal cortex (dmPFC), and posterior middle temporal gyrus (pMTG). The Listening condition induced higher activity in the IFG and pMTG than the Sounds condition. Although the difference was not statistically significant, activity in the dmPFC was also higher during the Listening condition than during the Sounds condition. The IFG was also significantly more active during the Listening condition than during the Questions condition. Although a significant gender difference was observed in listening comprehension test scores, there was no difference in brain activity (across the IFG, dmPFC, and pMTG) between male and female participants. The implications for test validity are discussed.
This study set out to investigate intellectual domains as well as the use of measurement and validation methods in language assessment and second language acquisition (SLA) research published in English in peer-reviewed journals. Using Scopus, we created two datasets: (i) a dataset of core journals consisting of 1,561 articles published in four language assessment journals, and (ii) a dataset of general journals consisting of 3,175 articles on language assessment published in the top journals of SLA and applied linguistics. We applied document co-citation analysis to detect thematically distinct research clusters. Next, we coded citing papers in each cluster based on an analytical framework for measurement and validation. We found that the focus of the core journals was more exclusively on reading and listening comprehension assessment (primary), facets of speaking and writing performance such as raters and validation (secondary), as well as feedback, corpus linguistics, and washback (tertiary). By contrast, the primary focus of assessment research in the general journals was on vocabulary, oral proficiency, essay writing, grammar, and reading. The secondary focus was on affective schemata, awareness, memory, language proficiency, explicit vs. implicit language knowledge, language or semantic awareness, and semantic complexity. With the exception of language proficiency, this second area of focus was absent in the core journals. It was further found that the majority of citing publications in the two datasets did not carry out inference-based validation on their instruments before using them. More research is needed to determine what motivates authors to select and investigate a topic, how thoroughly they cite past research, and what internal (within-field) and external (between-field) factors lead to the sustainability of a research topic in language assessment.
Over the past decades, the application of Rasch measurement in language assessment has gradually increased. In the present study, we reviewed and coded 215 papers using Rasch measurement published in 21 applied linguistics journals for multiple features. We found that seven Rasch models and 23 software packages were adopted in these papers, with many-facet Rasch measurement (n = 100) and Facets (n = 113) being the most frequently used Rasch model and software, respectively. Significant differences were detected between the numbers of papers that applied Rasch measurement to different language skills and components, with writing (n = 63) and grammar (n = 12) being the most and least frequently investigated, respectively. In addition, significant differences were found between the numbers of papers that did and did not report person separation (n = 73 vs. 142) and item separation (n = 59 vs. 156). An alarming finding was how few papers reported a unidimensionality check (n = 57 vs. 158) or a local independence check (n = 19 vs. 196). Finally, a multilayer network analysis revealed that research involving Rasch measurement has created two major discrete communities of practice (clusters), which can be characterized by features such as language skills, the Rasch models used, and the reporting of item reliability/separation vs. person reliability/separation. Cluster 1 was accordingly labelled the production and performance cluster, whereas cluster 2 was labelled the perception and language elements cluster. Guidelines and recommendations for analyzing unidimensionality, local independence, data-to-model fit, and reliability in Rasch model analysis are proposed.
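The quantities at stake in these guidelines (item difficulties, person abilities, separation reliability) are easy to make concrete. Below is a minimal sketch, not the procedure of any reviewed paper: a dichotomous Rasch model fitted by joint maximum likelihood with Newton-Raphson steps, plus the person separation reliability statistic whose under-reporting the review documents. All function names are hypothetical.

```python
import numpy as np

def fit_rasch(X, n_iter=30):
    """Joint maximum-likelihood fit of the dichotomous Rasch model.
    X is a persons-by-items 0/1 matrix; returns person abilities (theta)
    and item difficulties (b), both in logits."""
    n, k = X.shape
    theta, b = np.zeros(n), np.zeros(k)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        info = p * (1.0 - p)                              # Fisher information per cell
        theta += (X - p).sum(axis=1) / info.sum(axis=1)   # Newton step for persons
        theta = np.clip(theta, -5, 5)                     # guard perfect/zero scores
        b -= (X - p).sum(axis=0) / info.sum(axis=0)       # Newton step for items
        b -= b.mean()                                     # fix the scale: mean b = 0
    return theta, b

def person_separation_reliability(X, theta, b):
    """Share of observed person variance not attributable to measurement
    error (the Rasch person separation reliability)."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    se2 = 1.0 / (p * (1.0 - p)).sum(axis=1)   # squared standard error per person
    return (theta.var() - se2.mean()) / theta.var()
```

On simulated data, the recovered difficulties typically correlate above .95 with the generating values; note that JML estimates carry a known small-sample bias that operational software corrects for.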
This is the first study to investigate the effects of test methods (while-listening performance and post-listening performance) and gender on measured listening ability and brain activation under test conditions. Functional near-infrared spectroscopy (fNIRS) was used to examine three brain regions associated with listening comprehension: the inferior frontal gyrus and posterior middle temporal gyrus, which subserve bottom-up processing in comprehension, and the dorsomedial prefrontal cortex, which mediates top-down processing. A Rasch model reliability analysis showed that listeners were homogeneous in their listening ability. Additionally, there were no significant differences in test scores across test methods and genders. The fNIRS data, however, revealed significantly different activation of the investigated brain regions across test methods, genders, and listening abilities. Together, these findings indicated that the listening test was not sensitive to differences in the neurocognitive processes underlying listening comprehension under test conditions. The implications of these findings for assessing listening and suggestions for future research are discussed.
Even though the field of linguistics has witnessed growing research on comprehension (listening and reading) subskills, there is currently no universally accepted taxonomy for categorizing them. Using a dataset of 192 publications, a document co-citation analysis was conducted. Eighteen discrete research clusters were identified, comprising 73 empirically investigated comprehension subskills, of which 55 were related to first language (L1) comprehension and 18 were associated with second language (L2) comprehension. Fifteen research clusters (83.33%) focused on lower-order L1 processing abilities in reading, such as orthographic processing and speeded word reading. The remaining three clusters were relatively small and focused on L2 comprehension subskills. The list of subskills was visualized in the form of a codex that serves as the first integrative framework for empirically investigated comprehension subskills and processing abilities. The need for experimental investigations to improve the understanding of L2 comprehension subskills was highlighted.
A recent review of the literature concluded that Rasch measurement is an influential approach in psychometric modeling. Despite the major contributions of Rasch measurement to the growth of scientific research across various fields, there is currently no research on the trends and evolution of Rasch measurement research. The present study used co-citation techniques and a multiple perspectives approach to investigate 5,365 publications on Rasch measurement between 01 January 1972 and 03 May 2019 and their 108,339 unique references downloaded from the Web of Science (WoS). Several methods of network development involving visualization and text-mining were used to analyze these data: author co-citation analysis (ACA), document co-citation analysis (DCA), journal co-citation analysis (JCA), and keyword analysis. In addition, to investigate the inter-domain trends that link the Rasch measurement specialty to other specialties, we used a dual-map overlay to examine specialty-to-specialty connections. Influential authors, publications, journals, and keywords were identified. Multiple research frontiers or sub-specialties were detected and the major ones were reviewed, including "visual function questionnaires", "non-parametric item response theory", "valid measures (validity)", "latent class models", and "many-facet Rasch model". One of the outstanding patterns identified was the dominance and impact of publications written for general groups of practitioners and researchers. In personal communications, the authors of these publications stressed their mission as being "teachers" who aim to promote Rasch measurement as a conceptual model with real-world applications. Based on these findings, we propose that sociocultural and ethnographic factors can strongly influence fields of science and should be considered in future investigations of psychometrics and measurement. As the first scientometric review of the Rasch measurement specialty, this study will be of interest to researchers, graduate students, and professors seeking to identify research trends, topics, major publications, and influential scholars.
Eye tracking technology has become an increasingly popular methodology in language studies. Using data from 27 journals in language sciences indexed in the Social Science Citation Index and/or Scopus, we conducted an in-depth scientometric analysis of 341 research publications together with their 14,866 references between 1994 and 2018. We identified a number of countries, researchers, universities, and institutes with large numbers of publications in eye tracking research in language studies. We further discovered a multitude of interconnected research trends that have shaped the nature and development of eye tracking research. Specifically, a document co-citation analysis revealed a number of major research clusters, their key topics, connections, and bursts (sudden citation surges). For example, the foci of clusters #0 through #5 were found to be perceptual learning, regressive eye movement(s), attributive adjective(s), stereotypical gender, discourse processing, and bilingual adult(s). The content of all the major clusters was closely examined and synthesized in the form of an in-depth review. Finally, we grounded the findings within a data-driven theory of scientific revolution and discussed how the observed patterns have contributed to the emergence of new trends. As the first scientometric investigation of eye tracking research in language studies, the present study offers several implications for future research, which are discussed.
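Document co-citation analysis, as used in this and several of the studies above, starts from a simple pairwise count: two references are co-cited once for every citing paper whose reference list contains both. Clustering and visualization tools (e.g., CiteSpace) then operate on that matrix. A minimal sketch of the counting step, with a hypothetical function name:

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(reference_lists):
    """Count document co-citations: for each citing paper's reference
    list, every unordered pair of distinct references it contains is
    co-cited once. Returns a Counter keyed by sorted reference pairs."""
    pairs = Counter()
    for refs in reference_lists:
        # sorted(set(...)) deduplicates and gives a canonical pair order
        for a, b in combinations(sorted(set(refs)), 2):
            pairs[(a, b)] += 1
    return pairs
```

For example, two papers both citing A and B yield a co-citation weight of 2 for the pair (A, B), the edge weight a clustering algorithm would then use.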
The aim of the present study is two-fold. First, it uses eye tracking to investigate the dynamics of item reading, in both multiple-choice (MCQ) and matching items, before and during two hearings of listening passages in a computerized while-listening performance (WLP) test. Second, it investigates answer changing during the two hearings, which include four rounds of item reading: pre-listening in hearing 1, while-listening in hearing 1, pre-listening in hearing 2, and while-listening in hearing 2. The listening test was completed by 28 secondary school students in different sessions. Using time series, cross-correlation functions, and multivariate data analyses, we found that listeners tended to quickly skim the test items, distractors, and answers during pre-listening in hearings 1 and 2. By contrast, during while-listening in hearings 1 and 2, significantly more attention was paid to the written stems, distractors, and options. The increase in attention to the written stems, distractors, and options was greater for the matching items, and interactions between item format and item reading were also detected. Additionally, we observed a mixed answer-changing pattern (i.e., incorrect-to-correct and correct-to-incorrect), although the dominant pattern for both item formats (67%) was incorrect-to-correct. Implications of the findings for language research are discussed.
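The cross-correlation functions used in such time-series analyses relate two signals (e.g., two fixation-duration series) at a range of lags. A minimal sketch, assuming equal-length series and a maximum lag well below the series length; the function name is hypothetical:

```python
import numpy as np

def cross_correlation(x, y, max_lag):
    """Normalized cross-correlation of two equal-length series for lags
    -max_lag..max_lag. A peak at positive lag k means y follows x by
    k samples."""
    x = (np.asarray(x, float) - np.mean(x)) / np.std(x)
    y = (np.asarray(y, float) - np.mean(y)) / np.std(y)
    n = len(x)
    ccf = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            # pair x[t] with y[t + lag]
            ccf[lag] = float(np.mean(x[:n - lag] * y[lag:]))
        else:
            # pair x[t] with y[t + lag] for negative lag
            ccf[lag] = float(np.mean(x[-lag:] * y[:n + lag]))
    return ccf
```

The lag at which the CCF peaks indicates which signal leads and by how much, which is the kind of evidence used to argue that attention to item text tracks the unfolding audio.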
This study investigates the dimensions of visual mental imagery (VMI) in aural discourse comprehension. We introduce a new approach to inspecting VMIs which integrates forensic arts and latent class analysis. Thirty participants listened to three descriptive oral excerpts and then verbalized what they had seen in their mind's eye. The verbalized descriptions were simultaneously illustrated by two trained artists using Adobe Photoshop and digital drawing tablets with electromagnetic induction technology, generating approximations of the VMIs. Next, a code sheet was developed to examine the illustrated VMIs on 16 dimensions. Latent class analysis identified three classes of VMI imaginers with nine discriminating dimensions: clarity, completeness of figures, details, shape crowdedness, shape-added features, texture, space, time and motion, and flamboyance. The classes were further differentiated by significant differences in their listening abilities. An individual lacking the ability to imagine (a condition called aphantasia) was also identified, and some evidence was found that VMIs in listening are both symbolic and depictive.
This study investigates the underlying structure of the listening test of the Singapore-Cambridge General Certificate of Education (GCE) exam, comparing the fit of five cognitive diagnostic assessment models: the deterministic input noisy “and” gate (DINA) model, the generalized DINA (G-DINA) model, the deterministic input noisy “or” gate (DINO) model, the higher-order DINA (HO-DINA) model, and the reduced reparameterized unified model (RRUM). Through model comparisons, a nine-subskill RRUM was found to fit best. The study shows that students' listening test performance depends on an array of test-specific facets, such as the ability to eliminate distractors in multiple-choice questions, alongside listening-specific subskills such as the ability to make inferences. The validated list of listening subskills can be employed as a useful guideline to prepare students for the GCE listening test at schools.
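The DINA model named above has a compact item response function: an examinee answers an item correctly with probability 1 − slip if they master every subskill the Q-matrix requires for that item, and with the guessing probability otherwise. The sketch below shows only that response function, not the estimation code used in the study; names are hypothetical.

```python
import numpy as np

def dina_correct_prob(alpha, Q, guess, slip):
    """Item-correct probabilities under the DINA model.
    alpha: length-K 0/1 skill-mastery vector for one examinee;
    Q: items-by-K Q-matrix (1 = item requires the skill);
    guess/slip: scalar or per-item parameters."""
    # eta = 1 iff the examinee masters every skill the item requires
    eta = np.all(alpha[None, :] >= Q, axis=1).astype(float)
    return eta * (1.0 - np.asarray(slip)) + (1.0 - eta) * np.asarray(guess)
```

For instance, with guess = 0.2 and slip = 0.1, an examinee mastering all required skills for an item answers it correctly with probability 0.9, and any item with a missing required skill drops to 0.2 — the conjunctive ("and" gate) behavior that distinguishes DINA from DINO.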
The present study applied recursive partitioning Rasch trees to a large-scale reading comprehension test (n = 1550) to identify sources of DIF. Rasch trees divide the sample by subjecting the data to recursive non-linear partitioning and estimate item difficulty per partition. The variables used in the recursive partitioning of the data were the test takers' vocabulary knowledge, grammar knowledge, and gender. This generated 11 non-pre-specified DIF groups, for which the item difficulty parameters varied significantly. The study is grounded in the third generation of DIF analysis, and it is argued that DIF induced by the readers' vocabulary and grammar knowledge is not construct-irrelevant. In addition, only 204 (13.16%) test takers, who had significantly high grammar scores, were affected by gender DIF. This suggests that DIF caused by manifest variables only influences certain subgroups of test takers with specific ability profiles, thus creating a complex network of relationships between construct-relevant and construct-irrelevant variables.
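The partitioning logic described above (estimate item difficulties per partition, split where they diverge) can be sketched in simplified form. This is not the Rasch-tree algorithm of the study, which relies on structural-change tests rather than the raw difficulty gap used here; function names, cutpoints, and thresholds are hypothetical.

```python
import numpy as np

def fit_difficulties(X, n_iter=30):
    """Joint maximum-likelihood Rasch item difficulties (mean fixed to 0)."""
    n, k = X.shape
    theta, b = np.zeros(n), np.zeros(k)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        info = p * (1.0 - p)
        theta = np.clip(theta + (X - p).sum(1) / info.sum(1), -5, 5)
        b -= (X - p).sum(0) / info.sum(0)
        b -= b.mean()                      # identify the scale per partition
    return b

def best_split(X, covariate, cuts, min_size=50):
    """Simplified Rasch-tree step: among candidate cutpoints on a person
    covariate, pick the one maximizing the largest absolute item-difficulty
    difference between the two resulting subsamples (a crude DIF signal)."""
    best = (None, 0.0)
    for cut in cuts:
        left = covariate <= cut
        if left.sum() < min_size or (~left).sum() < min_size:
            continue
        gap = np.max(np.abs(fit_difficulties(X[left]) - fit_difficulties(X[~left])))
        if gap > best[1]:
            best = (cut, gap)
    return best
```

Applied recursively to each resulting subsample, and with a proper significance test in place of the raw gap, this yields the tree of non-pre-specified DIF groups the abstract describes.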
This article proposes an integrated cognitive theory of reading and listening that draws on a maximalist account of comprehension and emphasizes the role of bottom-up and top-down processing. The theoretical framework draws on the findings of previous research and integrates them into a coherent and plausible narrative to explain and predict the comprehension of written and auditory inputs. The theory is accompanied by a model that schematically represents the fundamental components of the theory and the comprehension mechanisms described. The theory further highlights the role of perception and word recognition (underresearched in reading research), situation models (missing in listening research), mental imagery (missing in both streams), and inferencing. The robustness of the theory is discussed in light of the principles of scientific theories adopted from Popper (1959).
To cite this article: Vahid Aryadoust & Mehdi Riazi (2017) Future directions for assessing for learning in second language writing research: epilogue to the special issue, Educational Psychology, 37:1, 82-89,
Research into second language writing has developed in depth and scope over the past few decades. Researchers have shown growing interest in new approaches to the teaching and assessment of writing. The provision of diagnostic and/or (automated) corrective feedback (Lee & Coniam, 2013; Liu & Kunnan, 2016), the prediction of writers' ability from psycholinguistic features of their essays (Riazi, 2016), and rater performance (Schaefer, 2008) are but a few major research streams in second language writing. Such new approaches have been hotly debated in the scholarly literature, but there remains a need to investigate new issues emerging from these fields in different environments. Specifically, the role of assessment in writing, the validity of the uses and interpretations of qualitative feedback and scores, and the effectiveness of genre-based approaches to writing continue to be major causes of concern for practitioners and researchers alike.
This study adapts Levels 1 and 2 of Kirkpatrick's model of training evaluation to evaluate the learning outcomes of an English as a second language (ESL) paragraph writing course offered by a major Asian university. The study uses a combination of surveys and writing tests administered at the beginning and end of the course. The survey evaluated changes in students' perception of their skills, attitude, and knowledge (SAK), and the writing tests measured their writing ability. Rasch measurement was applied to examine the psychometric validity of the instruments. The measured abilities were successively subjected to path modeling to evaluate Levels 1 and 2 of the model. The students reported that the course was enjoyable and useful. In addition, their self-perceived level of skills and knowledge developed across time alongside their writing scores, but their attitude remained unchanged. Limitations of Kirkpatrick's model, as well as the lack of solid frameworks for evaluating educational effectiveness in applied linguistics, are discussed.
The fairness and precision of peer assessment have been questioned by educators and academics. Of particular interest, yet poorly understood, are the factors underlying the biases that cause unfair and imprecise peer assessments. To shed light on this issue, I investigated gender and academic major biases in peer assessments of oral presentations. The study sample comprised 66 science students enrolled in a formative assessment-based communication module at an Asian university. Each student delivered an oral presentation in English and also evaluated 10–14 of their classmates' oral presentations. The students' evaluations were anchored by the instructor's evaluation of each oral presentation. I performed many-facet Rasch measurement (MFRM) for two purposes: (a) to examine the effect of multiple facets on the student and teacher ratings of oral presentations and (b) to adjust the ratings on oral presentations according to gender and academic major biases. The scores assigned by student raters had good fit to MFRM; however, when students evaluated oral presentations by peers of the opposite sex, the scores were overestimated. An academic major bias was also observed, where students consistently underestimated the scores of same-major peers. After adjusting for biases, it was concluded that peer assessments can be a reliable and useful form of formative assessment.
This study applies evolutionary algorithm-based (EA-based) symbolic regression to assess the ability of metacognitive strategy use, tested by the metacognitive awareness listening questionnaire (MALQ), and lexico-grammatical knowledge to predict listening comprehension proficiency among English learners. Initially, the psychometric validity of the MALQ subscales, the lexico-grammatical test, and the listening test was examined using the logistic Rasch model and the Rasch-Andrich rating scale model. Next, linear regression found both sets of predictors to have weak or inconclusive effects on listening comprehension; however, the results of EA-based symbolic regression suggested that both lexico-grammatical knowledge and two of the five metacognitive strategies tested strongly and nonlinearly predicted listening proficiency (R² = .64). Constraining prediction modeling to linear relationships is argued to jeopardize the validity of language assessment studies, potentially leading these studies to inaccurately contradict otherwise well-established language assessment hypotheses and theories.
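EA-based symbolic regression itself requires a dedicated evolutionary-search implementation, but the abstract's core point, that restricting models to linear relationships can mask strong nonlinear prediction, can be illustrated with plain least squares on synthetic data. All variable names below are hypothetical stand-ins, not the study's data or results.

```python
import numpy as np

def r_squared(design, y):
    """R-squared of an ordinary least-squares fit of y on the design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return 1.0 - np.var(y - design @ beta) / np.var(y)

# Synthetic illustration: the outcome is driven by the *product* of two
# predictors, a relationship invisible to a purely additive linear model.
rng = np.random.default_rng(0)
n = 500
lexgram = rng.normal(size=n)     # hypothetical lexico-grammatical scores
strategy = rng.normal(size=n)    # hypothetical strategy-use scores
listening = lexgram * strategy + 0.3 * rng.normal(size=n)

ones = np.ones(n)
linear = np.column_stack([ones, lexgram, strategy])
nonlinear = np.column_stack([ones, lexgram, strategy, lexgram * strategy])
print(r_squared(linear, listening), r_squared(nonlinear, listening))
```

The linear model recovers essentially none of the variance, while adding the interaction term recovers most of it, the same qualitative contrast the study reports between linear regression and symbolic regression.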
It has been argued that item difficulty can affect the fit of a confirmatory factor analysis (CFA) model (McLeod, Swygert, & Thissen, 2001; Sawaki, Stricker, & Andreas, 2009). We explored the effect of items with outlying difficulty measures on the CFA model of the listening module of the International English Language Testing System (IELTS). The test has four sections comprising 40 items altogether (10 items in each section). Each section measures a different listening skill, making the test a conceptually four-dimensional assessment instrument.
We sought to develop a measurement instrument that evaluates Singaporean early teens' societal and environmental consciousness; we call the instrument the Singaporean Societal and Environmental Consciousness Inventory (SSECI).
Forty science students received training for 12 weeks on delivering effective presentations and using a tertiary-level English oral presentation scale comprising three subscales (Verbal Communication, Nonverbal Communication, and Content and Organization) measured by 18 items. For their final project, each student was given 10 to 12 min to present on 1 of the 5 compulsory science books for the module and was rated by the tutor, peers, and himself/herself. Many-facet Rasch measurement, correlation, and analysis of variance were performed to analyze the data. The results show that the student raters, tutor, items, and rating scales achieved high psychometric quality, though a small number of assessments exhibited bias. Although all of the biased self-assessments were underestimations of presentation skills, the peer and tutor assessment bias had a mixed pattern. In addition, self-, peer, and tutor assessments had low to medium correlations on the subscales, and a significant difference was found between the assessments. Implications are discussed.
The present study used the mixed Rasch model (MRM) to identify subgroups of readers within a sample of students taking an EFL reading comprehension test. A total of 602 Chinese college students took a reading test and a lexico-grammatical knowledge test and completed a Metacognitive and Cognitive Strategy Use Questionnaire (MCSUQ) (Zhang, Goh, & Kunnan, 2014). MRM analysis revealed two latent classes. Class 1 was more likely to score highly on reading in-depth (RID) items. Students in this class had significantly higher general English proficiency, better lexico-grammatical knowledge, and reported using reading strategies more frequently, especially planning, monitoring, and integrating strategies. In contrast, Class 2 was more likely to score highly on skimming and scanning (SKSN) items, but had relatively lower mean scores for lexico-grammatical knowledge and general English proficiency; they also reported using strategies less frequently than did Class 1. The implications of these findings and further research are discussed.
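The mixed Rasch model assigns respondents to latent classes while fitting class-specific item parameters. A simpler relative, a two-class latent class (Bernoulli mixture) model fitted with EM, conveys the core mechanics of recovering such classes from binary item responses. This is a sketch under that simplification, not the MRM estimation used in the study; the function name is hypothetical.

```python
import numpy as np

def two_class_lca(X, n_iter=100, seed=0):
    """EM for a two-class Bernoulli mixture (latent class analysis).
    X: persons-by-items 0/1 matrix. Returns class weights (pi),
    per-class item-endorsement probabilities (p), and posteriors."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    pi = np.array([0.5, 0.5])
    p = rng.uniform(0.3, 0.7, size=(2, k))   # random start breaks symmetry
    for _ in range(n_iter):
        # E-step: posterior probability of each class per person
        log_lik = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T + np.log(pi)
        post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update mixing weights and item profiles
        pi = post.mean(axis=0)
        p = (post.T @ X) / post.sum(axis=0)[:, None]
        p = p.clip(1e-4, 1 - 1e-4)           # keep logs finite
    return pi, p, post
```

With well-separated classes (e.g., one class strong on one half of the items and weak on the other, mirroring the RID vs. SKSN pattern), posterior class assignments recover the generating classes almost perfectly, up to label switching.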
The present study uses a mixture Rasch model (MRM) to examine latent differential item functioning in English as a foreign language listening tests. Participants (n = 250) took a listening test and a lexico-grammatical test and completed the metacognitive awareness listening questionnaire comprising problem solving (PS), planning and evaluation (PE), mental translation (MT), person knowledge (PK), and directed attention (DA). The listening test was subjected to MRM analysis, where a two-latent-class model showed sufficient fit. Next, an artificial neural network and a chi-square test were used to examine the nature of the latent classes. Class 1 comprised high-ability listeners capable of multitasking who obtained high PS, PE, and lexico-grammatical test scores but low DA, PK, and MT scores. Class 2 comprised low-ability listeners with limited multitasking skills who obtained high DA, PK, and MT scores but low scores on PS, PE, and the lexico-grammatical test. Finally, a model of listening comprehension is postulated and discussed.

Keywords: artificial neural network, gender, item response theory, lexico-grammatical knowledge, listening comprehension, metacognitive strategy awareness, mixture Rasch measurement
This study aims to invoke a theoretical model to link the linguistic features of text complexity, as measured by Coh-Metrix, and text quality, as measured by human raters. 163 Chinese EFL learners wrote sample expository and persuasive essays that were marked by four trained raters using a writing scale comprising Word Choice, Ideas, Organization, Voice, Conventions, and Sentence Fluency traits. The psychometric reliability of the writing scores was investigated using many-facet Rasch measurement. Based on the construction-integration (CI) model of comprehension, three levels of mental representation were delineated for the essays: the surface level (lexicon and syntax), the textbase, and the situation model. Multiple proxies for each level were created using Coh-Metrix, a computational tool measuring various textual features. Using structural equation modeling (SEM), the interactions between the three levels of representation, text quality, and tasks were investigated. The SEM with the optimal fit comprised 23 observed Coh-Metrix variables measuring various latent variables. The results show that tasks affected the situation model and several surface level latent variables. Multiple interactions were identified between writing quality and levels of representation, such as the Syntactic Complexity latent variable predicting the situation model and the situation model latent variable predicting Conventions and Organization. Implications for writing assessment research are discussed.
Keywords: Coh-Metrix; construction-integration model; Rasch measurement; situation model; structural equation modeling; surface structure; textbase
A few computer-assisted language learning (CALL) instruments have been developed in Iran to measure EFL (English as a foreign language) learners’ attitude toward CALL. However, these instruments have no solid validity argument and accordingly would be unable to provide a reliable measurement of attitude. The present study aimed to develop a CALL attitude instrument (CALLAI) to be used in the Iranian EFL context. A pool of 633 survey items was developed and 27 items were judged to be appropriate for measuring CALL attitude. The chosen items were translated and back-translated by experts and were administered to 1001 Iranian EFL learners. The psychometric features of the items were examined using three primary data analysis techniques: principal component analysis (PCA), confirmatory factor analysis (CFA), and the Rasch-Andrich rating scale model. Finally, a validity argument for CALLAI was developed which comprised five primary inferences. The findings from the psychometric analysis were mapped onto the validity framework. The validity framework is generally well supported, although adding a few items could yield higher reliability coefficients.
Research shows that test method can exert a significant impact on test takers’ performance and thereby contaminate test scores. We argue that common test method can exert the same effect as common stimuli and violate the conditional independence assumption of item response theory models because, in general, subsets of items which have a shared feature are a source of response dependence (Marais & Andrich, 2008). In this study, we use the Rasch testlet model (Wang & Wilson, 2005a) to examine the effect of test method on violating the unidimensionality assumption of the Rasch model. Results show that test formats can introduce small to large construct-irrelevant variance, contaminate test scores, and lead to the violation of the conditional independence assumption. Our findings further suggest that the degree of construct-irrelevant variance exerted by test method could be a function of test format familiarity.
Keywords: conditional independence, Rasch testlet model, test method
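One common way to probe the conditional (local) independence assumption discussed above is Yen's Q3 statistic: correlations among item residuals after the Rasch model's expected scores are removed. The sketch below simulates a shared testlet/method effect on two items and shows the elevated residual correlation it leaves behind. For simplicity it uses the generating person and item parameters directly; it is an illustration of local-dependence diagnostics, not the Rasch testlet model used in the study.

```python
import numpy as np

def q3_matrix(X, theta, b):
    """Yen's Q3: correlations among item residuals after removing the
    dichotomous Rasch model's expected scores. Elevated values between
    items sharing a stimulus or format suggest local dependence."""
    P = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    resid = X - P
    return np.corrcoef(resid, rowvar=False)

# Simulate a shared testlet/method effect on items 0 and 1
rng = np.random.default_rng(0)
n, k = 1000, 6
theta = rng.normal(size=n)
b = np.linspace(-1, 1, k)
testlet = rng.normal(scale=1.5, size=n)     # shared nuisance dimension
logit = theta[:, None] - b[None, :]
logit[:, 0] += testlet
logit[:, 1] += testlet
X = (rng.random((n, k)) < 1 / (1 + np.exp(-logit))).astype(float)

# Q3[0, 1] is clearly elevated; Q3[2, 3], an independent pair, is near zero
Q3 = q3_matrix(X, theta, b)
```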
The testing and teaching of listening has been partially guided by the notion of subskills, a set of listening abilities needed for achieving successful comprehension and utilization of the information in listening texts. Although this notion came about mainly through applications of theoretical perspectives from psychology and communication studies, the actual divisibility of the subskills has rarely been examined. This article reports an attempt to do so by using data from the answers of 916 test takers of a retired version of the Michigan English Language Assessment Battery listening test. First, an iterative content analysis of items was carried out, identifying five key subskills. Next, the discriminability of the subskills was examined through confirmatory factor analysis (CFA). Five independent measurement models representing the subskills were evaluated. The overall CFA model comprising the measurement models showed excessively high correlations among factors. Further tests through CFA resolved the inadmissible correlations, though the high correlations persisted. Finally, we constructed 23 aggregate-level items that were used in a higher-order model, which produced the best fit indices and resolved the inadmissible estimates. The results show that the subskills in the test were empirically divisible, lending support to scholarly efforts to delineate components of the listening construct for the purposes of teaching and assessment.
This study investigates the development in paragraph writing ability of 116 undergraduate English as a second language (ESL) students enrolled in a paragraph writing course. Students wrote sample paragraphs before, during, and after the course, and these were marked on an analytical scale by multiple expert raters. The results were first subjected to many-facet Rasch model (MFRM) analysis to measure differences in rater severity and identify rater misfits; raters’ scores were anchored to these initial results to generate fair scores for students. Next, a curve-of-factors latent growth model was fitted to the scores. The results showed that students’ ability in multiple writing skills grew gradually and linearly from the beginning of the course. This progress was found to be independent of the writing prompts. Students’ development is attributed to a variety of facilitative factors, including explicit lessons and frequent practice, regular feedback through a continuous assessment (CA) approach and various opportunities to engage with class tutors, and the use of online technology in the course.
http://www.ajsotl.edu.sg/article/examining-the-development-of-paragraph-writing-ability-of-tertiary-esl-students-a-continuous-assessment-study/
This study sought to examine the development of paragraph writing skills of 116 English as a second language university students over the course of 12 weeks, and the relationship between the linguistic features of students’ written texts as measured by Coh-Metrix (a computational system for estimating textual features such as cohesion and coherence) and the scores assigned by human raters. The raters’ reliability was investigated using many-facet Rasch measurement (MFRM); the growth of students’ paragraph writing skills was explored using a factor-of-curves latent growth model (LGM); and the relationships between changes in linguistic features and writing scores across time were examined by path modelling. MFRM analysis indicates that despite several misfits, students’ and raters’ performances and the scale’s functionality conformed to the expectations of MFRM, thus providing evidence of psychometric validity for the assessments. LGM shows that students’ paragraph writing skills develop steadily during the course. The Coh-Metrix indices have more predictive power before and after the course than during it, suggesting that Coh-Metrix may struggle to discriminate between some ability levels. Whether a Coh-Metrix index gains or loses predictive power over time is argued to be partly a function of whether raters maintain or lose sensitivity to the linguistic feature measured by that index in their own assessments as the course progresses.

Keywords: Coh-Metrix; factor-of-curves latent growth model; linguistic features; many-facet Rasch measurement; paragraph writing
""This article reports the development of the Test Takers’ Metacognitive Awareness Reading Questionnaire (TMARQ) which measure test takers’ metacognition in reading comprehension tests. TMARQ comprises seven subscales: planning strategies, evaluating strategies, monitoring strategies, strategies for identifying important information, inference-making strategies, integrating strategies, and supporting strategies. In this article, a validity argument is laid out for the questionnaire by presenting content-referenced, substantive, and structural evidence of validity, which is primarily yielded through Rasch measurement and structural equation modeling.
http://link.springer.com/article/10.1007%2Fs40299-013-0083-z

Keywords: metacognitive awareness; reading test; Rasch measurement; structural equation modeling; validity
""
The purpose of this paper is to examine the psychometric features of the International English Language Competency Assessment (IELCA) listening test. Specifically, it explores the reliability and underlying structure of the test and sheds light on test method effects.
This study reports a novel application of Adaptive Neuro-Fuzzy Inference Systems (ANFIS) to a second language listening test and compares it with path modeling of observed variables. Seven explanatory variables were defined and hypothesized to influence the primary dependent variable, test item difficulty. Next, a matrix of these eight variables (the seven predictors plus item difficulty) was developed and subjected to ANFIS and path modeling. ANFIS analysis found stronger effects for several of the seven explanatory variables. Path modeling captured some of the same effects through a mediating variable, test section, which captures aggregate differences across the subsections of the test. In general, neuro-fuzzy models (NFMs) appear to be a promising tool in language and educational assessment.

Keywords: Adaptive Neuro-Fuzzy Inference Systems (ANFIS); item difficulty; listening test
"This article reports on the development and administration of the Academic Listening Self-Assessment Questionnaire (ALSA). The ALSA was developed on the basis of a proposed model of academic listening comprising six related components.... more
"This article reports on the development and administration of the Academic Listening Self-Assessment Questionnaire (ALSA). The ALSA was developed on the basis of a proposed model of academic listening comprising six related components. The researchers operationalized the model, subjected items to iterative rounds of content analysis, and administered the finalized questionnaire to international ESL (English as a second language) students in Malaysian and Australian universities. Structural equation modeling and Rasch rating scale modeling of data provided content-related, substantive, and structural validity evidence for the instrument. The researchers explain the utility of the questionnaire for educational and assessment purposes.

Keywords: academic listening, language testing, Rasch Rating Scale model self-assessment, structural equation modeling"
Language self-appraisal (or self-assessment) is a process by which students evaluate their own language competence. This article describes the relationship between students’ self-appraisals and their performance on a measure of academic listening (AL). Following Aryadoust and Goh (2011), AL was defined as a multi-componential construct including cognitive processing skills, linguistic components and prosody, note-taking, relating input to other materials, knowledge of lecture structure, and memory and concentration. Participants (n = 63) were given a self-assessment questionnaire founded upon the components of AL presented by Aryadoust and Goh, and a test of academic listening developed by the Educational Testing Service (ETS); subsequently, the correlations between their performance on the two measures were examined. Significant correlations were found, indicating that learners assessed their listening skills fairly accurately and precisely. Pedagogical implications and applications of self-assessment are discussed in this paper.

http://blog.nus.edu.sg/eltwo/2012/05/29/reliability-of-second-language-listening-self-assessments-implications-for-pedagogy/
SEM applied to language assessment tools
In the first installment of this article, I reviewed cognitive diagnostic assessment (CDA) and mentioned its advantages over other latent trait methods. I argued that the difficulty of a task can be accounted for by multiple factors or attributes. Conventional unidimensional item response theory (IRT) models do not disseminate information concerning the factors contributing to task difficulty. The fusion model, which is a CDA model, on the other hand partitions the difficulty parameter so as to furnish fine-grained information about the tasks and test takers’ ability levels. I further argued that the granularity of the attributes is determined by researchers. In this installment, the application of the fusion model to a while-listening performance (WLP) test is described.
The most important property of a measurement tool is the validity of the uses and interpretations of its scores. Test developers attempt to establish validity by exploiting different techniques. Conventionally, validity has subsumed content, criterion, predictive, and construct classes. However, the new argument-based approach to validity focuses on the uses and interpretations of test scores. The argument-based approach to validity was introduced to the field by Kane (1992, 2001, 2002, 2004, 2006) (see also Mislevy, Steinberg, & Almond, 2003; Kane, Crooks, & Cohen, 1999; Mislevy, 2003; Koenig & Bachman, 2004; Bachman, 2005).
This study investigates the psychometric quality of a placement tool to assess English as a second language (ESL) writing. The author proposes an ESL writing model comprising four major facets: examinees, raters, tasks, and scoring criteria. The model has five scoring criteria: relevance and adequacy of content, compositional organization, cohesion, adequacy of vocabulary, and grammar; these are evaluated using a seven-point scoring rubric. The data were subjected to many-facet Rasch analysis, which showed that the facets adopted in the present study functioned according to the expectations of the Rasch model in only a few cases; further studies should address the psychometric properties of the rating scale, which might be a major cause of central tendency error.
Several studies have evaluated sentence structure and vocabulary (SSV) as a scoring criterion in assessing writing, but no consensus on its functionality has been reached. The present study presents evidence that this scoring criterion may not be appropriate in writing assessment. Scripts by 182 ESL students at two language centers were analyzed with the Rasch partial credit model. Although other scoring criteria functioned satisfactorily, SSV scores did not fit the Rasch model, and analysis of residuals showed SSV scoring on most test prompts loaded on a benign secondary dimension. The study proposes that a lexico-grammatical scoring criterion has potentially conflicting properties, and therefore recommends considering separate vocabulary and grammar criteria in writing assessment.
This paper integrates the Rasch validity model (Wright & Stone, 1988, 1999) into the argument-based validity framework (Kane, 1992, 2004). Rasch validity subsumes fit and order validity. Order validity has two subcategories: meaning validity (originating from the calibration of test variables) and utility validity (based on the calibration of persons to implement criterion validity). Fit validity concerns the consistency of response patterns. From 1) analysis of residuals, i.e., the differences between the Rasch model’s expectations and the responses, 2) analysis of item fit, which can help revise the test, and 3) analysis of person fit, which can help diagnose the test takers whose performance does not fit our expectations, we obtain response, item function, and person performance validity, respectively....
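The fit-validity strand above rests on residual-based fit statistics. As a minimal illustration (not the Wright & Stone procedures themselves), infit and outfit mean-squares for dichotomous Rasch data can be computed as follows; the simulated data and parameter values are invented for the demonstration.

```python
import numpy as np

def rasch_fit_stats(X, theta, b):
    """Infit and outfit mean-square statistics for dichotomous Rasch data.
    Values near 1.0 indicate response patterns consistent with the model;
    common screening bounds are roughly 0.7-1.3."""
    P = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    W = P * (1 - P)                       # model variance of each response
    Z2 = (X - P) ** 2 / W                 # squared standardized residuals
    outfit = Z2.mean(axis=0)                              # unweighted
    infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)    # information-weighted
    return infit, outfit

# Model-conforming simulated data should yield item fit values close to 1.0
rng = np.random.default_rng(2)
n, k = 2000, 8
theta = rng.normal(size=n)
b = np.linspace(-1.5, 1.5, k)
P = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
X = (rng.random((n, k)) < P).astype(float)

infit, outfit = rasch_fit_stats(X, theta, b)
```

Outfit weights all responses equally and so is more sensitive to outlying responses from persons far from an item's difficulty; infit down-weights them, which is why both are usually reported together.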


Though significant discussions in the writing assessment literature focus on understanding the relationship between the quality of second language (L2) students’ texts measured by human judges and linguistic features identified by automated rating engines such as Coh-Metrix, little attention (if any) has been given to assessing reflective essays presented as individual student blog posts in a tertiary-level communication course. The present study examines the relationship between the linguistic features of the reflective blog posts of Asian university learners enrolled in a professional communication course, as measured by Coh-Metrix, and these posts’ quality as assessed by human raters in discrete assessments. Rather than using traditional linear regression methods, the data were subjected to classification and regression trees (CART) to address the following research question:

How might Coh-Metrix indices of linguistic features including lexical diversity, syntactic complexity, word frequency, and grammatical accuracy relate to the assessment of these reflection essays made by the instructor?

This study uses data from 104 tertiary students enrolled in a communication module. They completed four writing tasks at four time points (i.e., Pre-Course, Mid-1-Course, Mid-2-Course, and End-Course), yielding 416 essays, which were marked holistically by human raters and analyzed via Coh-Metrix. Eighty-four linguistic features for each essay (including vocabulary sophistication, lexical diversity, syntactic sophistication, and cohesion statistics) were recorded.
A description of the nature of the reflective blog posts will be presented, along with the rationale for this study, further details on the methodology used for analyzing each post, and preliminary findings. It will also be argued that using CART modeling to predict essay quality from linguistic features is novel: unlike linear regression models, CART relaxes the normality assumption, optimizing the predictive power of the analysis.
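The core CART move, recursively splitting a predictor to minimize squared error rather than fitting a line, can be sketched in a few lines. The example below is a hypothetical single-split illustration on simulated data, not the study's analysis; the `best_split` helper and the "lexical diversity" proxy are invented for the demonstration.

```python
import numpy as np

def best_split(x, y):
    """Find the single threshold on predictor x that minimizes the summed
    squared error around the two resulting group means -- the core CART
    step. Growing a full tree repeats this recursively on each side."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_t, best_sse = None, np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = (xs[i - 1] + xs[i]) / 2, sse
    return best_t, best_sse

# Toy data: the essay score jumps once a hypothetical "lexical diversity"
# proxy passes 0.5 -- a step pattern a linear model blurs but CART isolates
rng = np.random.default_rng(3)
x = rng.random(200)
y = np.where(x > 0.5, 4.0, 2.0) + rng.normal(scale=0.1, size=200)

threshold, sse = best_split(x, y)   # threshold lands very close to 0.5
```

Because each split conditions only on an ordering of x, no normality or linearity assumption is needed, which is the property the abstract highlights.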
In a series of YouTube videos, I provide systematic guidelines for using SPSS, WINSTEPS, and other statistical software and interpreting their output. Please subscribe to receive notifications when new videos are released:

https://www.youtube.com/channel/UCfu2GCdjq50W-kL-cv3rcLw?view_as=subscriber
Special Issue on Research into Learner Listening
Guest Editors: Christine C. M. Goh and Vahid Aryadoust
Special Issue on Using Assessment Tasks for Improving Second Language Writing

EDUCATIONAL PSYCHOLOGY
AN INTERNATIONAL JOURNAL OF EXPERIMENTAL EDUCATIONAL PSYCHOLOGY
It has been argued that item difficulty can affect the fit of a confirmatory factor analysis (CFA) model (McLeod, Swygert, & Thissen, 2001; Sawaki, Stricker, & Oranje, 2009). We explored the effect of items with outlying difficulty measures on the CFA model of the listening module of the International English Language Testing System (IELTS). The test has four sections comprising 40 items altogether (10 items in each section). Each section measures a different listening skill, making the test a conceptually four-dimensional assessment instrument...
Paper presented at the fourth ALTE conference, Krakow, Poland.
Research into the psychological and cognitive aspects of language learning, and second language (L2) learning in particular, demands new measurement tools that provide highly detailed information about language learners’ progress and proficiency. A new development in measurement models is cognitive diagnostic assessment (CDA), which helps language assessment researchers evaluate students’ mastery of specific language sub-skills with greater specificity than other item response theory models. This paper discusses the tenets of CDA models in general and the fusion model (FM) in particular, and reports the results of a study applying the FM to the lecture-comprehension section of the International English Language Testing System (IELTS) listening module. The FM separates only two major listening sub-skills (i.e., the ability to understand explicitly stated information and the ability to make close paraphrases), likely indicating construct underrepresentation. It also provides a mastery/non-mastery profile of test takers. Implications for assessing listening comprehension and the IELTS are discussed.
The application of MFRM to writing tests of English.
Factor structure of the IELTS listening test
This report reviews three prominent conceptualizations of validity (i.e., Embretson, 1983; Kane, 2002; Messick, 1989) to lay out a validity argument (VA) for the International English Proficiency Test (IEPT). To build and support the VA for the IEPT, we endorse Kane’s (2004, 2006, 2012) conceptualization which defines validity as a two-stage undertaking: making claims about the uses and interpretations of the scores (or interpretive argument) and evaluating the claims (or VA). The report further proposes several rigorous research methods and psychometric models to support the VA. The document, however, does not compare these concepts. For further information, readers are referred to Aryadoust (forthcoming).
Researchers have recently shown an increased interest in examining the link between the assessment of English as a second language (ESL) students’ written texts by human raters and assessment by automated rating machines. Studies show that, depending on their training and background, human raters are relatively reliable in writing assessment (Weigle, 2002). However, human rating requires double-marking and has logistical limitations. To overcome these constraints, researchers have recently turned to automated rating machines. Rating machines are economical and have become increasingly reliable as a result of recent developments in applied linguistics and computer science....
Although language assessment and testing can be viewed as having a much longer history (Spolsky, 2017; Farhady, 2018), its genesis as a research field is often attributed to Carroll’s (1961) and Lado’s (1961) publications. Over the past decades, the field has gradually grown in scope and sophistication as researchers have adopted various interdisciplinary approaches to problematize and address old and new issues in language assessment as well as learning. The assessment and validation of reading, listening, speaking, and writing, as well as language elements such as vocabulary and grammar have formed the basis of extensive studies (e.g., Chapelle, 2008). Emergent research areas in the field include the assessment of sign languages (Kotowicz et al., 2021). In addition, researchers have employed a variety of psychometric and statistical methods to investigate research questions and hypotheses (see chapters in Aryadoust and Raquel, 2019, 2020). The present special issue entitled “Front...
Test fairness has been recognised as a fundamental requirement of test validation. Two quantitative approaches to investigate test fairness, the Rasch-based differential item functioning (DIF) detection method and a measurement invariance technique called multiple indicators, multiple causes (MIMIC), were adopted and compared in a test fairness study of the Pearson Test of English (PTE) Academic Reading test (n = 783). The Rasch partial credit model (PCM) showed no statistically significant uniform DIF across gender and, similarly, the MIMIC analysis showed that measurement invariance was maintained in the test. However, six pairs of significant non-uniform DIF (p < 0.05) were found in the DIF analysis. A discussion of the results and post-hoc content analysis is presented and the theoretical and practical implications of the study for test developers and language assessment are discussed.
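A classical, distribution-light complement to the Rasch-based uniform DIF analysis described above is the Mantel-Haenszel common odds ratio, computed within ability strata. The sketch below flags simulated uniform DIF; stratifying on the true ability variable is a simplification (operational analyses stratify on observed total scores), and the code illustrates the general technique rather than the study's method.

```python
import numpy as np

def mantel_haenszel_or(x, group, strata):
    """Mantel-Haenszel common odds ratio for one item: compares the odds
    of a correct response between a reference (0) and focal (1) group
    within ability strata. A value near 1.0 suggests no uniform DIF."""
    num = den = 0.0
    for s in np.unique(strata):
        m = strata == s
        a = np.sum(x[m] * (group[m] == 0))         # reference, correct
        b = np.sum((1 - x[m]) * (group[m] == 0))   # reference, incorrect
        c = np.sum(x[m] * (group[m] == 1))         # focal, correct
        d = np.sum((1 - x[m]) * (group[m] == 1))   # focal, incorrect
        t = a + b + c + d
        if t > 0:
            num += a * d / t
            den += b * c / t
    return num / den

# Simulate an item that is uniformly harder for the focal group
rng = np.random.default_rng(4)
n = 4000
group = rng.integers(0, 2, size=n)
theta = rng.normal(size=n)
p = 1 / (1 + np.exp(-(theta - 0.8 * group)))   # focal group disadvantaged
x = (rng.random(n) < p).astype(float)
strata = np.digitize(theta, np.quantile(theta, [0.2, 0.4, 0.6, 0.8]))

or_hat = mantel_haenszel_or(x, group, strata)   # clearly above 1.0
```

Because the comparison is made within strata, group differences in overall ability are controlled for, and only a group-by-item interaction (i.e., DIF) pushes the odds ratio away from 1.0.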
This study evaluated the validity of the Michigan English Test (MET) Listening Section by investigating its underlying factor structure and the replicability of its factor structure across multiple...
Social interactions accompany individuals throughout their whole lives. When examining the underlying mechanisms of social processes, dynamics of synchrony, coordination, or attunement emerge between individuals at multiple levels. To identify the impactful publications that have studied such mechanisms and to establish the trends that dynamically shaped the available literature, the current study adopted a scientometric approach. A sample of 543 documents dated from 1971 to 2021 was derived from Scopus. Subsequently, a document co-citation analysis was conducted on 29,183 cited references to examine the patterns of co-citation among the documents. The resulting network consisted of 1,759 documents connected to each other by 5,011 links. Within the network, five major clusters were identified. The analysis of the content of the three major clusters, namely “Behavioral synchrony,” “Towards bio-behavioral synchrony,” and “Neural attunement,” suggests an interest in studying attunement in...
This study investigates the underlying structure of the listening test of the Singapore–Cambridge General Certificate of Education (GCE) exam, comparing the fit of five cognitive diagnostic assessment models comprising the deterministic input noisy “and” gate (DINA), generalized DINA (G-DINA), deterministic input noisy “or” gate (DINO), higher-order DINA (HO-DINA), and the reduced reparameterized unified model (RRUM). Through model comparisons, a nine-subskill RRUM was found to possess the optimal fit. The study shows that students’ listening test performance depends on an array of test-specific facets, such as the ability to eliminate distractors in multiple-choice questions, alongside listening-specific subskills such as the ability to make inferences. The validated list of listening subskills can be employed as a useful guideline for preparing students for the GCE listening test in schools.
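The DINA model named above is an “and” gate: barring slips and guesses, a test taker answers an item correctly only if they have mastered every subskill the item requires, as encoded in a Q-matrix. A minimal sketch of the DINA response function; the Q-matrix and slip/guess values below are hypothetical, not parameters estimated for the GCE test.

```python
import numpy as np

# Hypothetical Q-matrix: rows = items, columns = listening subskills
# (e.g., inferencing, distractor elimination); 1 = item requires the subskill.
Q = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
])
slip  = np.array([0.10, 0.15, 0.20])  # P(incorrect | all required subskills mastered)
guess = np.array([0.25, 0.20, 0.10])  # P(correct   | some required subskill missing)

def dina_prob(alpha):
    """P(correct) per item under the DINA 'and' gate for mastery profile alpha."""
    eta = np.all(Q <= alpha, axis=1)  # True iff every required subskill is mastered
    return np.where(eta, 1 - slip, guess)

# A test taker who has mastered subskills 1 and 3 but not 2:
print(dina_prob(np.array([1, 0, 1])))
```

G-DINA, DINO, HO-DINA, and RRUM relax or reparameterize this gate in different ways; model comparison then asks which gating structure best reproduces the observed response patterns.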
This article proposes an integrated cognitive theory of reading and listening that draws on a maximalist account of comprehension and emphasizes the role of bottom-up and top-down processing. The theoretical framework draws on the findings of previous research and integrates them into a coherent and plausible narrative to explain and predict the comprehension of written and auditory inputs. The theory is accompanied by a model that schematically represents the fundamental components of the theory and the comprehension mechanisms described. The theory further highlights the role of perception and word recognition (underresearched in reading research), situation models (missing in listening research), mental imagery (missing in both streams), and inferencing. The robustness of the theory is discussed in light of the principles of scientific theories adopted from Popper (1959).
The effectiveness of a language test to meaningfully diagnose a learner’s language proficiency remains in some doubt. Alderson (2005) claims that diagnostic tests are superficial because they do not inform learners what they need to do in order to develop; “they just identify strengths and weaknesses and their remediation” (p. 1). In other words, a test cannot claim to be diagnostic unless it facilitates language development in the learner. In response to the perceived need for a mechanism to both provide diagnostic information and specific language support, four Hong Kong universities have developed the Diagnostic English Language Tracking Assessment (DELTA), which could be said to be meaningfully diagnostic because it is both integrated into the English language learning curriculum and used in combination with follow-up learning resources to guide independent learning.
The purpose of the present study was twofold: (a) it examined the relationship between peer-rated likeability and peer-rated oral presentation skills of 96 student presenters enrolled in a science communication course, and (b) it investigated the relationship between student raters’ severity in rating presenters’ likeability and their severity in evaluating presenters’ skills. Students delivered an academic presentation and then changed roles to rate their peers’ performance and likeability, using an 18-item oral presentation scale and a 10-item likeability questionnaire, respectively. Many-facet Rasch measurement was used to validate the data, and structural equation modeling (SEM) was used to examine the research questions. At an aggregate level, likeability explained 19.5% of the variance of the oral presentation ratings and 8.4% of rater severity. At an item-level, multiple cause-effect relationships were detected, with the likeability items explaining 6–30% of the variance in the oral presentation items. Implications of the study are discussed.
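Variance-explained figures such as the 19.5% reported above correspond to the R² of a regression of one set of ratings on the other. A minimal sketch on simulated data; the effect size and noise level below are assumptions, not the study's estimates, and the SEM used in the study generalizes this single-equation case.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 96  # matches the study's sample size, but the data here are simulated

likeability = rng.normal(size=n)
# Hypothetical: presentation ratings partly driven by likeability, plus noise.
presentation = 0.45 * likeability + rng.normal(scale=0.9, size=n)

# Ordinary least squares with an intercept, then R^2 from the residuals.
X = np.column_stack([np.ones(n), likeability])
beta, *_ = np.linalg.lstsq(X, presentation, rcond=None)
resid = presentation - X @ beta
r2 = 1 - resid.var() / presentation.var()
print(f"variance in ratings explained by likeability: {r2:.1%}")
```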
Validity evidence is provided for a Persian blog attitude questionnaire (P-BAQ). The P-BAQ was administered to 565 Iranians, and factor analysis and a rating scale model identified affective, behavioral, perseverance, and confidence dimensions underlying the data. The P-BAQ’s validity argument was supported by the theoretical and psychometric evidence, although adding a few items to the instrument would improve its construct representativeness.
Recommended Citation

Aryadoust, Vahid and Shahsavar, Zahra (2016) "Validity of the Persian Blog Attitude Questionnaire: An Evidence-Based Approach," Journal of Modern Applied Statistical Methods: Vol. 15: Iss. 1, Article 22.
Available at: http://digitalcommons.wayne.edu/jmasm/vol15/iss1/22
This study aims to examine the relationship between reading comprehension and lexical and grammatical knowledge among English as a foreign language students by using an artificial neural network (ANN). A total of 825 test takers were administered both a second-language reading test and a set of psychometrically validated grammar and vocabulary tests. Next, their reading, grammar, and vocabulary abilities were estimated with the Rasch model. A multilayer ANN was used to classify low- and high-ability readers based on their grammar and vocabulary measures. The ANN accurately classified approximately 78% of readers with reference to their vocabulary and grammar knowledge. This finding is consistent with cognitive theories of reading that treat the lexical and grammatical knowledge of learners as a major factor in distinguishing poor from competent readers. The study also confirmed previous research in finding that vocabulary knowledge was more strongly associated with reading comprehension than grammatical knowledge.
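A multilayer ANN classifier of this kind can be sketched in a few lines: a small feed-forward network trained by gradient descent to separate low- and high-ability readers from two predictor measures. The data below are simulated (with vocabulary weighted more heavily than grammar, echoing the study's finding), and the architecture is an illustrative choice, not the study's network.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 600

# Simulated Rasch-style ability measures (hypothetical, not the study's data).
vocab   = rng.normal(size=n)
grammar = rng.normal(size=n)
reading = 0.8 * vocab + 0.4 * grammar + rng.normal(scale=0.5, size=n)
y = (reading > np.median(reading)).astype(float)  # 1 = high-ability reader

X = np.column_stack([vocab, grammar])
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)   # hidden layer
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)   # output layer

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, 1 / (1 + np.exp(-(h @ W2 + b2)))  # sigmoid output = P(high)

lr = 0.3
for _ in range(800):                   # plain batch gradient descent
    h, p = forward(X)
    err = (p - y[:, None]) / n         # gradient of mean cross-entropy w.r.t. logits
    dh = (err @ W2.T) * (1 - h**2)     # backprop through tanh
    W2 -= lr * h.T @ err;  b2 -= lr * err.sum(0)
    W1 -= lr * X.T @ dh;   b1 -= lr * dh.sum(0)

accuracy = ((forward(X)[1][:, 0] > 0.5) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

On this simulated data the noise term caps achievable accuracy well below 100%, which is the same reason a real classifier of readers tops out around figures like the study's 78%.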
Learner Listening: New Insights and Directions from Empirical Studies
The Kurdish language is mainly spoken in Iran, Iraq, Turkey, and Syria. Some of its dialects still retain features the language possessed in the past; among them is Hawrami, which is mainly spoken in western Iran (along with other areas ...
Modelling listening item difficulty remains a challenge to this day. Latent trait models such as the Rasch model, used to predict the outcomes of test takers’ performance on test items, have been criticized as “thin on substantive theory” (Stenner, Stone, & Burdick, 2011, p. 3). The use of regression models to predict item difficulty also has its limitations, because linear regression assumes linearity and normality of data which, if violated, result in a lack of fit. In addition, classification and regression trees (CART), despite their rigorous algorithm, do not always yield a stable tree structure (Breiman, 2001). Another problem pertains to the operationalization of dependent variables. Researchers have relied on content specialists or verbal protocols elicited from test takers to determine the variables predicting item difficulty. However, even though content specialists are highly competent, they may not be able to determine precisely the lower-level comprehension processes used ...
This article reports on the development and administration of the Academic Listening Self-rating Questionnaire (ALSA). The ALSA was developed on the basis of a proposed model of academic listening comprising six related components. The researchers operationalized the model, subjected items to iterative rounds of content analysis, and administered the finalized questionnaire to international ESL (English as a second language) students in Malaysian and Australian universities. Structural equation modeling and rating scale modeling of data provided content-related, substantive, and structural validity evidence for the instrument. The researchers explain the utility of the questionnaire for educational and assessment purposes.
Research into the psychological and cognitive aspects of language learning, and second language (L2) learning in particular, demands new measurement tools that provide highly detailed information about language learners’ progress and proficiency. A new development in measurement models is cognitive diagnostic assessment (CDA), which helps language assessment researchers evaluate students’ mastery of specific language sub-skills with greater specificity than other item response theory models. This paper discusses the tenets of CDA models in general and the fusion model (FM) in particular, and reports the results of a study applying the FM to the lecture-comprehension section of a practice version of the International English Language Testing System (IELTS) listening module. The FM separates only two major listening sub-skills (i.e., the ability to understand explicitly stated information and to make close paraphrases), likely indicating construct underrepresentation. It also provides a mastery/non-mastery profile of test takers. Implications for assessing listening comprehension and the IELTS are discussed.