Several techniques exist to automatically put together a test meeting a number of specifications. In an item bank, the items are stored with their characteristics. A test is constructed by selecting a set of items that fulfills the specifications set by the test assembler. Test assembly problems are often formulated in terms of a model consisting of restrictions and an
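The selection problem described in this abstract can be sketched as a small 0-1 search. This is a minimal illustration, not the method of the paper: the item bank, the 2PL information function as the objective, and the content constraint are all assumptions made for the example, and exhaustive search stands in for a real integer programming solver.

```python
import math
from itertools import combinations


def item_information(a, b, theta):
    """Fisher information of a 2PL item (assumed model) at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


# Hypothetical item bank: (item id, discrimination a, difficulty b, content area).
BANK = [
    ("i1", 1.2, -1.0, "algebra"),
    ("i2", 0.8, 0.0, "algebra"),
    ("i3", 1.5, 0.2, "geometry"),
    ("i4", 1.0, 0.5, "geometry"),
    ("i5", 1.7, -0.1, "algebra"),
    ("i6", 0.6, 1.0, "geometry"),
]


def assemble_test(bank, length, theta, min_per_area):
    """Try every item set of the given length and keep the feasible set
    with the largest summed information at theta."""
    best_set, best_info = None, -1.0
    for candidate in combinations(bank, length):
        # Count items per content area to check the restrictions.
        counts = {}
        for _, _, _, area in candidate:
            counts[area] = counts.get(area, 0) + 1
        if any(counts.get(area, 0) < n for area, n in min_per_area.items()):
            continue  # violates a content specification
        info = sum(item_information(a, b, theta) for _, a, b, _ in candidate)
        if info > best_info:
            best_set, best_info = candidate, info
    return best_set, best_info


selected, total_info = assemble_test(
    BANK, length=4, theta=0.0, min_per_area={"algebra": 2, "geometry": 2}
)
selected_ids = [item[0] for item in selected]
print(selected_ids, round(total_info, 3))
```

Exhaustive search is only workable for toy banks like this one; with realistic bank sizes the same restrictions and objective would be handed to an integer programming solver.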
Unstructured textual data such as students’ essays and life narratives can provide helpful information in educational and psychological measurement, but often contain irregularities and ambiguities, which create difficulties in analysis. Text mining techniques that seek to extract useful information from textual data sources through identifying interesting patterns are promising. This chapter describes the general procedures of text classification using text
Standardized tests often group items around a common stimulus. Such groupings of items are called testlets. The potential dependency among items within a testlet is generally ignored in practice, even though a basic assumption of item response theory (IRT) is that individual items are independent of one another. A technique called tree-based regression (TBR) was applied to identify key features
Item response theory is usually applied to items with a selected-response format, such as multiple choice items, whereas generalizability theory is usually applied to constructed-response tasks assessed by raters. However, in many situations, raters may use rating scales consisting of items with a selected-response format. This chapter presents a short overview of how item response theory and generalizability theory were
Driven by the technological revolution, computer-based testing (CBT) has witnessed an explosive rise over the last decades, in both psychological and educational assessment. Many paper-and-pencil tests now have a computer-based equivalent. Innovations in CBT are almost innumerable, and new CBTs continue to emerge on a very regular basis. Innovations in CBT may best be described along a continuum of
Classification accuracy of single exams is well studied in the educational measurement literature. However, when making important decisions, such as certification decisions, one usually uses several exams: an exam set. This chapter elaborates on classification accuracy of exam sets. This is influenced by the shape of the ability distribution, the height of the standards, and the possibility for compensation. This
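How the standard and the possibility for compensation affect the classification accuracy of an exam set can be explored with a small simulation. The sketch below rests on assumptions that are not taken from the chapter: a standard normal ability distribution, one ability underlying all exams, observed scores equal to ability plus independent normal error, and the same cutoff on every exam.

```python
import random


def simulate_accuracy(n_exams=3, cutoff=0.0, error_sd=0.5,
                      n_persons=5000, seed=1):
    """Estimate the proportion of correct pass/fail decisions under a
    conjunctive rule (pass every exam) and a compensatory rule (pass on
    the mean score). All model choices here are illustrative assumptions."""
    rng = random.Random(seed)
    hits = {"conjunctive": 0, "compensatory": 0}
    for _ in range(n_persons):
        ability = rng.gauss(0.0, 1.0)  # assumed ability distribution
        observed = [ability + rng.gauss(0.0, error_sd) for _ in range(n_exams)]
        true_pass = ability >= cutoff
        conjunctive_pass = all(score >= cutoff for score in observed)
        compensatory_pass = sum(observed) / n_exams >= cutoff
        hits["conjunctive"] += int(conjunctive_pass == true_pass)
        hits["compensatory"] += int(compensatory_pass == true_pass)
    return {rule: count / n_persons for rule, count in hits.items()}


accuracy = simulate_accuracy()
print(accuracy)
```

Changing `cutoff` moves the height of the standard, and replacing the `rng.gauss(0.0, 1.0)` draw changes the shape of the ability distribution, so the three factors named in the abstract can each be varied separately.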
The case of adaptive testing under a multidimensional response model with large numbers of constraints on the content of the test is addressed. The items in the test are selected using a shadow test approach. The 0–1 linear programming model that assembles the shadow tests maximizes posterior expected Kullback-Leibler information in the test. The procedure is illustrated for five different
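To give a feel for the objective named in this abstract, the sketch below computes a posterior expected Kullback-Leibler information for a single item. It is a deliberately simplified, unidimensional illustration: the 2PL response model, the ability grid, and the normal-shaped posterior weights are assumptions for the example, whereas the paper works with a multidimensional model inside a shadow-test linear program.

```python
import math


def prob_correct(a, b, theta):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def kl_information(a, b, theta_hat, theta):
    """KL divergence between the item's response distributions at the
    current estimate theta_hat and at another ability value theta."""
    p0 = prob_correct(a, b, theta_hat)
    p1 = prob_correct(a, b, theta)
    return p0 * math.log(p0 / p1) + (1.0 - p0) * math.log((1.0 - p0) / (1.0 - p1))


def posterior_expected_kl(a, b, theta_hat, grid, posterior):
    """Average the KL information over a discretized posterior."""
    return sum(w * kl_information(a, b, theta_hat, t)
               for t, w in zip(grid, posterior))


# Hypothetical discretized posterior over ability (normalized weights).
GRID = [-2.0 + 0.5 * k for k in range(9)]
raw = [math.exp(-0.5 * t * t) for t in GRID]
POSTERIOR = [w / sum(raw) for w in raw]

weak = posterior_expected_kl(0.5, 0.0, 0.0, GRID, POSTERIOR)
strong = posterior_expected_kl(1.5, 0.0, 0.0, GRID, POSTERIOR)
print(round(weak, 4), round(strong, 4))
```

In a shadow-test setting, a value like this would serve as the objective coefficient of each candidate item in the 0–1 model; that step is not shown here.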
http://authors.elsevier.com/a/1Qu~h1Hucd1YX5 This study examines the effect of regulation feedback in a computer-based formative assessment in the context of searching for information online. Fifty 13-year-old students completed two randomly selected assessment tasks, receiving automated regulation feedback between them. Student performance was (self-)graded by students and by experts. Expert grades, as well as student (self-)grades, showed a significant increase between Task 1 and Task 2. However, further analysis of the expert grades showed significant improvement in performance for girls only. Furthermore, the formative assessment system traced the number of searches and the number of websites consulted per student to complete the two assignments. On average, the results showed that students consulted significantly more websites for Task 2 than for Task 1. The average number of searches did not differ significantly between Tasks 1 and 2. On the other hand, significant differences were found for those students who, during the evaluation of their performance on Task 1, explicitly stated that they would increase their searches.
In the different tests and examinations that are used at a national level in the Netherlands, a variety of equating and linking procedures are applied to maintain assessment standards. This chapter presents an overview of potential sources of information that can be used in the standard setting of tests and examinations. Examples from test practices in the Netherlands are provided that apply some of these sources of information. This chapter discusses how the different sources of information are applied and aggregated to set the levels. It also discusses under which circumstances performance information of the population would be sufficient to set the levels and when additional information is necessary.
High-stakes tests often consist of sets of questions (i.e., items) grouped around a common stimulus. Such groupings of items are often called testlets. A basic assumption of item response theory (IRT), the mathematical model commonly used in the analysis of test data, is that individual items are independent of one another. The potential dependency among items within a testlet is often ignored in practice. In this study, a technique called tree-based regression (TBR) was applied to identify key features of stimuli that could properly predict the dependence structure of testlet data for the Analytical Reasoning section of a high-stakes test. Relevant features identified included Percentage of “If” Clauses, Number of Entities, Theme/Topic, and Predicate Propositional Density; the testlet effect was smallest for stimuli that contained 31% or fewer “if” clauses, contained 9.8% or fewer verbs, and had Media or Animals as the main theme. This study illustrates the merits of TBR in the ana...
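The core step of tree-based regression, finding the cut point on a stimulus feature that best separates high from low values of the outcome, can be sketched in a few lines. The data below are made up purely for illustration and are not the study's data; the function is a single-split search, not the full tree-growing and pruning procedure.

```python
def best_split(x, y):
    """Return (threshold, sse) for the single split on feature x that
    minimizes the summed squared error of y within the two groups."""
    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    distinct = sorted(set(x))
    # Candidate thresholds lie halfway between neighbouring feature values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Made-up example: percentage of "if" clauses per stimulus and a
# hypothetical estimated testlet effect for each stimulus.
pct_if = [10, 20, 25, 30, 35, 40, 50, 60]
effect = [0.1, 0.2, 0.1, 0.2, 0.8, 0.9, 0.8, 0.9]

threshold, error = best_split(pct_if, effect)
print(threshold, round(error, 4))
```

A full regression tree repeats this search recursively within each resulting group and across all candidate features.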
It is quite common for tests or exams to be used for more than one purpose. First of all, they are used to measure the ability of the students in a reliable manner. In addition, they can be used for pass/fail decisions or to predict future behavior of the candidate, such as future job behavior or academic performance. The question remains how to assemble a test that can be used for all these different purposes, that is, how to assemble a multi-objective test. Moreover, multiple objectives can result not only from different purposes, but also from the way test specifications have been implemented. For the WDM-model, for multidimensional IRT, for Cognitive Diagnostic CAT, but also for infeasibility analysis, multiple objective test assembly problems have to be solved. In this paper, a 2-stage method is presented for dealing with multiple objectives in test assembly. In the normalization stage, all objectives are brought onto a common scale. In the valorization stage, the different objectives are be...
In existing adaptive testing programs, each successive item in the test is chosen to optimize an objective. Examples of well-known objectives are maximizing the information in the test at the ability estimate for the test taker or minimizing the deviation of its information from a target value at the estimate. In addition, item selection is required to realize a set of content specifications for the test. For example, item content may be required to follow a certain taxonomy or the answer-key distribution for the test must not deviate too much from uniformity. Content specifications are generally defined in terms of combinations of attributes the items in the test should have. They are typically realized by imposing a set of constraints on the item-selection process. The presence of both an objective and a set of constraints in adaptive testing leads to the notion of adaptive testing as a constrained (sequential) optimization problem; for a more formal introduction to this notion, see van der Linden (this volume, chap. 2).
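The combination of an objective and content constraints at each step of an adaptive test can be illustrated with a greedy selection rule. This is a minimal sketch under assumptions: the 2PL information function, the item bank, and the per-area maxima are hypothetical, and a simple greedy filter is used here rather than the formal constrained optimization treated in the chapter.

```python
import math


def information(a, b, theta):
    """Fisher information of an assumed 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_next_item(bank, administered, theta_hat, max_per_area):
    """Pick the unused item with maximum information at theta_hat among
    the items whose content area has not yet reached its maximum."""
    counts = {}
    for item_id in administered:
        area = bank[item_id]["area"]
        counts[area] = counts.get(area, 0) + 1

    best_id, best_info = None, -1.0
    for item_id, item in bank.items():
        if item_id in administered:
            continue
        if counts.get(item["area"], 0) >= max_per_area[item["area"]]:
            continue  # selecting this item would violate a content constraint
        info = information(item["a"], item["b"], theta_hat)
        if info > best_info:
            best_id, best_info = item_id, info
    return best_id


# Hypothetical bank and content specification.
BANK = {
    "q1": {"a": 1.6, "b": 0.0, "area": "A"},
    "q2": {"a": 1.4, "b": 0.1, "area": "A"},
    "q3": {"a": 0.9, "b": 0.0, "area": "B"},
    "q4": {"a": 1.0, "b": 1.5, "area": "B"},
}
MAX_PER_AREA = {"A": 1, "B": 2}

first = select_next_item(BANK, [], 0.0, MAX_PER_AREA)
second = select_next_item(BANK, [first], 0.0, MAX_PER_AREA)
print(first, second)
```

Once area "A" is full, the more informative item "q2" is skipped in favour of an item from area "B". A greedy filter like this can paint itself into a corner when constraints interact, which is one motivation for treating item selection as a constrained optimization problem over the whole test.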
Computerized Adaptive Testing: Theory and Practice, 2000
In existing computerized adaptive testing (CAT) programs, each successive item in the test is chosen to optimize an objective function. Examples of well-known objectives in CAT are maximizing the information in the test at the ability estimate for the examinee and ...
This study concerns the comparison of three approaches to assessment: Data-Driven Decision Making, Assessment for Learning, and Diagnostic Testing. Although the three approaches claim to be beneficial with regard to student learning, no clear study into the relationships and distinctions between these approaches exists to date. The goal of this study was to investigate the extent to which the three approaches can be shaped into an integrative formative approach towards assessment. The three approaches were compared on nine characteristics of assessment. The results suggest that although the approaches seem to be contradictory with respect to some characteristics, it is argued that they could complement each other despite these differences. The researchers discuss how the three approaches can be shaped into an integrative formative approach towards assessment.
Papers by Bernard Veldkamp