Bernard Veldkamp

University of Twente, Behavioural, Management and Social Sciences, Faculty Member

Papers by Bernard Veldkamp

Infeasibility in Automated Test Assembly Models: A Comparison Study of Different Methods

Journal of Educational Measurement, 2005

Several techniques exist to automatically put together a test meeting a number of specifications. In an item bank, the items are stored with their characteristics. A test is constructed by selecting a set of items that fulfills the specifications set by the test assembler. Test assembly problems are often formulated in terms of a model consisting of restrictions and an...

Classifying Unstructured Textual Data Using the Product Score Model: An Alternative Text Mining Algorithm

BMC Medical Informatics and Decision Making, 2012

Unstructured textual data such as students’ essays and life narratives can provide helpful information in educational and psychological measurement, but often contain irregularities and ambiguities, which creates difficulties in analysis. Text mining techniques that seek to extract useful information from textual data sources through identifying interesting patterns are promising. This chapter describes the general procedures of text classification using text...

Minimizing the testlet effect: identifying critical testlet features by means of tree-based regression

BMC Medical Informatics and Decision Making, 2012

Standardized tests often group items around a common stimulus. Such groupings of items are called testlets. The potential dependency among items within a testlet is generally ignored in practice, even though a basic assumption of item response theory (IRT) is that individual items are independent of one another. A technique called tree-based regression (TBR) was applied to identify key features...

Generalizability theory and item response theory

by Cees Glas and Bernard Veldkamp

BMC Medical Informatics and Decision Making, 2012

Item response theory is usually applied to items with a selected-response format, such as multiple choice items, whereas generalizability theory is usually applied to constructed-response tasks assessed by raters. However, in many situations, raters may use rating scales consisting of items with a selected-response format. This chapter presents a short overview of how item response theory and generalizability theory were...

An Overview of Innovative Computer-Based Testing

BMC Medical Informatics and Decision Making, 2012

Driven by the technological revolution, computer-based testing (CBT) has witnessed an explosive rise in the last decades, in both psychological and educational assessment. Many paper-and-pencil tests now have a computer-based equivalent. Innovations in CBT are almost innumerable, and innovative and new CBTs continue to emerge on a very regular basis. Innovations in CBT may best be described along a continuum of...

Influences on classification accuracy of exam sets: an example from vocational education and training

BMC Medical Informatics and Decision Making, 2012

Classification accuracy of single exams is well studied in the educational measurement literature. However, when making important decisions, such as certification decisions, one usually uses several exams: an exam set. This chapter elaborates on classification accuracy of exam sets. This is influenced by the shape of the ability distribution, the height of the standards, and the possibility for compensation. This...

Infeasibility in Automated Test Assembly Models: A Comparison Study of Different Methods

by Angela Verschoor and Bernard Veldkamp

Multidimensional adaptive testing with constraints on test content

Psychometrika, 2002

The case of adaptive testing under a multidimensional response model with large numbers of constraints on the content of the test is addressed. The items in the test are selected using a shadow test approach. The 0–1 linear programming model that assembles the shadow tests maximizes posterior expected Kullback-Leibler information in the test. The procedure is illustrated for five different...

A blending of computer-based assessment and performance-based assessment: Multimedia-Based Performance Assessment (MBPA). The introduction of a new method of assessment in Dutch Vocational Education and Training (VET)

CADMO, 2014

The effect of regulation feedback in a computer-based formative assessment on information problem solving

Computers & Education, 2015

http://authors.elsevier.com/a/1Qu~h1Hucd1YX5 This study examines the effect of regulation feedback in a computer-based formative assessment in the context of searching for information online. Fifty 13-year-old students completed two randomly selected assessment tasks, receiving automated regulation feedback between them. Student performance was (self-)graded by students and by experts. Expert grades, as well as student (self-)grades, showed a significant increase between Task 1 and Task 2. However, further analysis of the expert grades showed significant improvement in performance for girls only. Furthermore, the formative assessment system traced the number of searches and the number of websites consulted per student to complete the two assignments. On average, the results showed that students consulted significantly more websites for Task 2 than for Task 1. The average number of searches did not differ significantly between Tasks 1 and 2. On the other hand, significant differences were found for those students who, during the evaluation of their performance on Task 1, explicitly stated that they would increase their searches.

ASSESSING FIT OF LATENT REGRESSION MODELS

by Bernard Veldkamp, Sandip Sinharay, and Matthias von Davier

ETS Research Report Series, 2009

Classifying Unstructured Textual Data Using the Product Score Model: An Alternative Text Mining Algorithm

Psychometrics in practice at RCEC, 2012

Use of different sources of information in maintaining standards: examples from the Netherlands

In the different tests and examinations that are used at a national level in the Netherlands, a variety of equating and linking procedures are applied to maintain assessment standards. This chapter presents an overview of potential sources of information that can be used in the standard setting of tests and examinations. Examples from test practices in the Netherlands are provided that apply some of these sources of information. This chapter discusses how the different sources of information are applied and aggregated to set the levels. It also discusses under which circumstances performance information of the population would be sufficient to set the levels and when additional information is necessary.

Selecting Testlet Features With Predictive Value for the Testlet Effect: An Empirical Study

SAGE Open

High-stakes tests often consist of sets of questions (i.e., items) grouped around a common stimulus. Such groupings of items are often called testlets. A basic assumption of item response theory (IRT), the mathematical model commonly used in the analysis of test data, is that individual items are independent of one another. The potential dependency among items within a testlet is often ignored in practice. In this study, a technique called tree-based regression (TBR) was applied to identify key features of stimuli that could properly predict the dependence structure of testlet data for the Analytical Reasoning section of a high-stakes test. Relevant features identified included Percentage of “If” Clauses, Number of Entities, Theme/Topic, and Predicate Propositional Density; the testlet effect was smallest for stimuli that contained 31% or fewer “if” clauses, contained 9.8% or fewer verbs, and had Media or Animals as the main theme. This study illustrates the merits of TBR in the ana...

Dealing with multiple criteria in test assembly

by Bernard Veldkamp and Mariagiulia Matteucci

It is quite common for tests or exams to be used for more than one purpose. First of all, they are used to measure the ability of the students in a reliable manner. Besides, they can be used for pass/fail decisions or to predict future behavior of the candidate, like future job behavior or academic performance. The question remains how to assemble a test that can be used for all these different purposes, that is, how to assemble a multi-objective test. Multiple objectives can result from different purposes, but also from the way test specifications have been implemented. For the WDM-model, for multidimensional IRT, for Cognitive Diagnostic CAT, but also for infeasibility analysis, multiple objective test assembly problems have to be solved. In this paper, a 2-stage method is presented for dealing with multiple objectives in test assembly. In the normalization stage, all objectives are brought on a common scale. In the valorization stage, the different objectives are be...

Designing Item Pools for Adaptive Testing

Elements of Adaptive Testing, 2009

In existing adaptive testing programs, each successive item in the test is chosen to optimize an objective. Examples of well-known objectives are maximizing the information in the test at the ability estimate for the test taker or minimizing the deviation of its information from a target value at the estimate. In addition, item selection is required to realize a set of content specifications for the test. For example, item content may be required to follow a certain taxonomy or the answer-key distribution for the test must not deviate too much from uniformity. Content specifications are generally defined in terms of combinations of attributes the items in the test should have. They are typically realized by imposing a set of constraints on the item-selection process. The presence of both an objective and a set of constraints in adaptive testing leads to the notion of adaptive testing as a constrained (sequential) optimization problem; for a more formal introduction to this notion, see van der Linden (this volume, chap. 2).

Designing Item Pools for Computerized Adaptive Testing

Computerized Adaptive Testing: Theory and Practice, 2000

In existing computerized adaptive testing (CAT) programs, each successive item in the test is cho...

Towards an Integrative Formative Approach of Data-Driven Decision Making, Assessment for Learning, and Diagnostic Testing

This study concerns the comparison of three approaches to assessment: Data-Driven Decision Making, Assessment for Learning, and Diagnostic Testing. Although the three approaches claim to be beneficial with regard to student learning, no clear study into the relationships and distinctions between these approaches exists to date. The goal of this study was to investigate the extent to which the three approaches can be shaped into an integrative formative approach towards assessment. The three approaches were compared on nine characteristics of assessment. The results suggest that, although the approaches seem to be contradictory with respect to some characteristics, they could complement each other despite these differences. The researchers discuss how the three approaches can be shaped into an integrative formative approach towards assessment.

Application of optimization methods to assessment design

by Bernard Veldkamp and A. Oranje

Psychometric analysis of the performance data of simulation-based assessment: A systematic review and a Bayesian network example

Computers & Education, 2015
