Development of a Computerized Adaptive Test for Depression

E.  de Beurs

Development of a Computerized Adaptive Test for Depression

2012, Archives of General Psychiatry

ORIGINAL ARTICLE Development of a Computerized Adaptive Test for Depression Robert D. Gibbons, PhD; David J. Weiss, PhD; Paul A. Pilkonis, PhD; Ellen Frank, PhD; Tara Moore, MA, MPH; Jong Bae Kim, PhD; David J. Kupfer, MD Context: Unlike other areas of medicine, psychiatry is almost entirely dependent on patient report to assess the presence and severity of disease; therefore, it is particularly crucial that we find both more accurate and efficient means of obtaining that report. Objective: To develop a computerized adaptive test (CAT) for depression, called the Computerized Adaptive Test–Depression Inventory (CAT-DI), that decreases patient and clinician burden and increases measurement precision. Design: Case-control study. Setting: A psychiatric clinic and community mental health center. Participants: A total of 1614 individuals with and without minor and major depression were recruited for study. Main Outcome Measures: The focus of this study was the development of the CAT-DI. The 24-item Hamilton Rating Scale for Depression, Patient Health Questionnaire 9, and the Center for Epidemiologic Studies Depression Scale were used to study the convergent validity of the new measure, and the Structured Clinical Interview Author Affiliations: Center for Health Statistics, Departments of Medicine and Health Studies, University of Chicago, Chicago, Illinois (Drs Gibbons and Kim); Department of Psychology, University of Minnesota, Minneapolis (Dr Weiss); and Western Psychiatric Institute, University of Pittsburgh, Pittsburgh, Pennsylvania (Drs Pilkonis, Frank, and Kupfer and Ms Moore). I for DSM-IV was used to obtain diagnostic classifications of minor and major depressive disorder. Results: A mean of 12 items per study participant was required to achieve a 0.3 SE in the depression severity estimate and maintain a correlation of r =0.95 with the total 389-item test score. Using empirically derived thresholds based on a mixture of normal distributions, we found a sensitivity of 0.92 and a specificity of 0.88 for the classification of major depressive disorder in a sample consisting of depressed patients and healthy controls. Correlations on the order of r=0.8 were found with the other clinician and self-rating scale scores. The CAT-DI provided excellent discrimination throughout the entire depressive severity continuum (minor and major depression), whereas the traditional scales did so primarily at the extremes (eg, major depression). Conclusions: Traditional measurement fixes the number of items administered and allows measurement uncertainty to vary. In contrast, a CAT fixes measurement uncertainty and allows the number of items to vary. The result is a significant reduction in the number of items needed to measure depression and increased precision of measurement. Arch Gen Psychiatry. 2012;69(11):1104-1112 MAGINE A 1000-ITEM MATHEMAT- ics test with items ranging in difficulty from basic arithmetic to advanced calculus. Consider 2 examinees: a fourth grader and a graduate student in mathematics. Most questions will be uninformative for both examinees (too difficult for the first and too easy for the second). To decrease examinee burden, we could create a short test of 10 items, equally spaced along the mathematics difficulty continuum. Although this test would be quick to administer, it would provide imprecise estimates of these 2 examinees’ ability because only 1 or 2 items would be appropriate for either examinee. A better approach would be to be- ARCH GEN PSYCHIATRY/ VOL 69 (NO. 11), NOV 2012 1104 gin by administering an item of intermediate difficulty and, based on the response scored as correct or incorrect, select the next item at a level of difficulty either lower or higher. This process would continue until the uncertainty in the estimated ability is smaller than a predefined threshold. This process is called computerized adaptive testing (CAT). To use CAT, we must first calibrate a bank of test items using an item response theory (IRT) model that relates properties of the test items (eg, their difficulty and discrimination) to the ability (or other trait) of the examinee. The paradigm shift is that rather than administering a fixed number of items that provide limited information for any given ex- WWW.ARCHGENPSYCHIATRY.COM ©2012 American Medical Association. All rights reserved. Corrected on November 26, 2012 Downloaded From: https://jamanetwork.com/ on 05/28/2020 Author Affil Health Statis of Medicine University o Illinois (Drs Department University o Minneapolis Western Psy University o Pittsburgh, P Pilkonis, Fra Ms Moore). aminee, we adaptively administer a varying number of items that are targeted to the examinee’s specific level of ability or impairment. CAT allows us to adaptively select a small set of items for each examinee out of a much larger bank of test items, targeting precision by selecting items based on prior ability, trait, or impairment estimates. Although use of CAT and IRT has been widespread in educational measurement, it has been less widely used in mental health measurement.1,2 First, large item banks are generally unavailable for mental health constructs. Second, mental health constructs (eg, depression) are inherently multidimensional, and CAT has primarily been restricted to unidimensional constructs, such as mathematics achievement. Application of unidimensional models to multidimensional data can result in biased trait estimates (severity or impairment) and underestimates of uncertainty.3 Multidimensional IRT-based CAT has been previously used in analysis of the 626-item Mood and Anxiety Spectrum Scales.4 To our knowledge, this was the first study of mental health CAT using a large item bank and multidimensional IRT.5,6 CAT required a mean of 24 items per examinee, yet maintained a correlation of r=0.93 with the full 626-item score. In this article, we apply multidimensional CAT to the measurement of depression using the CAT–Depression Inventory (CAT-DI). Unlike other areas of medicine, psychiatry is almost entirely dependent on patient report (either self-report or reports to evaluators) to assess the presence and severity of disease; therefore, it is particularly crucial that we find more accurate and efficient means of obtaining that report. We use the assessment of depression as an example of what CAT might offer toward that goal. The goal of this report is to describe a new tool for the measurement of depression. With the exception of the previously cited demonstration study4 with the Mood and Anxiety Spectrum Scales, the statistical foundation of the new tool has not been used in any other area of instrument development and provides a statistical advance over other approaches that are based on unidimensional models and much smaller item banks. Although we will explore the extent to which these theoretical advantages translate to gains in measurement precision, reliability, and validity in a future statistical article, the results of this study clearly demonstrate the improvement in fit of the multidimensional bifactor model over traditional unidimensional alternatives. METHODS THE BIFACTOR MODEL The bifactor model5-7 is a multidimensional IRT model that allows each item to measure the primary dimension (eg, depression) and a single subdomain (eg, somatization), hence, the term bifactor. The bifactor model has major computational and interpretational advantages over unrestricted exploratory item factor analytic models8-10 and extends CAT to the measurement of multidimensional constructs. For example, traditional item factor analysis8 is generally restricted to 5 dimensions, and the results may be interpreted differently, depending on the rotation of the solution (eg, varimax vs promax rotations). The bi- factor model has no limitation in terms of the number of dimensions that can be included under the restriction that each item loads on the primary dimension and 1 subdimension. The solution is also rotationally invariant, leading to more direct interpretation. Technical details are provided in the eAppendix (http://www.archgenpsychiatry.com). COMPUTERIZED ADAPTIVE TEST Within CAT, items are selected during the process of test administration for each individual, allowing the test administrator to control measurement precision and to maximize efficiency. CAT includes (1) a precalibrated item bank, (2) an item selection procedure, (3) a scoring method, and (4) a criterion for terminating the test. In the current study, we used maximum item information,11 appropriately modified for the bifactor model (eAppendix), to select items for CAT administration and a dual termination criterion of 0.3 for the SE of the estimated primary dimension score or maximum information remaining in the bank of less than 1.25. In this way, CAT terminates if at an examinee’s currently estimated specific severity level (eg, extreme score) there is insufficient information for any item to achieve the intended level of precision. Relaxing the SE termination criterion in the 2 extremes (floor and ceiling) reduces respondent burden without compromising our ability to rank patients in terms of severity. Technical details are provided in the eAppendix. THE ITEM BANK The total item bank consisted of 452 depression items. We organized the items into conceptually meaningful categories using a hierarchical approach informed by previous empirical (eg, factor analytic) work. Previous work documents that these constructs can be partitioned usefully into subdomains (eg, mood, cognition, and somatic indicators) and factors, and it is informative about the best manifest indicators (items) for operationalizing such factors.12-15 Our hierarchy included domains (eg, depression), subdomains (eg, mood, cognition, and behavior), factors (eg, within depressed mood, factors included increased negative affect and decreased positive affect), and facets (eg, within increased negative affect, facets included sadness, irritability, moodiness, and others). The total number of facets was 46 for depression (eTable 1). Example items from each domain and subdomain are presented in Table 1. A key step in editing the item bank was qualitative review of the items done by consensus among the members of the Pittsburgh research site (see the article by DeWalt et al16 for a description of the qualitative procedures used by the PatientReported Outcomes Measurement Information System [PROMIS] network and adapted here). This process involved identification of redundant items (so that they are not administered in the same testing session), items that were too narrow (often by virtue of being disease specific), items that were confusing or vague, and items that were poorly written. Most items were rated on a 5-point ordinal scale. The items were selected based on a review of more than 100 existing depression or depression-related rating scales (eTable 2). Items were modified to refer to the previous 2-week period and to have consistent response categories. The CAT-DI has a mandatory suicide screening question, which is presented at the end of the CAT session if not previously administered. If positively endorsed, a suicide alert report is generated and the test administrator notified. In the final version of the CAT-DI, an option will be available to have 4 suicide items administered to achieve an even more complete evaluation of suicide risk. ARCH GEN PSYCHIATRY/ VOL 69 (NO. 11), NOV 2012 1105 WWW.ARCHGENPSYCHIATRY.COM ©2012 American Medical Association. All rights reserved. Corrected on November 26, 2012 Downloaded From: https://jamanetwork.com/ on 05/28/2020 PARTICIPANT SAMPLE Table 1. Example Items From Each Domain and Subdomain Domain or Subdomain Mood–negative affect Mood–positive affect Cognition–information deficits Cognition–information unproductive Cognition–impaired view Cognition–social cognition Cognition-hopelessness Cognition-helplessness Cognition-guilt Behavior–low activity Behavior–low energy Behavior-interpersonal Behavior-agitation Somatic–sleep problems Somatic–eating changes Somatic-gastrointestinal Somatic–increased pain Somatic–general somatic Somatic–diurnal variation Suicidal ideation Example Item (In the Past 2 Weeks) Were you serious, introverted, or gloomy? How much did any feelings of depression bother you? How much were you able to relax and enjoy yourself? I was able to reach down deep into myself for comfort. Did you drift in and out of conversations? I had difficulty concentrating. Did you find that silly or unreasonable thoughts kept recurring in your mind? Did you see the future as very bleak? You felt constantly afraid of doing something wrong. How much have you felt inferior to others? Have you felt lonely? You felt as if others were causing all of your problems. To what extent did you feel your life was meaningful? I felt a sense of purpose in my life. Did you feel defeated? I felt like I was at the end of my rope. I felt I should be punished. I was concerned about being forgiven for my sins. Did you have difficulty starting to do anything? I felt emotionally drained from my work. Did you have a lot of trouble getting out of bed in the morning? I felt that everything I did was an effort. How much have you felt like being alone? How much have you felt withdrawn from others? I felt restless as if I had to always be on the move. Did you find it difficult to sit still or to lie down, or you needed to pace the room or be in motion? How satisfied were you with your sleep? My sleep was restless. I did not feel like eating; my appetite was poor. I found food unappealing. Did you repeatedly have distressing physical symptoms, for instance, you had nausea or other stomach or bowel problems? Did you repeatedly have distressing physical symptoms, for instance, you were constipated? Were you more sensitive or less sensitive than usual to heat, cold or pain? Did you repeatedly have distressing physical symptoms, for instance, frequent headaches? Did you feel as if your body were diseased or somehow transformed? Did your mood become depressed when you had a medical problem such as the flu or a cold? Morning was when I felt the best. Has there been any time of day when you felt slower and less energetic? Did you think that life was not worth living? Did you think about taking your own life? Study participants were male and female treatment-seeking outpatients between 18 and 80 years of age. Patients were recruited from 2 facilities, the Western Psychiatric Institute and Clinic (WPIC) at the University of Pittsburgh and a community clinic at DuBois Regional Medical Center (DuBois RMC). DuBois RMC is one of the leading health centers in western Pennsylvania. The Center for Behavioral Health Services at DuBois RMC provides comprehensive inpatient and outpatient psychiatric care. Participants were screened at both the WPIC and Dubois RMC for eligibility. If they had been in psychiatric treatment within the past 2 years, they were considered a psychiatric participant. If they had not had psychiatric treatment within the past 2 years, they were considered a nonpsychiatric control. Psychiatric diagnoses were confirmed by medical records and their treating physician or clinician. Patients with and without a lifetime diagnosis of major depressive disorder (MDD) were included. For psychiatric participants, exclusion criteria included the following: history of schizophrenia, schizoaffective disorder, or psychosis; organic neuropsychiatric syndromes (eg, Alzheimer disease or other forms of dementia, Parkinson disease, and so on); drug or alcohol dependence within the past 3 months (however, patients with episodic abuse related to mood episodes were not excluded); inpatient treatment status; and individuals who were unable or unwilling to provide informed consent. For nonpsychiatric controls, exclusion criteria included the following: any psychiatric diagnosis within the past 12 months; treatment for a psychiatric problem within the past 12 months; positive responses to telephone screen questions; history of schizophrenia, schizoaffective disorder, or psychosis; and individuals who were unable or unwilling to provide informed consent. We report on the analysis of data from 798 individuals used to calibrate the IRT model (WPIC) and 816 individuals who received the live CAT-DI (414 WPIC and 402 DuBois RMC study participants). For simulated adaptive testing, 308 participants (of the 798) took all of the 389 items in the bank after 63 items were deleted from the original set of items, permitting computation of the correlation between results of CAT and total test score; these participants were also part of the calibration sample. The other 490 calibration participants took a subsample of 252 items based on a balanced incomplete block design that maximized the pairings of all items.17 To study the validity of the CAT-DI, using the calibration and simulated CAT phases described, 292 consecutive psychiatric participants received a full clinician-based diagnostic interview via the Structured Clinical Interview for DSM-IV (SCID)18 and the live CAT-DI. Participants received increased compensation ($50 rather than $25) to complete this phase of the study. Demographic characteristics and SCID-based diagnostic prevalence rates are presented in Table 2. The CAT-DI was also administered to 100 nonpsychiatric controls. Nonpsychiatric controls were defined as having no psychiatric treatment within the past 2 years. Controls were recruited via flyers, print media, and Audix messages through the University of Pittsburgh Medical Center. To examine convergent validity, data were also obtained for the Patient Health Questionnaire 9 (PHQ-9),19 24-item Hamilton Rating Scale for Depression (HAM-D),20 and the Center for Epidemiologic Studies Depression Scale (CES-D).21 The HAM-D was administered by a trained clinician, and the PHQ-9 and CES-D were self-reports. Major depression, minor depression (DSM-IV appendix B), and dysthymia were defined according to DSM-IV criteria. The screening information was available to the clinician. ARCH GEN PSYCHIATRY/ VOL 69 (NO. 11), NOV 2012 1106 WWW.ARCHGENPSYCHIATRY.COM ©2012 American Medical Association. All rights reserved. Corrected on November 26, 2012 Downloaded From: https://jamanetwork.com/ on 05/28/2020 Table 2. Demographic Characteristics and Diagnostic Prevalence Rates 80 70 Study Participants, % Sex Male Female Age, y 18-29 30-39 40-49 50-59 ⱖ60 Educational level Some high school or ⬍12th grade High school diploma or GED Some college College graduate Graduate or professional degree Annual income, $ ⬍24 999 25 000-49 999 50 000-74 999 75 000-99 999 ⬎100 000 Not reported Prevalence rates Major depression Minor depression No depression Dysthymia 60 30.0 70.0 No. of Cases Characteristic 21.2 17.1 23.0 26.7 12.0 50 40 30 20 10 0 5.1 21.9 39.8 20.2 13.0 54.1 25.9 9.2 3.9 3.9 3.0 46.7 5.1 45.1 3.1 Abbreviation: GED, graduate educational development. STATISTICAL ANALYSIS Calibration was performed using the bifactor model and a unidimensional alternative, both based on graded IRT models.6 A likelihood ratio ␹2 statistic was used to determine whether the bifactor model improved fit over the simpler unidimensional alternative. The CAT-DI scores were based on expected a posteriori estimates.22 The CAT-DI scores were then used in a logistic regression to predict a physician-based DSM diagnosis of MDD so that the CAT-DI scores could be related to the probability of meeting DSM-IV criteria for MDD. The empirical distribution of the CAT-DI scores was resolved into a mixture of 2 normal distributions23 and compared with models with 1 and 3 component distributions using the Bayesian information criterion (BIC) to determine the best-fitting model. Sensitivity and specificity of the CAT-DI in predicting SCID diagnostic group classification (MDD and MDD plus minor depression) were determined for 3 thresholds based on the estimated mixture distribution (ie, posterior probability of being in the elevated component distribution of 0.50, 0.80, and 0.95). RESULTS CALIBRATION Results of the calibration study revealed that the bifactor model with 5 subdomains (mood, cognition, behavior, somatic, and suicide) significantly improved fit over a uni2 = 6825, P ⬍ .001). A total of dimensional IRT model (␹389 389 items were retained in the model based on having a primary factor loading of 0.3 or greater (96% ⬎0.4 and 79% ⬎0.5). –4 –3 –2 –1 0 1 2 3 CAT-DI Score Figure 1. Observed and estimated frequency distributions of the Computerized Adaptive Test–Depression Inventory (CAT-DI) depression scale scores. SIMULATED CAT Results of simulated CAT revealed that for an SE of 0.3, a mean of 12.31 items per participant (range, 7-22) were required. The correlation between the 12-item mean length CAT and the total 389-item score was r = 0.95. For an SE of 0.4 (less precise), a mean of 5.94 items were required (range, 4-16), but a strong correlation with the 389-item total score (r = 0.92) was maintained. The median length of time required to complete the 12-item (mean) CAT was 2.29 minutes (interquartile range, 1.722.97 minutes) compared with 51.66 minutes for the 389item test. Faster times should be achievable using the final platform (a touch screen device) instead of the Windows-based mouse interface currently used. Mean precision was 0.31, and CAT was terminated for insufficient item information in 15.0% of the cases. In all but 2 cases, the estimated CAT-DI score was less than −2.0, indicating no evidence of depression (too few symptoms to precisely measure). In the other 2 cases, the scores were greater than ⫹2.5, indicating extreme severity (too many symptoms to precisely measure). EMPIRICAL DISTRIBUTION OF THE CAT-DI SCORES Figure 1 reveals that our sample can be resolved into 2 discrete distributions of depressive severity, with the lower component representing the absence of clinical depression and the higher component representing severity levels associated with clinical depression. The BIC indicated best fit (ie, smallest value) for a mixture of 2 normal distributions (BIC = 2406) relative to a single normal distribution (BIC = 2439) or a mixture of 3 normal distributions (BIC = 2414). The means of the 2 distributions are well separated. A total of 40.8% of the sample is in the elevated component distribution that has a mean of 0.17 vs the lower component distribution that has a mean of −1.61. The pooled estimate of the SD is 0.73, making the 2 component distributions approximately 2.5 SDs apart. Threshold scores of −0.61, −0.19, and 0.28 have a .50, .80, and .95 probability of being in the elevated component distribution, respectively. ARCH GEN PSYCHIATRY/ VOL 69 (NO. 11), NOV 2012 1107 WWW.ARCHGENPSYCHIATRY.COM ©2012 American Medical Association. All rights reserved. Corrected on November 26, 2012 Downloaded From: https://jamanetwork.com/ on 05/28/2020 A B 3 30 2 PHQ-9 Score CAT-DI Score 23 1 0 –1 15 8 –2 –3 0 None Minor Depression MDD Minor Depression MDD None Minor Depression MDD D 50 60 40 48 CES-D Score HAM-D Score C None 30 20 10 36 24 12 0 0 None Minor Depression MDD Figure 2. Distributions of the Computerized Adaptive Test–Depression Inventory (CAT-DI) scores for patients who met diagnostic criteria for minor depression (including dysthymia) and major depression disorder (MDD) vs those who did not meet the criteria. A, CAT-DI depression scale score. B, Patient Health Questionnaire 9 (PHQ-9) score. C, 24-Item Hamilton Rating Scale for Depression (HAM-D) score. D, Center for Epidemiologic Studies Depression Scale (CES-D) score. Error bars indicate the range; horizontal lines, the 10th, 50th, and 90th percentile points, respectively. RELATIONSHIP TO DIAGNOSIS OF MINOR AND MAJOR DEPRESSION Figure 2A displays the distributions of the CAT-DI scores for patients who met the diagnostic criteria for minor depression (including dysthymia) and MDD vs those who did not meet the criteria. There is a clear linear progression between the CAT-DI depression severity scores and the SCID diagnostic categories of none, minor depression (including dysthymia), and MDD. The distributions were also reasonably symmetric within each diagnostic group. Means (SDs) and sample sizes were −0.93 (0.75) (n = 117) for no depression, −0.02 (0.58) (n = 29) for minor depression, and 0.47 (0.693) (n = 146) for MDD. Statistically significant differences between no depression and minor depression (t144 = 6.121, P ⬍ .001), no depression and MDD (t261 = 15.736, P ⬍ .001), and minor depression and MDD (t173 = 3.558, P ⬍ .001) were found, with corresponding effect sizes of 1.271, 1.952, and 0.724 SDs, respectively. COMPARISON WITH OTHER DEPRESSION SCALES Convergent validity of the CAT-DI was assessed by comparing results of the CAT-DI with the PHQ-9, HAM-D, and CES-D results. Correlations were r = 0.81 with the PHQ-9, r = 0.75 with the HAM-D, and r = 0.84 with the CES-D. In general, the distribution of scores among the diagnostic categories showed greater overlap (ie, less diagnostic specificity particularly for no depression vs minor depression), greater variability, and greater skewness for these other scales relative to the CAT-DI (Figure 2, B-D). DIAGNOSTIC SCREENING With the 100 healthy controls as a comparator, sensitivity and specificity for predicting MDD using the 50% probability threshold (CAT-DI score = −0.61) were 0.92 and 0.88, respectively. Results were similar for the combination of major and minor depression (0.90 sensitivity and 0.88 specificity). Using patients treated for depression in the past 2 years who did not currently meet DSM-IV criteria for minor depression or MDD as a comparator yielded sensitivity and specificity of 0.92 and 0.64 for MDD and 0.90 and 0.64 for minor depression and MDD combined, respectively. The lower specificity is due to elevated depressive symptoms in patients currently or recently in treatment for depression who did not meet DSM-IV criteria for MDD and/or minor depression. Using the 95% probability threshold of ⫹0.28 increased specificity to 0.98 but decreased sensitivity to 0.63. The increased threshold is rarely (2%) exceeded by patients without MDD, but only 63% of patients with MDD exceeded ARCH GEN PSYCHIATRY/ VOL 69 (NO. 11), NOV 2012 1108 WWW.ARCHGENPSYCHIATRY.COM ©2012 American Medical Association. All rights reserved. Corrected on November 26, 2012 Downloaded From: https://jamanetwork.com/ on 05/28/2020 Table 3. Results of 2 Computerized Adaptive Test Administrations 100 90 Pr(MDD) = 0.97 80 Item (In the Past 2 Weeks) 70 % 60 Pr(MDD) = 0.50 50 Patient 1 (low severity) a I felt depressed. Have you been in low or very low spirits? How much were you distressed by feelings of worthlessness? I had difficulty sleeping. How much have you felt discouraged? Did fatigue interfere with your mood? How often has feeling depressed interfered with what you do? How much were you distressed by feeling everything was an effort? Have you had problems accomplishing less than you would like with your work or other regular daily activities as a result of emotional problems (such as feeling depressed or anxious)? How much of the time have you been moody or brooded about things? Did you feel isolated from others? Patient 2 (high severity) b I felt depressed. Have you felt that life was not worth living? Have you been in low or very low spirits? I felt gloomy. How much have you felt that nothing was enjoyable? How much were you distressed by blaming yourself for things? How much were you distressed by feeling everything was an effort? 50th Percentile 40 30 20 10 7th Percentile 0 –2.5 –1.5 –0.5 0.5 1.5 2.5 CAT-DI Depression Score Figure 3. Percentile rank (among patients with a major depressive disorder diagnosis) and probability (expressed as percentage) of a major depressive disorder diagnosis. The y-axis refers to both of the curves portrayed on the graph. CAT-DI indicates Computerized Adaptive Test–Depression Inventory and Pr(MDD), probability of major depressive disorder. this threshold. Depending on the application, different thresholds can be used. A reasonable balance for application in clinical (depression) samples is to use an 80% classification probability threshold (CAT-DI score of −0.19), which produces sensitivity of 0.82 and specificity of 0.85. As expected, the rate of positive CAT-DI scores (using the 50% threshold of −0.61) was higher (51.7%) at the university psychiatric clinic than at the community mental health clinic (31.1%). The CAT-DI scores were strongly related to MDD diagnosis (odds ratio = 24.19; 95% CI, 10.51-55.67; P ⬍ .001). A unit increase in the CAT-DI score has an associated 24-fold increase in the probability of meeting criteria for MDD. This relationship is shown in Figure 3. Figure 3 also presents the CAT-DI score percentile ranking for patients with DSM-IV–diagnosed MDD. For example, a patient with a CAT-DI score of −0.6 has a .50 probability of meeting criteria for MDD but would be at the lower seventh percentile of the distribution of the CAT-DI scores among patients who met criteria for MDD. By contrast, a patient with a CAT-DI score of 0.5 would have a .97 probability of meeting criteria for MDD and would be at the 50th percentile of patients meeting criteria for MDD. EXAMPLE CAT ADMINISTRATIONS Table 3 presents item-by-item results for 2 CAT admin- istrations: 1 study participant with low severity and 1 with high severity. The participant with low severity required 11 items to achieve an SE less than 0.3, and the participant with high severity required 12 items. The first participant had a score of −0.892, which corresponds to a probability of .33 of meeting criteria for MDD and a percentile of 3.4% among patients with MDD. As such, we would not consider this to be a patient who is currently depressed. By contrast, the second participant had a score of 1.028, which corresponds to a probability of .99 of meeting criteria for MDD and a percentile of 83.9% among patients with MDD. As such, we would consider this to be a patient who is highly likely to be currently depressed. If we had used a less stringent SE termination criterion of 0.4, Response Score (SE) Information Score A little of the time −0.706 (0.615) A little of the time −0.761 (0.530) 4.056 3.490 A little bit −0.679 (0.452) 3.361 A little bit −0.722 (0.431) 3.283 A little bit −0.803 (0.403) 2.720 Occasionally −0.845 (0.375) 2.534 No more than usual −0.812 (0.353) 2.393 A little bit −0.799 (0.335) 2.401 −0.877 (0.324) 2.319 A little of the time −0.905 (0.308) 2.303 A little of the time −0.892 (0.298) 2.207 Most of the time Quite a bit 0.474 (0.621) 0.810 (0.551) 4.056 3.815 Most of the time 0.900 (0.485) 3.547 Quite a bit Quite a bit 0.917 (0.437) 0.951 (0.424) 3.320 2.855 Quite a bit 0.973 (0.384) 2.640 Quite a bit 0.996 (0.353) 2.502 No (continued) we would have terminated item administration for both participants after 6 items and would have obtained severity scores within 5% of the final values. ARCH GEN PSYCHIATRY/ VOL 69 (NO. 11), NOV 2012 1109 WWW.ARCHGENPSYCHIATRY.COM ©2012 American Medical Association. All rights reserved. Corrected on November 26, 2012 Downloaded From: https://jamanetwork.com/ on 05/28/2020 Table 3. Results of 2 Computerized Adaptive Test Administrations (continued) Item (In the Past 2 Weeks) How often did you have negative feelings, such as blue mood, despair, anxiety, or depression? How much difficulty have you been having in the area of mood swings or unstable moods? I could not get going. How much were you distressed by feelings of guilt? I was unhappy. Score (SE) Information Score Often 0.961 (0.342) 2.468 Quite a bit of difficulty 0.994 (0.322) 0.322 Most of the time Quite a bit 1.017 (0.313) 1.029 (0.302) 2.400 2.365 Often 1.028 (0.299) 2.341 Response a For patient 1: score, −0.892; SE, 0.298; probability of major depressive disorder, .33; and percentile among patients with major depressive disorder, 3.4%. b For patient 2: score, 1.028; SE, 0.299; probability of major depressive disorder, .99; and percentile among patients with major depressive disorder, 83.9%. COMMENT Results of this study reveal that we can extract most of the information (r = 0.95) from a bank of 389 depression items using a mean of only 12 items (median of 2 minutes 17 seconds) per study participant. The paradigm shift is that rather than using a fixed number of items and allowing measurement uncertainty to vary, we fix measurement uncertainty to an acceptable level for a given application and allow the number and specific items administered to vary from participant to participant. As an example, changing our termination threshold from an SE of 0.30 to 0.40 decreased the mean number of items administered from 12 to 6, with only a small corresponding decrease in correlation with the total 389-item score (r = 0.95 to r = 0.92). Such efficiency would permit depression screening of large populations necessary for conducting studies of psychiatric epidemiology and determining phenotypes for large-scale molecular genetic studies. The ability to administer the CAT-DI in a few minutes via the Internet, without clinician assistance, makes routine depression screening of patients in primary care possible because the results of the test can be transmitted directly to the medical record and discussed with the patient by the physician at the time of their visit. We note that depressed patients are particularly difficult to assess with a long scale, and the benefits of CAT administration are therefore particularly important in this setting. From a taxonomic point of view, the distribution of the CAT-DI scores in this study sample suggests a mixture of 2 component distributions, one pathological and one not. However, both distributional components appear to have normal distributions, demonstrating that there is considerable interindividual variability in patients with and without clinical depression. The 2 component distributions are well separated by approximately 2.5 SDs, making classification straightforward. A somewhat unexpected result is the jointly high levels of sensitivity (0.92) and specificity (0.88) for the CAT-DI for predicting MDD when using general medical sample controls as the comparison group. A different threshold can be used such that even in samples of patients actively or recently being treated for depression, the CAT-DI accurately predicts MDD. The use of a finite mixture distribution model to empirically derive thresholds for studying sensitivity and specificity for diagnostic classification appropriate for the particular sample evaluated in this study is another unique methodologic feature of this study. The CAT-DI exhibits strong correlation (r = 0.75 to r = 0.84) with other established depression rating scales, both self-rated and clinician rated. Despite this strong association, the CAT-DI appears to be better able to differentiate patients with minor depression from patients who did not meet criteria for minor or major depression. This is not at all surprising because the CAT-DI is adaptively selecting items from a large bank of possible items that are targeted to the specific level of depressive severity of each patient. By contrast, the fixed-length scales consist of a relatively small number of items, for which only a few may be discriminating at any specific level of depressive severity. As a result, they may do well in the extremes (eg, MDD) but have less information for intermediate severity levels. This is one of the major advantages of CAT. As applied in this study, CAT does not solve all measurement problems. For example, if we were interested simply in diagnostic classification, we would be better off adaptively selecting items that maximized measurement precision in the region between healthy and depressed. Furthermore, although the CAT-DI may be extremely helpful in many settings, it is not able to assess functional impairment or assess the immediacy of treatment necessity. There are numerous areas for further study. As previously demonstrated,4 CAT can be applied to a wide variety of areas in mental health measurement, with similar reductions in patient and clinician burden. As a part of the current research program, we have also developed CAT instruments for anxiety and bipolar disorder, which we will soon report. Application to the assessment of depression in children is also viable; however, different item banks would be required and methods for dealing with developmental shifts must also be incorporated. The measurement of depression in different populations (eg, Hispanics) can also be accommodated by translating the CAT-DI item bank into different languages (eg, Spanish) and then testing for differential item functioning between our current sample and a culturally different sample. It is likely that items that may provide excellent discrimination of high and low depressive severity in one culture or race may not perform as well in another. Identifying the most informative items for a given population may provide further benefits over traditional measurement. ARCH GEN PSYCHIATRY/ VOL 69 (NO. 11), NOV 2012 1110 WWW.ARCHGENPSYCHIATRY.COM ©2012 American Medical Association. All rights reserved. Corrected on November 26, 2012 Downloaded From: https://jamanetwork.com/ on 05/28/2020 These same opportunities for future research represent the limitations of the current study. It is unclear how the CAT-DI will function in other cultures. It is unlikely that the CAT-DI is currently suitable for children and adolescents, and further study is also required in elderly populations. In this article, we focus exclusively on the primary depressive dimension; however, there may be interest in the subdomains as well. The results of this study must be cross-validated. The number of patients with minor depression in our study is small. Our estimates of sensitivity and specificity are specific to the population sampled and may be different in other populations, for example, screening depression in an inpatient or outpatient general medical population. The estimated parameters of the mixture distribution (ie, differences in means, SDs, and the proportion in the elevated component distribution) are also sample dependent, and we would expect them to vary, for example, if we looked at a primary care setting where the incidence of depression was lower. Our estimates are based on the combination of an outpatient mental health clinic and a community mental health center in which we would expect a relatively high incidence of patients in the elevated component distribution (ie, high depressive severity). The inclusion of healthy controls, however, allows us to better characterize the lower component distribution. Evaluation of this mixture distribution in other populations (eg, primary care) is important for testing the generalizability of the cut points we have derived for classifying patients as depressed and for assessing sensitivity and specificity. In addition, further statistical research is under way for directly estimating the mixture distribution as a part of the bifactor model and using it to obtain expected a posteriori estimates of the scores. The National Institutes of Health–funded PROMIS initiative has also studied patient-reported outcomes using CAT and IRT, including depression.24,25 The primary differences between PROMIS and the CAT-DI for the measurement of depression are (1) the PROMIS item bank consists of 28 items, whereas the CAT-DI bank consists of 389 items; (2) PROMIS has relied on unidimensional IRT models, whereas the CAT-DI incorporates multidimensionality produced by the sampling of items from distinct subdomains of depression; and (3) the CAT-DI has been developed for measuring the severity of depression and screening for depression, whereas PROMIS has focused on measurement of depressive severity. The theoretical advantage of the large item bank and multidimensional approach to calibration is that the CAT-DI is highly discriminating across the entire depressive severity continuum, and the adaptively administered items are representative of several different underlying domains of depressive symptoms. Forcing unidimensionality results in small item banks that may limit generalizability. Further study of this important issue is under way. The CAT-DI may also have special relevance for longitudinal studies, where the score on the previous measurement occasion can be used as the starting value for the next CAT, leading to even fewer items administered. It would also be of interest to compare the CAT-DI with the HAM-D in terms of identifying treatment effects in randomized clinical trials. Increased precision may be obtained by decreasing the termination SE. This conjecture will be tested in a future study. The CAT-DI depression scale currently exists as a research instrument; however, the Windows-based and webbased versions of the program will be completed at the end of 2012 (see www.healthstats.org for details). The programs will be made available in both standalone and cloud computing environments and will be fully supported. Further work continues on establishing the validity of the CAT-DI anxiety and bipolar scales. Submitted for Publication: August 19, 2011; accepted January 4, 2012. Correspondence: Robert D. Gibbons, PhD, Center for Health Statistics, Departments of Medicine and Health Studies, University of Chicago, 5841 S Maryland Ave, MC 2007 Office W260, Chicago, IL 60637 (rdg@uchicago .edu). Conflict of Interest Disclosures: None reported. Funding/Support: This work was supported by grant R01MH66302 from the National Institute of Mental Health. Online-Only Material: The eAppendix and eTables 1 and 2 are available at http://www.archgenpsychiatry.com. Additional Information: The CAT-DI will ultimately be made available for routine administration, and its development as a commercial product is under consideration. Additional Contributions: We acknowledge the outstanding support of R. Darrell Bock, PhD, Scott Turkin, MD, Damara Walters, BA, Suzanne Lawrence, MA, and Victoria Grochocinski, PhD. We are also indebted to the reviewers for many excellent comments and suggestions. REFERENCES 1. Fliege H, Becker J, Walter OB, Bjorner JB, Klapp BF, Rose M. Development of a Computer-Adaptive Test for Depression (D-CAT). Qual Life Res. 2005;14(10): 2277-2291. 2. Gardner W, Shear K, Kelleher KJ, Pajer KA, Mammen O, Buysse D, Frank E. Computerized adaptive measurement of depression: a simulation study. BMC Psychiatry. 2004;4:13-23. 3. Gibbons RD, Immekus J, Bock RD. Multi-dimensional Models for Outcomes Measurement.http://outcomes.cancer.gov/areas/measurement/multi-dimensional .html. Accessed August 29, 2012. 4. Gibbons RD, Weiss DJ, Kupfer DJ, Frank E, Fagiolini A, Grochocinski VJ, Bhaumik DK, Stover A, Bock RD, Immekus JC. Using computerized adaptive testing to reduce the burden of mental health assessment. Psychiatr Serv. 2008;59(4):361-368. 5. Gibbons RD, Hedeker DR. Full-information item bi-factor analysis. Psychometrika. 1992;57:423-436. 6. Gibbons RD, Bock RD, Hedeker D, Weiss D, Segawa E, Bhaumik DK, Kupfer D, Frank E, Grochocinski V, Stover A. Full-Information item bi-factor analysis of graded response data. Appl Psychol Meas. 2007;31:4-19. 7. Holzinger KJ, Swineford F. The bifactor method. Psychometrika. 1937;2:41-54. 8. Bock RD, Aitkin M. Marginal maximum likelihood estimation of item parameters: application of an EM algorithm. Psychometrika. 1981;46:443-459. 9. Gibbons RD, Rush J, Immekus JC. On the psychometric validity of the diagnostic domains of the PDSQ: an illustration of the bi-factor item response theory model. J Psychiatr Res. 2009;43(4):401-410. 10. Cai L. A two-tier full-information item factor analysis model with applications. Psychometrika. 2010;75:581-612. 11. Samejima F. Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph. 1969;17:1-68. 12. Quilty LC, Zhang KA, Bagby RM. The latent symptom structure of the Beck Depression Inventory–II in outpatients with major depression. Psychol Assess. 2010; 22(3):603-608. 13. Santor DA, Gregus M, Welch A. Eight decades of measurement in depression. Measurement. 2006;4:135-155. ARCH GEN PSYCHIATRY/ VOL 69 (NO. 11), NOV 2012 1111 WWW.ARCHGENPSYCHIATRY.COM ©2012 American Medical Association. All rights reserved. Corrected on November 26, 2012 Downloaded From: https://jamanetwork.com/ on 05/28/2020 14. Shafer AB. Meta-analysis of the factor structures of four depression questionnaires: Beck, CES-D, Hamilton, and Zung. J Clin Psychol. 2006;62(1):123-146. 15. Simms LJ, Grös DF, Watson D, O’Hara MW. Parsing the general and specific components of depression and anxiety with bifactor modeling. Depress Anxiety. 2008; 25(7):E34-E46. 16. DeWalt DA, Rothrock N, Yount S, Stone AA; PROMIS Cooperative Group. Evaluation of item candidates: the PROMIS qualitative item review. Med Care. 2007; 45(5)(suppl 1):S12-S21. 17. Cochran WG, Cox GM. Experimental Designs. New York, NY: Wiley; 1957. 18. First MB, Spitzer RL, Gibbon M, Williams JB. Structured Clinical Interview for the DSM-IV Axis I Disorders Clinician Version (SCID-CV). Washington, DC: American Psychiatric Press, Inc; 1996. 19. Kroenke K, Spitzer RL, Williams JBW. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001;16(9):606-613. 20. Hamilton M. A rating scale for depression. J Neurol Neurosurg Psychiatry. 1960; 23:56-62. 21. Radloff LS. The CES-D scale: a self-report depression scale for research in the general population. Appl Psychol Meas. 1977;1:385-401. 22. Bock RD, Gibbons RD. Factor analysis of categorical item responses. In: Nering M, Ostini R, eds. Handbook of Polytomous Item Response Theory Models: Development and Applications. Florence, KY: Lawrence Erlbaum; 2010. 23. Gibbons RD, Dorus E, Ostrow DG, Pandey GN, Davis JM, Levy DL. Mixture distributions in psychiatric research. Biol Psychiatry. 1984;19(7):935-961. 24. Pilkonis PA, Choi SW, Reise SP, Stover AM, Riley WT, Cella D; PROMIS Cooperative Group. Item banks for measuring emotional distress from the PatientReported Outcomes Measurement Information System (PROMIS姞): depression, anxiety, and anger. Assessment. 2011;18(3):263-283. 25. Teresi JA, Ocepek-Welikson K, Kleinman M, Eimicke JP, Crane PK, Jones RN, Lai JS, Choi SW, Hays RD, Reeve BB, Reise SP, Pilkonis PA, Cella D. Analysis of differential item functioning in the depression item bank from the Patient-Reported Outcomes Measurement Information System (PROMIS): an item response theory approach. Psychol Sci Q. 2009;51(2):148-180. ARCH GEN PSYCHIATRY/ VOL 69 (NO. 11), NOV 2012 1112 WWW.ARCHGENPSYCHIATRY.COM ©2012 American Medical Association. All rights reserved. Corrected on November 26, 2012 Downloaded From: https://jamanetwork.com/ on 05/28/2020

Log In

Development of a Computerized Adaptive Test for Depression

Related papers

Related papers

Related topics