Validation of the Behavior of a Knowledge Base Implementing Clinical Guidelines for Point-of-Care Antiretroviral Toxicity Monitoring

William Ogallo, RPh, PhD1, Carol Friedman, PhD1, Andrew S. Kanter, MD, MPH1
1Department of Biomedical Informatics, Columbia University, New York, NY

Abstract

This study investigated the automated detection of antiretroviral toxicities in structured electronic health record data. The evaluation compared responses generated by 5 clinical pharmacists and 1 prototype knowledge-based application for 15 randomly selected test cases. The main outcomes were inter-subject dissimilarity of responses, quantified by the Jaccard distance, and the mean proportion of correct responses by each subject. The statistical differences in inter-subject Jaccard distances suggested that the prototype was inferior to clinical pharmacists in the detection of possible antiretroviral toxicity associations from structured data. The dissimilarities were attributable to inadequate domain coverage by the prototype. The differences in the mean proportion of correct responses between the clinical pharmacists and the prototype were statistically indistinguishable. Overall, this study suggests that knowledge-based applications have the potential to support automated detection of antiretroviral toxicities from structured patient records. Furthermore, the study demonstrates a systematic approach for validating such applications quantitatively.

Introduction

The World Health Organization (WHO) recognizes the need to improve the monitoring of antiretroviral toxicity 1. It places particular emphasis on underserved settings, where the range and patterns of antiretroviral toxicities may alter the need for and frequency of antiretroviral toxicity monitoring 2. The WHO recommends a symptom-directed antiretroviral monitoring approach in which clinicians assess the signs and symptoms reported by patients and subsequently draw conclusions about antiretroviral toxicities 1, 3.
However, the labor intensity associated with gathering and analyzing data using the symptom-directed approach limits its utility in settings that face health workforce challenges 4, 5. Consequently, it is essential to develop strategies that improve the quality of the symptom-directed antiretroviral toxicity monitoring approach. Empirical research suggests that the use of electronic point-of-care clinical decision support (CDS) tools in the management of HIV improves adherence to clinical guidelines and makes the collection, analysis, and interpretation of data easier 6, 7. However, there is little evidence to demonstrate how such tools could be applied to improve symptom-directed monitoring of medication safety within HIV care workflows. Furthermore, research in the development and application of standard definitions, formatting, and reporting associated with symptom-directed monitoring of antiretroviral toxicity is limited.

Knowledge-based CDS systems could potentially improve the quality of point-of-care antiretroviral toxicity monitoring. The term ‘knowledge base’ refers to a repository of facts, heuristics, and models that represent domain knowledge and that can be used for problem-solving and analysis of organized data 8. A knowledge-based system, accordingly, is a software application that uses the knowledge stored in its knowledge base to analyze problems and provide advice within a restricted domain, much as human domain experts do 9. Unlike mathematical or statistical approaches that use numerical representation and arithmetic manipulation to quantitatively model relationships that support inferences in a given domain, knowledge-based approaches rely on symbolic manipulations that use ontologies and apply logic to draw conclusions from asserted facts 9, 10. This reasoning is exemplified by the deduction of clinical diagnoses from observations of symptoms and laboratory findings 9, 10.
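As a minimal sketch of this style of symbolic reasoning, the following forward-chaining loop fires rules over asserted facts until no new conclusions can be drawn. The facts and rules are hypothetical illustrations, not content from the prototype's knowledge base.

```python
# Asserted facts observed for a hypothetical patient.
facts = {"fever", "productive_cough", "elevated_wbc"}

# Each rule maps a set of premises to a conclusion (illustrative only).
rules = [
    ({"fever", "elevated_wbc"}, "possible_infection"),
    ({"possible_infection", "productive_cough"}, "possible_pneumonia"),
]

# Forward chaining: repeatedly fire any rule whose premises all hold,
# adding its conclusion as a new fact, until a full pass changes nothing.
changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print("possible_pneumonia" in facts)  # True
```

Note that the second rule only becomes applicable after the first one has fired, which is the essence of deducing a diagnosis from intermediate conclusions rather than from raw observations alone.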
Knowledge-based systems are primarily developed and used to increase the reproducibility, scalability, and accessibility of complex reasoning and decision-making tasks 11. Within the biomedical domain, knowledge-based systems are foundational applications that have remained popular to date 9, 11, 12. An excellent example of a knowledge-based system is MYCIN, a rule-based computer-assisted decision support system developed in 1976 by Ted Shortliffe et al. to support inference on the selection of antibiotic therapy for patients with bacterial infections 9, 12. Currently, knowledge-based systems are applied in several biomedical domains including, but not limited to, clinical decision support, surveillance in public health datasets, and hypothesis generation in large-scale research datasets 11.

During the development of a knowledge-based application, and before it is deployed for actual use, it is essential to ensure that its structure and behavior are free of detrimental design flaws 13. Validation studies assess the quality of a knowledge-based application by examining the functional completeness and the predictive accuracy of its knowledge base 14. These evaluations assess whether the knowledge base satisfactorily represents domain knowledge and whether non-design experts (domain experts who did not participate in the development of the knowledge-based application) agree that the information, rules, and procedures in the knowledge base are complete and accurate 14. While structural validation evaluates the similarities in how a knowledge base and non-design experts conceptualize and structurally represent knowledge, behavioral validation uses test cases to evaluate the similarity and compare the accuracy of outputs made by the knowledge base and by non-design experts 13, 14.
This paper describes the validation of the behavior of a knowledge-based application prototype that implements standard clinical guidelines for the point-of-care monitoring of antiretroviral toxicities. The goal of our analysis was to ascertain that the prototype generates patient-specific antiretroviral toxicity reports that are sufficiently accurate for clinical use. Specifically, we evaluated the similarity and accuracy of antiretroviral toxicity reports generated by the prototype compared to non-design human experts for a random sample of test cases. This paper reports our findings.

Methods

Generation of Antiretroviral Toxicity Summary Reports

We developed a knowledge-based application prototype intended to facilitate the documentation and analysis of antiretroviral toxicity data within ambulatory HIV care workflows. The prototype generates patient-specific summary reports that describe possible antiretroviral toxicities and their risk factors detected from electronic health record (EHR) data. The core content of the prototype’s knowledge base was derived from standard care guidelines and FDA-approved drug labels, and pertains to the major types of antiretroviral toxicities described in the WHO guidelines on the use of antiretroviral drugs 1. Table 1 lists these antiretroviral toxicities. The prototype organizes medication, regimen, and toxicity domain knowledge in a manner that supports reasoning through the traversal of the relationships described in its knowledge base. Similar to diagnostic decision support systems 15, clinicians can use the reports generated by the prototype to confirm or rule out antiretroviral toxicities experienced by individual patients and, if necessary, conduct additional assessments to narrow down diagnoses. The prototype’s detection of antiretroviral toxicities is based on the “Possible” causal category of the World Health Organization-Uppsala Monitoring Center system for standardized causality assessment 16.
This criterion requires the ascertainment of a reasonable temporal association between medication administration and the occurrence of toxicity. The prototype functions as follows. First, given a patient identifier, the prototype queries longitudinal EHR data to select the list of medications that constitute the patient’s active antiretroviral regimen and the dates when each drug was first prescribed. Next, the prototype queries the longitudinal EHR data to select the patient’s clinical observations and the dates when each observation was made. Subsequently, the prototype creates tuples consisting of the medications, the clinical observations, and the difference between the date when each clinical observation was recorded and the date when the antiretroviral drug was ordered. It then matches the selected tuples to relationships between medications, clinical observations, and predetermined time frames defined in its knowledge base. In so doing, it identifies the tuples in which the medications and clinical observations have temporal relationships that suggest possible antiretroviral toxicities. Lastly, the prototype matches the identified tuples with the antiretroviral toxicity concept-concept relationships in its knowledge base to generate a list of possible antiretroviral toxicities as output. For example, if the prototype finds the medication abacavir with the recording date 2017-01-14 and the observation rash with the recording date 2017-01-25, it generates the output abacavir hypersensitivity, since rash is a manifestation of hypersensitivity due to abacavir. The detection of possible risk factors proceeds similarly. However, some risk factor observations do not require temporal association with the administration of medications.
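The temporal matching steps above can be sketched as follows. The knowledge-base entry, the 42-day time frame, and the in-memory data structures are illustrative assumptions; the actual prototype queries an OpenMRS database and a richer relationship model.

```python
from datetime import date

# Hypothetical knowledge-base entries: (drug, observation, max_days, toxicity).
# The time frame and content are assumptions for illustration only.
KB = [
    ("abacavir", "rash", 42, "abacavir hypersensitivity"),
]

# Patient data: active regimen with first-prescription dates, and dated observations.
regimen = {"abacavir": date(2017, 1, 14)}
observations = [("rash", date(2017, 1, 25))]

def detect_possible_toxicities(regimen, observations, kb):
    """Build (drug, observation, day-difference) tuples and match them against
    the temporal relationships defined in the knowledge base."""
    findings = []
    for drug, start in regimen.items():
        for obs, recorded in observations:
            delta = (recorded - start).days
            for kb_drug, kb_obs, max_days, toxicity in kb:
                # A match requires the observation to fall within the
                # predetermined window after the drug was ordered.
                if drug == kb_drug and obs == kb_obs and 0 <= delta <= max_days:
                    findings.append(toxicity)
    return findings

print(detect_possible_toxicities(regimen, observations, KB))
# ['abacavir hypersensitivity'] (rash recorded 11 days after abacavir was ordered)
```

With the same knowledge base, a rash recorded a year after the drug was ordered would fall outside the window and produce no finding.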
For such risk factors, the prototype selects the list of active medications and the list of observations and matches these to the antiretroviral toxicity risk factor relationships in its knowledge base, regardless of the dates when they were recorded. For example, if the medication nevirapine is identified as active and the observation female gender is also identified, then the prototype generates the output female gender is a risk factor of nevirapine hepatotoxicity.

Study Design

This behavioral validation study compared the detection of antiretroviral toxicities, risk factors, and toxicity observations (symptoms, signs, and laboratory findings) from structured data by non-design human experts and a prototype knowledge-based application. Specifically, this study evaluated the similarity and the accuracy of reports generated by 5 clinical pharmacists and the prototype for 15 randomly selected test cases. The comparisons were conducted in an open domain, in which the universe of possible responses was not controlled, and in a restricted domain, in which possible responses were constrained to the knowledge content available in the prototype’s knowledge base.

Table 1.
Major antiretroviral toxicities described in the WHO HIV guidelines (2016)

ARV | Toxicity
Abacavir | Hypersensitivity reaction
Atazanavir/r | Electrocardiographic abnormalities (PR and QRS interval prolongation); Indirect hyperbilirubinemia (clinical jaundice); Nephrolithiasis
Zidovudine | Severe anemia, neutropenia; Lactic acidosis or severe hepatomegaly with steatosis; Lipoatrophy; Lipodystrophy; Myopathy
Darunavir/r | Hepatotoxicity; Severe skin and hypersensitivity reactions
Dolutegravir | Hepatotoxicity; Hypersensitivity reactions
Efavirenz | Persistent central nervous system toxicity (such as dizziness, insomnia, abnormal dreams) or mental symptoms (anxiety, depression, mental confusion); Convulsions; Hepatotoxicity; Severe skin and hypersensitivity reactions; Gynecomastia
Lopinavir/r | Electrocardiographic abnormalities (PR and QRS interval prolongation, torsades de pointes); Hepatotoxicity; Pancreatitis; Dyslipidaemia; Diarrhea
Nevirapine | Hepatotoxicity; Severe skin rash and hypersensitivity reaction, including Stevens-Johnson syndrome
Raltegravir | Rhabdomyolysis, myopathy, myalgia; Hepatitis and hepatic failure; Severe skin rash and hypersensitivity reaction
Tenofovir | Chronic kidney disease; Acute kidney injury and Fanconi syndrome; Decreases in bone mineral density; Lactic acidosis or severe hepatomegaly with steatosis

Study Procedure

The procedure used in the study was loosely based on the framework for validation of rule-based systems by Knauf et al. 17 and was in concordance with standard procedures for evaluating knowledge bases 14. The Knauf framework describes a process involving the generation of test scenarios and the use of a Turing Test-like approach to evaluating the responses of a rule-based system to the test scenarios 17. The key steps applied in this study were test case generation, test case presentation and experimentation, and data analysis.
These steps are described below.

a) Test case generation

The first step of the behavioral comparisons was the creation of test cases. In this study, the test cases were derived from raw data in published case reports on antiretroviral toxicities. In October 2016, a literature search was conducted to retrieve published case reports on antiretroviral toxicities. The case reports were identified by electronically searching the Ovid Medline® database. The search strategy involved the use of medical subject heading (MeSH) terms and search strings associated with the antiretroviral toxicities of interest, and was limited to case reports having abstracts and published in English between 2000 and 2016. Table 2 lists the queries used to obtain the case reports. A total of 114 case reports were identified, of which 6 duplicates were removed. Four reviewers independently reviewed the titles and abstracts of the case report articles retrieved from the search. Each article was independently reviewed by two reviewers, and each reviewer reviewed 54 articles. A fifth reviewer reviewed all 108 articles and acted as a tie-breaker during the selection of the articles. The goal of the review was to identify antiretroviral toxicity case reports in which the responsible medication, as well as the patient biodata, signs, symptoms, and laboratory findings, were reported. Reviewers were asked to include an article if and only if an adverse drug reaction was reported or described, the culprit drug was mentioned, and the reported case was about HIV/AIDS. They were also asked to identify the case reports that described patient characteristics such as age, gender, and weight as well as signs, symptoms, and laboratory findings.
The reviewers were asked to exclude case reports that were solely about the use of antiretroviral medications for the management of hepatitis infections, reports that only addressed treatment efficacy, and reports that were about genetics, tumors, or immunotherapy. Table 3 shows the consensus between pairs of reviewers who reviewed the same case reports, estimated using percent agreement and Cohen’s Kappa. The ratings from reviewer 1 were dropped because of the high rate of disagreement with the other reviewers. A total of 62 cases were identified from the 55 articles that were eventually included in the study.

The 62 cases, available as raw textual narratives, were structured and annotated to enable input and analysis by our prototype. The annotation was done using the National Center for Biomedical Ontology (NCBO) Annotator. The NCBO Annotator is an ontology-based web service for annotating raw text with ontology concepts from several biomedical terminology vocabularies in the Unified Medical Language System (UMLS) Metathesaurus and the NCBO BioPortal repositories 18, 19. For example, the text “A female patient using Amoxicillin complained of Rash” would be annotated with several UMLS concepts including female (CUI C0086287), amoxicillin (CUI C0002645), and rash (CUI C0015230). The annotation was done via the NCBO Annotator’s Representational State Transfer (REST) Web Service, with the ontology sources restricted to RxNorm, MedDRA, and LOINC. The resulting annotations for each of the 62 cases were manually grouped into 5 categories: descriptive characteristics (e.g., age, weight, gender), comorbidities, medications, signs/symptoms/findings, and laboratory test results. Two reviewers independently reviewed the structured annotations for each case.
The goals of this review were to check that the annotated concepts were indeed present in the raw text of the case, to identify redundant and synonymous concepts, to add concepts that were not identified by the annotator, and to fill in numeric values and reference ranges for concepts that had numeric values. The two reviewers compared their reviews for each case, with discrepancies resolved by consensus after confirmation with the raw text for the case in question. A stratified random sampling procedure was used to select the 15 test cases that were presented to the prototype and the experts in the study. Stratification by the type of antiretroviral medications and the type of antiretroviral drug toxicities was used to minimize sample selection bias.

b) Test case presentation and experimentation

Each test case was described as a pair of input test data and the corresponding output responses. Figure 1 illustrates an example of a test case. The input test data for a given test case comprised structured lists of the biodata, comorbidities, medications, signs/symptoms/findings, and laboratory test results. The input data was presented in two formats. In the first format, the test data was presented as observations in an OpenMRS database (MySQL) to enable analysis by the prototype. In the second format, the input data for a given test case was presented as a structured clinical vignette for the human experts using Google Forms. The output for a given test case was defined as lists of 1) possible antiretroviral toxicities, 2) possible antiretroviral toxicity risk factors, and 3) possible antiretroviral toxicity observations (signs, symptoms, and laboratory results). All 15 selected test cases were processed by the prototype and by 5 clinical pharmacists who did not participate in the development of the prototype.
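A test case of this input/output form can be represented as a simple record. The field names and example values below are illustrative assumptions, not the study's actual schema or data.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    # Structured inputs (five categories described in the Methods).
    biodata: list
    comorbidities: list
    medications: list
    signs_symptoms_findings: list
    lab_results: list
    # Expected outputs: the three response lists a subject should produce.
    toxicities: list = field(default_factory=list)
    risk_factors: list = field(default_factory=list)
    toxicity_observations: list = field(default_factory=list)

# A hypothetical test case (illustrative content only).
case = TestCase(
    biodata=["female", "age 34"],
    comorbidities=[],
    medications=["nevirapine"],
    signs_symptoms_findings=["jaundice"],
    lab_results=["elevated ALT"],
    toxicities=["nevirapine hepatotoxicity"],
)
print(case.toxicities)  # ['nevirapine hepatotoxicity']
```

The same record can be rendered either as database observations for the prototype or as a clinical vignette for the human experts, which is what keeps the two presentation formats comparable.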
The prototype processed the input data by querying its knowledge base and generating lists of 1) Ingredient-Toxicity pairs, 2) Ingredient-Toxicity-Risk Factor triples, and 3) Ingredient-Toxicity-Toxicity Observation triples. The human experts processed the input data by selecting choices to three multi-answer questions about each test case: 1) What antiretroviral toxicities could plausibly be identified from the case above? 2) What antiretroviral toxicity risk factors could plausibly be identified from the case above? 3) What antiretroviral toxicity manifestations could plausibly be identified from the case above?

Table 2. Terms and strategy for literature search in Ovid Medline

Search | Query | Results
1 | (Abacavir or ABC or Atazanavir or ATV or "ATV/r" or Dolutegravir or DTG or Darunavir or DRV or "DRV/r" or Efavirenz or EFV or Etravirine or ETV or ETR or Lopinavir or LPV or "LPV/r" or Nevirapine or NVP or Raltegravir or RAL or Tenofovir or TDF or Zidovudine or ZDV or AZT).ti. | 15423
2 | ("Drug-Related Side Effects and Adverse Reactions" or "Acidosis, Lactic" or "Acute kidney injury" or "Bone density" or "Drug Hypersensitivity" or "Drug-Induced Liver Injury" or "Fanconi syndrome" or "Fatty Liver" or "Heart Conduction System/abnormalities" or "Muscular Diseases" or "Renal Insufficiency, Chronic" or "Sleep Initiation and Maintenance Disorders" or Anemia or Anxiety or Confusion or Depression or Diarrhea or Dizziness or Dreams or Dyslipidemias or Gynecomastia or Hepatomegaly or Hyperbilirubinemia or Jaundice or Lipodystrophy or Nephrolithiasis or Neutropenia or Pancreatitis or Rhabdomyolysis or Seizures).sh. | 593315
3 | ("adverse drug reaction" or "adverse reaction" or "adverse drug event" or "adverse reaction" or "adverse event" or "toxicity" or allerg$ or "Abnormal Dreams" or "Acute kidney failure" or "Acute kidney injury" or "Acute renal failure" or "An?emia" or "Bone density" or "bone mineral density" or "breast enlargement" or "Central Nervous System Toxicity" or "Chronic Kidney Disease" or "Chronic Kidney Failure" or "Chronic Kidney Insufficiency" or "Chronic Renal Disease" or "Chronic Renal Failure" or "Chronic Renal Insufficiency" or "Drug-Induced Liver Injury" or "Electrocardiographic abnormalities" or "Enlarged Liver" or "Fanconi syndrome" or "Fatty Liver" or "Heart Conduction disorder" or "Hepatic failure" or "Hepatic Injury" or "Hepatic toxicity" or "Hepatomegaly" or "Hyperbilirubin?emia" or "Icterus" or "Jaundice" or "Kidney Stone" or "Kidney Stones" or "Lactic Acidosis" or "Liver Enlargement" or "Liver failure" or "Liver injury" or "Liver toxicity" or "loose bowel movement" or "Mental symptoms" or "Muscular Disease" or "Nephrolithiasis" or "Neutrop?enia" or "PR interval prolongation" or "QRS interval prolongation" or "QT interval prolongation" or "Renal colic" or "Renal Lithiasis" or "Skin reaction" or "Steatosis" or allerg$ or Anxiety or Cholesterol or Cholesterol?emia or Confusion or Convulsion? or Depression or Diarrh?ea or Dizziness or Dyslipidemia or Eruptions or Gyn?ecomastia or Hepatitis or Hepatotoxicity or Hypercholesterol?emia or Hypersensitivity or Hypertriglycerid?emia or Insomnia or Lipoatrophy or Lipodystrophy or Myalgia or Myopathy or Pancreatitis or Rash or reaction or Rhabdomyolysis or Seizure? or Triglycerid?emia or Triglycerides).ti,ab,kw. | 2315763
4 | (Didanosine or ddI or Stavudine or d4T or Saquinavir or SQV or Indinavir or IDV or Tipranavir or TPV or Fosamprenavir or FPV or Rilpivirine or RPV or Cobicistat or COBI or Elvitegravir or EVG or Pharmacokinetics or Pregnancy or "Postpartum Period" or Postpartum or Infant or Child or "in vitro" or Prophylaxis or transplant or Transplantation or neonate or "chronic hepatitis B" or Efficacy).ti. | 4470056
5 | animal/ not (human/ and animal/) | 4285612
6 | (1 and (2 or 3)) not (4 or 5) | 2137
7 | limit 6 to (abstracts and english language and "case reports" and yr="2000 - 2016") | 114
8 | remove duplicates from 7 | 108

Table 3. Inter-rater reliability between reviewers

Raters | Percent | Kappa
rater1 & rater2 | 61.1 | 0.2
rater1 & rater5 | 70.4 | 0.4
rater2 & rater5 | 90.7 | 0.8
rater3 & rater4 | 92.6 | 0.8
rater3 & rater5 | 92.6 | 0.8
rater4 & rater5 | 96.3 | 0.9

Figure 1. Example of the input data (blue) and output data (green) that constitute a Test Case used in the study.

c) Data evaluation

The evaluation methodology applied in this study was leveraged from Hripcsak et al.’s foundational work on evaluating the automated detection of clinical conditions from narrative reports using natural language processing 20. As previously described, the evaluation entailed comparing responses generated by 5 human experts and 1 prototype for 15 randomly selected test cases. An additional algorithm that randomly guessed responses with a 50% chance of getting the correct answer (based on a majority vote by the experts) was added for comparison. The primary outcome of the behavioral evaluation in this study was the pairwise inter-subject judgmental dissimilarity quantified by the Jaccard distance. Explicitly, this distance was defined as one minus the number of response elements in common between the sets of responses by a subject j and a subject k, divided by the total number of response elements by the two subjects, for a given test case i, as described in the equation below.
The Jaccard distance has a range 0 ≤ d_ijk ≤ 1, with a higher value implying greater dissimilarity:

d_ijk(X_ij, X_ik) = 1 − |X_ij ∩ X_ik| / |X_ij ∪ X_ik|

The average Jaccard distance between each pair of subjects was computed as the mean Jaccard distance across all 15 test cases in the study. For each expert, the mean Jaccard distance from the other 4 experts was computed. For non-expert subjects, the mean Jaccard distance from all 5 experts was computed. The research hypothesis that the mean Jaccard distance to the group of experts was different for at least one of the subjects was tested using analysis of variance.

The secondary outcome of the behavioral evaluation was the proportion of responses by each subject that were correct, relative to a reference standard based on the majority opinion of the experts. The correctness of responses for a given test case i was defined as the number of responses in common between a subject j and the reference standard k, divided by the number of responses by the subject:

Correctness = |X_ij ∩ X_ik| / |X_ij|

Analogous to the dissimilarity evaluation, an expert’s reference standard was based on the majority vote of the remaining 4 experts, while the reference standard for the prototype and for random guessing was based on the majority vote of all 5 experts. The research hypothesis that there was a difference in the mean correctness of the prototype and the mean correctness of the experts was tested using analysis of variance. In addition to the correctness evaluations, pairwise Kappa statistics were determined to assess consistency in responses between the subjects and the reference standards used.

Results

The 5 experts and the prototype generated 314 unique responses from the 15 test cases. Of these responses, 66 were about antiretroviral toxicities, 109 were about risk factors, and 139 were about toxicity observations. Based on the majority opinion of the experts, 199 responses (63%) were considered correct.
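As a concrete sketch of the scoring defined in the Methods, the two measures can be computed over response sets as follows. The example sets are illustrative, not the study's data.

```python
def jaccard_distance(a, b):
    """One minus the Jaccard similarity of two response sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty response sets are identical
    return 1 - len(a & b) / len(a | b)

def correctness(responses, reference):
    """Proportion of a subject's responses present in the reference standard."""
    responses, reference = set(responses), set(reference)
    if not responses:
        return 0.0
    return len(responses & reference) / len(responses)

# Illustrative response sets for one test case (hypothetical content).
expert = {"abacavir hypersensitivity", "zidovudine anemia"}
prototype = {"abacavir hypersensitivity", "tenofovir nephrotoxicity"}

print(jaccard_distance(expert, prototype))  # 1 - 1/3 ≈ 0.667
print(correctness(prototype, expert))       # 0.5
```

The per-subject figures reported below are means of these per-case values across the 15 test cases.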
Of the 314 responses generated, 70 responses (22%) did not exist as evidence in the prototype’s knowledge base. The mean Jaccard distances of each subject from the experts are illustrated in Figure 2. The comparisons between the mean Jaccard distances of each subject and the mean Jaccard distance of the experts are provided in Table 4. Although the experts differed in their interpretation of the test cases at least 25% of the time, the differences in the Jaccard distances of the experts from each other were not statistically significant. This observation was consistent with a Fleiss’s Kappa score of 0.77, which indicated substantial agreement among the 5 experts. When all the responses were accounted for (unrestricted), the mean Jaccard distance of the experts from each other was 0.312 (95% CI, 0.283 to 0.342), while the mean Jaccard distance of the prototype from the experts was 0.424 (95% CI, 0.382 to 0.466). The difference between these two distances was 0.112 (0.06 to 0.163, p-value <0.001), suggesting statistically significant differences between responses by experts and by the prototype. However, the distance of the prototype from the experts was smaller than the distance of random guessing at a 50% chance of being correct from the experts (Figure 2, unrestricted). Interestingly, restricting the universe of responses (by ignoring, for all subjects, the 70 responses that did not exist in the knowledge base) resulted in the difference between the distance of the prototype from the experts and the average distance among the experts becoming statistically indiscernible. The removal of these responses did not appear to significantly affect the distances of the other subjects from the experts (Figure 2, restricted).

Figure 3 illustrates the means and 95% confidence intervals for the proportion of correct responses by each subject. When responses were unrestricted, the mean correctness of the prototype across all test cases was 79.5% (95% CI, 71.9 to 87.2).
Based on inspection of the confidence interval overlaps and on a one-way ANOVA model using all subjects, there was insufficient evidence to conclude that the difference between the mean correctness of the prototype and the mean correctness of the human experts was statistically significant (p-value >0.5). Similar conclusions were reached when the responses were restricted. Lastly, a Cohen’s Kappa score of 0.68 indicated moderate agreement between the prototype’s responses and the reference standard responses derived from the majority opinion of the experts. Collectively, these observations suggest that equivalence between the prototype’s accuracy and human expert accuracy cannot be ruled out.

Discussion

The behavioral validation conducted in this study involved comparing the detection of antiretroviral toxicities, risk factors, and observations (signs, symptoms, and laboratory findings) by the prototype and by non-design experts for a random sample of test cases. The findings of this study suggest that the knowledge base of the prototype developed in this study behaves like human domain experts, albeit to a moderate degree. There was sufficient evidence to conclude that there was a statistical difference in the detection of antiretroviral toxicities, risk factors, and observations (signs, symptoms, and laboratory findings) from structured data between the prototype and the human experts. Nonetheless, the reports generated by the prototype tended to be more similar to human expert reports than to reports generated through random guessing. The accuracies of the prototype and the human experts were indistinguishable. Interestingly, when the universe of responses was restricted to the knowledge available in the prototype, the dissimilarities between the reports generated by the prototype and the human experts became indistinguishable, while the dissimilarities among reports generated by the experts remained unchanged.
This observation confirms the well-known assertion that for a knowledge base to be considered functionally complete, it must not only be structured appropriately and contain accurate knowledge, but it must also have adequate coverage of the domain knowledge 14. However, as was the case with the development of the prototype in this study, it is not always possible or reasonable to ensure complete domain coverage, particularly in the early stages of the development of knowledge-based applications. Furthermore, when standard guidelines are used as the basis for the content of a knowledge base, inadequate domain coverage is likely. This is because care guidelines tend to provide content about the key treatment-limiting conditions that are most impactful in clinical care.

Figure 2. Mean Jaccard Distance (and 95% Confidence Interval) of Subjects from Experts for unrestricted responses (top) and restricted responses (bottom)

Table 4. Differences between Mean Subject and Mean Expert Jaccard Distances

Category | Subject | Difference (95% CI) | p-value
Unrestricted | Expert 1 | -0.025 (-0.081 to 0.03) | 0.357
Unrestricted | Expert 2 | -0.003 (-0.059 to 0.052) | 0.904
Unrestricted | Expert 3 | -0.018 (-0.073 to 0.038) | 0.52
Unrestricted | Expert 4 | 0.051 (-0.004 to 0.107) | 0.067
Unrestricted | Expert 5 | -0.005 (-0.06 to 0.05) | 0.851
Unrestricted | Prototype | 0.112 (0.06 to 0.163) | <0.01*
Unrestricted | Guessing | 0.321 (0.27 to 0.372) | <0.01*
Restricted | Expert 1 | -0.01 (-0.059 to 0.039) | 0.685
Restricted | Expert 2 | 0.004 (-0.045 to 0.053) | 0.88
Restricted | Expert 3 | -0.03 (-0.078 to 0.019) | 0.227
Restricted | Expert 4 | 0.025 (-0.024 to 0.074) | 0.312
Restricted | Expert 5 | 0.011 (-0.038 to 0.06) | 0.648
Restricted | Prototype | 0.037 (-0.009 to 0.082) | 0.108
Restricted | Guessing | 0.342 (0.297 to 0.387) | <0.01*

Figure 3.
The proportion of Correct Responses (and 95% Confidence Interval) of Subjects from Experts for unrestricted responses (top) and restricted responses (bottom)

It was also interesting to observe that, although the dissimilarities of reports among the experts were statistically indistinguishable, the proportion of time they disagreed with each other was as high as 25%. This suggests variability in the manner in which experts interpret antiretroviral toxicity information, even though no single expert was significantly different from the others. Hripcsak et al. reported a similar observation among expert physicians identifying conditions from radiology reports 20. In our study, it is likely that the variability among the human experts, as well as among the original authors of the case reports used, contributed to the generation of the 70 responses that were not available as evidence in the prototype’s knowledge base. It is not clear why the experts in this study interpreted the reports differently, but this could be a result of local influences and shared experiences that determine how the experts perceive knowledge about antiretroviral toxicities beyond what is described in standard sources such as treatment guidelines and drug labels.

This study had several limitations. First, only clinical pharmacists were used as human expert subjects, and there was no gold standard measure for antiretroviral toxicity. Nonetheless, the clinical pharmacists who participated in the study were carefully selected, and using their majority opinion as the reference for testing the prototype in the absence of a gold standard was deemed credible. Additional research is, however, needed to extend the findings of this study to other health workforce cadres and to non-experts. Second, the evaluated application was an initial prototype. It is possible that as the iterative development of the application continues, future conclusions about its behavior will change.
Third, the study relied on evaluating structured data only. It is possible that solutions investigating the automated detection of antiretroviral toxicities from unstructured data may report different findings.

Conclusion

Overall, this study suggests that it is possible to implement antiretroviral toxicity domain knowledge in knowledge-based applications successfully and that such applications have the potential to support automated detection of antiretroviral toxicities from structured patient records. This study also demonstrates a rigorous, systematic methodology for evaluating such applications quantitatively. Future research should delineate novel ways of dealing with uncertainty and inadequate domain coverage, and of controlling the duration of validation processes, when developing knowledge-based applications that implement clinical guidelines. Further research is also needed to investigate the impact of variability among expert subjects on the outcomes of studies investigating the behavior of knowledge-based applications. Additionally, the knowledge-based approach applied in this study could be investigated further to support surveillance and hypothesis generation in large-scale public health and research datasets.

Acknowledgments

We thank Drs. Dorothy Aywak, Prashant Mandalya, Seema Shah, Jilna Shah, and Lisper Njeri for reviewing the case reports in this study, and Drs. Imran Manji, Wilson Irungu, David Wanje, Benson Njuguna, and Dennis Thirikwa for serving as non-design evaluation experts. We also thank Drs. Gilad Kuperman, Lena Mamykina, and Martin Were for providing feedback on this research, and Dr. George Hripcsak for supporting the completion of the study financially.

References

1. World Health Organization. Consolidated guidelines on the use of antiretroviral drugs for treating and preventing HIV infection: recommendations for a public health approach. 2016.
2. World Health Organization. March 2014 supplement to the 2013 consolidated guidelines on the use of antiretroviral drugs for treating and preventing HIV infection: recommendations for a public health approach. 2014.
3. World Health Organization. Antiretroviral therapy of HIV infection in infants and children: towards universal access: recommendations for a public health approach - 2010 revision. World Health Organization; 2010.
4. King RC, Fomundam HN. Remodeling pharmaceutical care in Sub-Saharan Africa (SSA) amidst human resources challenges and the HIV/AIDS pandemic. The International Journal of Health Planning and Management. 2010;25(1):30-48.
5. Hawthorne N, Anderson C. The global pharmacy workforce: a systematic review of the literature. Human Resources for Health. 2009;7:48.
6. Were MC, Nyandiko WM, Huang KT, Slaven JE, Shen C, Tierney WM, et al. Computer-generated reminders and quality of pediatric HIV care in a resource-limited setting. Pediatrics. 2013;131(3):e789-96.
7. Oluoch T, Santas X, Kwaro D, Were M, Biondich P, Bailey C, et al. The effect of electronic medical record-based clinical decision support on HIV care in resource-constrained settings: a systematic review. International Journal of Medical Informatics. 2012;81(10):e83-92.
8. Shortliffe EH, Barnett GO. Biomedical data: their acquisition, storage, and use. Biomedical Informatics: Springer; 2014. p. 39-66.
9. Musen MA, Middleton B, Greenes RA. Clinical decision-support systems. Biomedical Informatics: Springer; 2014. p. 643-74.
10. Rubin DL, Greenspan H, Brinkley JF. Biomedical imaging informatics. Biomedical Informatics: Springer; 2014. p. 285-327.
11. Payne PR. Chapter 1: Biomedical knowledge integration. PLoS Comput Biol. 2012;8(12):e1002826.
12. Wraith SM, Aikins JS, Buchanan BG, Clancey WJ, Davis R, Fagan LM, et al. Computerized consultation system for selection of antimicrobial therapy. Am J Hosp Pharm. 1976;33(12):1304-8.
13. Guida G, Mauri G.
Evaluating performance and quality of knowledge-based systems: foundation and methodology. IEEE Transactions on Knowledge and Data Engineering. 1993;5(2):204-24.
14. Adelman L, Riedel SL. Handbook for evaluating knowledge-based systems: conceptual framework and compendium of methods. Springer Science & Business Media; 2012.
15. Miller RA. Diagnostic decision support systems. Clinical Decision Support Systems: Springer; 2016. p. 181-208.
16. World Health Organization, Uppsala Monitoring Centre. The use of the WHO-UMC system for standardised case causality assessment. 2014.
17. Knauf R, Gonzalez AJ, Abel T. A framework for validation of rule-based systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 2002;32(3):281-95.
18. Jonquet C, Shah N, Youn C, Callendar C, Storey M-A, Musen M, editors. NCBO Annotator: semantic annotation of biomedical data. International Semantic Web Conference, Poster and Demo Session; 2009.
19. Jonquet C, Shah N, Musen M, editors. The Open Biomedical Annotator. AMIA Summit on Translational Bioinformatics; 2009.
20. Hripcsak G, Friedman C, Alderson PO, DuMouchel W, Johnson SB, Clayton PD. Unlocking clinical data from narrative reports: a study of natural language processing. Annals of Internal Medicine. 1995;122(9):681-8.