Validation of the Behavior of a Knowledge Base Implementing Clinical
Guidelines for Point-of-Care Antiretroviral Toxicity Monitoring
William Ogallo, RPh, PhD1, Carol Friedman, PhD1, Andrew S. Kanter, MD, MPH1
1Department of Biomedical Informatics, Columbia University, New York, NY
Abstract
This study investigated the automated detection of antiretroviral toxicities in structured electronic health records data.
The evaluation compared responses generated by 5 clinical pharmacists and 1 prototype knowledge-based application
for 15 randomly selected test cases. The main outcomes were inter-subject dissimilarity of responses quantified by the
Jaccard distance, and the mean proportion of correct responses by each subject. The statistical differences in inter-subject Jaccard distances suggested that the prototype was inferior to clinical pharmacists in the detection of possible
antiretroviral toxicity associations from structured data. The reason for dissimilarities was attributable to inadequate
domain coverage by the prototype. The differences in the mean proportion of correct responses between the clinical
pharmacists and the prototype were statistically indistinguishable. Overall, this study suggests that knowledge-based
applications have the potential to support automated detection of antiretroviral toxicities from structured patient
records. Furthermore, the study demonstrates a systematic approach for validating such applications quantitatively.
Introduction
The World Health Organization (WHO) recognizes the need to improve the monitoring of antiretroviral toxicity 1. It
places particular emphasis on underserved settings where the range and patterns of antiretroviral toxicities may alter
the need for and frequency of antiretroviral toxicity monitoring 2. The WHO recommends a symptom-directed
antiretroviral monitoring approach in which clinicians assess the signs and symptoms reported by patients and
subsequently draw conclusions about antiretroviral toxicities 1, 3. However, the labor intensity associated with
gathering and analyzing data using the symptom-directed approach limits its utility in settings that face health
workforce challenges 4, 5. Consequently, it is essential to develop strategies that improve the quality of the symptom-directed antiretroviral toxicity monitoring approach. Empirical research suggests that the use of electronic point-of-care clinical decision support (CDS) tools in the management of HIV improves adherence to clinical guidelines and
makes the collection, analysis, and interpretation of data easier 6, 7. However, there is little evidence to demonstrate
how such tools could be applied to improve symptom-directed monitoring of medication safety within HIV care
workflows. Furthermore, research in the development and application of standard definitions, formatting, and
reporting associated with symptom-directed monitoring of antiretroviral toxicity is limited.
Knowledge-based CDS systems could potentially improve the quality of point-of-care antiretroviral toxicity
monitoring. The term ‘knowledge base’ refers to a repository of facts, heuristics, and models that represent domain
knowledge that can be used for problem-solving and analysis of organized data 8. Consequently, a knowledge-based
system is a software application that uses the knowledge stored in its knowledge base to analyze problems and provide
advice within a restricted domain, much as human domain experts do 9. Unlike mathematical or statistical approaches that
use numerical representation and arithmetic manipulation to quantitatively model relationships that support inferences
in a given domain, knowledge-based approaches rely on symbolic manipulations that use ontologies and apply logic
to draw conclusions from asserted facts 9, 10. This reasoning is exemplified by the deduction of clinical diagnoses from
observations of symptoms and laboratory findings 9, 10. Knowledge-based systems are primarily developed and used
to increase the reproducibility, scalability, and accessibility of complex reasoning and decision-making tasks 11. Within
the biomedical domain, knowledge-based systems are foundational applications that have remained popular to date 9, 11, 12. An excellent example of a knowledge-based system is the MYCIN system. This rule-based computer-assisted
decision support system was developed in 1976 by Ted Shortliffe et al. to support inference on the selection of
antibiotic therapy for patients with bacterial infections 9, 12. Currently, knowledge-based systems are applied in several
biomedical domains including but not limited to clinical decision support systems, surveillance in public health
datasets, and hypothesis generation in large-scale research datasets 11.
During the development of a knowledge-based application and before it is deployed for actual use, it is essential to
ensure that its structure and behavior are free of detrimental design flaws 13. Validation studies assess the quality of a
knowledge-based application by examining the functional completeness and the predictive accuracy of its knowledge
base 14. These evaluations assess whether the knowledge base satisfactorily represents domain knowledge and whether
non-design experts (domain experts who did not participate in the development of the knowledge-based application)
agree that the information, rules, and procedures in the knowledge base are complete and accurate 14. While structural
validation evaluates the similarities in how a knowledge base and non-design experts conceptualize and structurally
represent knowledge, behavioral validation uses test cases to evaluate the similarity and compare the accuracy of
outputs made by the knowledge base and by non-design experts 13, 14.
This paper describes the validation of the behavior of a knowledge-based application prototype that implements
standard clinical guidelines for the point-of-care monitoring of antiretroviral toxicities. The goal of our analysis was
to ascertain that the prototype generates patient-specific antiretroviral toxicity reports that are sufficiently accurate for
clinical use. Specifically, we evaluated the similarity and accuracy of antiretroviral toxicity reports generated by the
prototype compared to non-design human experts for a random sample of test cases. This paper reports our findings.
Methods
Generation of Antiretroviral Toxicity Summary Reports
We developed a knowledge-based application prototype intended to facilitate the documentation and analysis of
antiretroviral toxicity data within ambulatory HIV care workflows. The prototype generates patient-specific summary
reports that describe possible antiretroviral toxicities and their risk factors detected from electronic health records
(EHR) data. The core content of the prototype's knowledge base was derived from standard care guidelines and FDA-approved drug labels, and pertains to the major types of antiretroviral toxicities described in the WHO guidelines on
the use of antiretroviral drugs 1. Table 1 lists these antiretroviral toxicities. The prototype organizes medication,
regimen, and toxicity domain knowledge in a manner that supports reasoning through the traversal of the relationships
described in its knowledge base. Similar to diagnostic decision support systems 15, clinicians can use the reports
generated by the prototype to confirm or rule out antiretroviral toxicities experienced by individual patients, and if
necessary conduct additional assessments to narrow down diagnoses.
The prototype’s detection of antiretroviral toxicities is based on the “Possible” causal category of the World Health
Organization-Uppsala Monitoring Center system for standardized causality assessment 16. This criterion requires the
ascertainment of a reasonable temporal association between medication administration and the occurrence of toxicity.
The prototype functions as follows. First, given a patient identifier, the prototype queries longitudinal EHR data to
select the list of medications that constitute the patient’s active antiretroviral regimen and the dates when each drug
was first prescribed. Next, the prototype queries the longitudinal EHR data to select the patient’s clinical observations
and the dates when each observation was made. Subsequently, the prototype creates tuples consisting of the
medications, clinical observations, and the date difference between the date when the clinical observation was recorded
and the date when the antiretroviral drug was ordered. It then matches the selected tuples to relationships between
medications, clinical observations, and predetermined time frames that are defined in its knowledge base. In so doing,
it identifies the tuples in which the medications and clinical observations have temporal relationships that suggest
possible antiretroviral toxicities. Lastly, the prototype matches the identified tuples with the antiretroviral toxicity
concept-concept relationships in its knowledge base to generate a list of possible antiretroviral toxicities as output.
For example, if the prototype finds the medication abacavir ordered on 2017-01-14 and the observation rash recorded on 2017-01-25, it generates the output abacavir hypersensitivity, since rash is a manifestation of hypersensitivity due to abacavir.
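The temporal matching step described above can be sketched as follows. The knowledge-base fragment, function names, and the 30-day window are illustrative assumptions, not the prototype's actual rules or time frames.

```python
from datetime import date

# Hypothetical knowledge-base fragment: (ingredient, observation) -> (toxicity, window in days).
# The 30-day window is an illustrative assumption, not the prototype's actual time frame.
KB_TOXICITY_RULES = {
    ("abacavir", "rash"): ("abacavir hypersensitivity", 30),
}

def detect_possible_toxicities(medications, observations):
    """medications: (ingredient, order date) pairs; observations: (concept, recording date) pairs."""
    findings = []
    for ingredient, ordered_on in medications:
        for concept, observed_on in observations:
            rule = KB_TOXICITY_RULES.get((ingredient, concept))
            # A "Possible" association requires the observation to follow the
            # medication order within the predetermined time frame.
            if rule and 0 <= (observed_on - ordered_on).days <= rule[1]:
                findings.append(rule[0])
    return findings

print(detect_possible_toxicities([("abacavir", date(2017, 1, 14))],
                                 [("rash", date(2017, 1, 25))]))
# → ['abacavir hypersensitivity']
```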
The detection of possible risk factors proceeds similarly. However, some risk factor observations do not require a temporal association with the administration of medications. For such risk factors, the prototype selects the list of active medications and the list of observations and matches these to the antiretroviral toxicity risk factor relationships in its
knowledge base regardless of the dates when they were recorded. For example, if the medication nevirapine is
identified as active and the observation female gender is also identified, then the prototype generates the output female
gender is a risk factor of nevirapine hepatotoxicity.
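The date-independent matching can be sketched in the same style; the knowledge-base triples shown are hypothetical and mirror only the example in the text.

```python
# Hypothetical knowledge-base triples for date-independent risk factors;
# the report text mirrors the nevirapine example above.
KB_RISK_FACTORS = {
    ("nevirapine", "female gender"):
        "female gender is a risk factor of nevirapine hepatotoxicity",
}

def detect_risk_factors(active_ingredients, observations):
    """Match active medications and observations regardless of recording dates."""
    return [KB_RISK_FACTORS[(m, o)]
            for m in active_ingredients
            for o in observations
            if (m, o) in KB_RISK_FACTORS]

print(detect_risk_factors(["nevirapine"], ["female gender", "cough"]))
# → ['female gender is a risk factor of nevirapine hepatotoxicity']
```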
Study Design
This behavioral validation study compared the detection of antiretroviral toxicities, risk factors, and toxicity
observations (symptoms, signs, and laboratory findings) from structured data by non-design human experts and a
prototype knowledge-based application. Specifically, this study evaluated the similarity and the accuracy of reports
generated by 5 clinical pharmacists and the prototype for 15 random test cases. The comparisons were conducted in
an open domain in which the universe of possible responses was not controlled, and in a restricted domain in which
possible responses were constrained to the knowledge content available in the prototype’s knowledge base.
Table 1. Major antiretroviral toxicities described in the WHO HIV guidelines (2016)

Abacavir
• Hypersensitivity reaction

Atazanavir/r
• Electrocardiographic abnormalities (PR and QRS interval prolongation)
• Indirect hyperbilirubinemia (clinical jaundice)
• Nephrolithiasis

Zidovudine
• Severe anemia, neutropenia
• Lactic acidosis or severe hepatomegaly with steatosis
• Lipoatrophy
• Lipodystrophy
• Myopathy

Darunavir/r
• Hepatotoxicity
• Severe skin and hypersensitivity reactions

Dolutegravir
• Hepatotoxicity
• Hypersensitivity reactions

Efavirenz
• Persistent central nervous system toxicity (such as dizziness, insomnia, abnormal dreams) or mental symptoms (anxiety, depression, mental confusion)
• Convulsions
• Hepatotoxicity
• Severe skin and hypersensitivity reactions
• Gynecomastia

Lopinavir/r
• Electrocardiographic abnormalities (PR and QRS interval prolongation, torsades de pointes)
• Hepatotoxicity
• Pancreatitis
• Dyslipidaemia
• Diarrhea

Nevirapine
• Hepatotoxicity
• Severe skin rash and hypersensitivity reaction, including Stevens-Johnson syndrome

Raltegravir
• Rhabdomyolysis, myopathy, myalgia
• Hepatitis and hepatic failure
• Severe skin rash and hypersensitivity reaction

Tenofovir
• Chronic kidney disease
• Acute kidney injury and Fanconi syndrome
• Decreases in bone mineral density
• Lactic acidosis or severe hepatomegaly with steatosis
Study Procedure
The procedure used in the study was loosely based on the framework for validation of rule-based systems by Knauf
et al. 17 and was in concordance with standard procedures for evaluating knowledge bases 14. The Knauf framework
describes a process involving the generation of test scenarios and the use of a Turing Test-like approach to evaluating
the responses of a rule-based system to the test scenarios 17. The key steps applied in this study were test case
generation, test case presentation and experimentation, and data analysis. These steps are described below.
a) Test case generation
The first step of the behavioral comparisons was the creation of test cases. In this study, the test cases were derived
from raw data in published case reports on antiretroviral toxicities. In October 2016, a literature search was conducted
to retrieve published case reports on antiretroviral toxicities. The case reports were identified by electronically
searching the Ovid Medline® database. The search strategy involved the use of medical subject heading (MeSH)
terms and search strings associated with the antiretroviral toxicities of interest, and was limited to case reports with abstracts, published in English between 2000 and 2016. Table 2 lists the queries used to obtain the case
reports. A total of 114 case reports were identified out of which 6 duplicates were removed.
Four reviewers independently reviewed the titles and abstracts of the case report articles retrieved from the search.
Each article was independently reviewed by two reviewers, and each reviewer reviewed 54 articles. A fifth reviewer
reviewed all the 108 articles and acted as a tie-breaker during the selection of the articles. The goal of the review was
to identify antiretroviral toxicity case reports in which the responsible medication, as well as the patient biodata, signs,
symptoms and laboratory findings, were reported. Reviewers were asked to include an article if and only if an adverse
drug reaction was reported or described, the culprit drug was mentioned, and the reported case was about HIV/AIDS.
They were also asked to identify the case reports that described patient characteristics such as age, gender, and weight
as well as signs, symptoms, and laboratory findings. The reviewers were asked to exclude case reports that were solely
about the use of antiretroviral medications for the management of hepatitis infections, reports that only addressed
treatment efficacies and reports that were about genetics, tumors, or immunotherapy.
Table 3 shows the consensus between pairs of reviewers who reviewed the same case reports estimated using percent
agreement and Cohen's Kappa. The ratings from reviewer 1 were dropped based on the high rate of disagreement with
the other reviewers. A total of 62 cases were identified from the 55 articles that were eventually included in the study.
The 62 cases, available as raw textual narratives, were structured and annotated to enable input and analysis by our
prototype. The annotation was done using the National Center for Biomedical Ontology (NCBO) Annotator. The
NCBO Annotator is an ontology-based web service for annotating raw texts with ontology concepts from several
biomedical terminology vocabularies in the Unified Medical Language System (UMLS) Metathesaurus and the NCBO
Bioportal repositories 18, 19. For example, the text “A female patient using Amoxicillin complained of Rash” would be
annotated with several UMLS concepts including female (CUI C0086287), amoxicillin (CUI C0002645), rash (CUI
C0015230). The annotation process was done via the NCBO Annotator’s Representational State Transfer (REST)
Web Service, with the ontology sources restricted to RxNorm, MEDDRA, and LOINC.
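A request to the Annotator's REST endpoint might be constructed as below. The endpoint and parameter names follow BioPortal's public documentation, and the API key is a placeholder.

```python
from urllib.parse import urlencode

ANNOTATOR_ENDPOINT = "https://data.bioontology.org/annotator"

def build_annotator_url(text, ontologies=("RXNORM", "MEDDRA", "LOINC"),
                        apikey="YOUR_API_KEY"):
    """Build the GET request URL; the JSON response lists matched ontology classes."""
    params = {"text": text,
              "ontologies": ",".join(ontologies),
              "apikey": apikey}
    return ANNOTATOR_ENDPOINT + "?" + urlencode(params)

url = build_annotator_url("A female patient using Amoxicillin complained of Rash")
```

Parsing the returned JSON then yields the matched classes, from which UMLS concept identifiers such as those cited above can be derived.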
The resulting annotations for each of the 62 cases were manually grouped into 5 categories: descriptive characteristics
(e.g., age, weight, gender), comorbidities, medications, signs/symptoms/findings, and laboratory test results. Two
reviewers independently reviewed the structured annotations for each case. The goals of this review were to cross-check that the annotated concepts were indeed present in the raw text of the case, to identify redundant and synonymous
concepts, to add concepts that were not identified by the annotator, and to fill in numeric values and reference ranges
for concepts that had numeric values. The two reviewers compared their reviews for each case, with discrepancies resolved by consensus after confirmation against the raw text for the case in question. A stratified random sampling
procedure was used to select the 15 test cases that were presented to the prototype and the experts in the study.
Stratification by the type of antiretroviral medications and the type of antiretroviral drug toxicities was used to
minimize sample selection bias.
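One simple way to draw such a proportional stratified sample is sketched below; the stratum labels are hypothetical (the study stratified on both medication and toxicity type), and the seed is arbitrary.

```python
import random
from collections import defaultdict

def stratified_sample(cases, n_total, seed=42):
    """cases: (case_id, stratum) pairs; draw from each stratum in proportion to its size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case_id, stratum in cases:
        strata[stratum].append(case_id)
    sample = []
    for ids in strata.values():
        # Each stratum contributes roughly its share of the target sample size.
        share = max(1, round(n_total * len(ids) / len(cases)))
        sample.extend(rng.sample(ids, min(share, len(ids))))
    return sample[:n_total]

# 62 annotated cases with hypothetical toxicity-type strata, sampled down to 15.
cases = [(i, "hepatotoxicity" if i % 2 else "hypersensitivity") for i in range(62)]
picked = stratified_sample(cases, 15)
```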
b) Test case presentation and experimentation
Each test case was described as a pair of input test data and the corresponding output responses. Figure 1 illustrates
an example of a test case. The input test data for a given test case comprised structured lists of the biodata,
comorbidities, medications, signs/symptoms/findings, and laboratory test results. The input data was presented in two
formats. In the first format, the test data was presented as observations in an OpenMRS database (MySQL) to enable
analysis by the prototype. In the second format, the input data for a given test case was presented as a structured
clinical vignette for the human experts using Google Forms. The output for a given test case was defined as lists of 1)
Possible antiretroviral toxicities, 2) Possible antiretroviral toxicity risk factors, and 3) Possible antiretroviral toxicity
observations (signs, symptoms, and laboratory results).
All 15 selected test cases were processed by the prototype and by 5 clinical pharmacists who did not participate in
the development of the prototype. The prototype processed the input data by querying its knowledge base and
generating lists of 1) Ingredient-Toxicity pairs, 2) Ingredient-Toxicity-Risk Factor triples, and 3) Ingredient-Toxicity-Toxicity Observation triples. The human experts processed the input data by selecting choices for three multi-answer
questions about each test case: 1) What antiretroviral toxicities could plausibly be identified from the case above? 2)
What antiretroviral toxicity risk factors could plausibly be identified from the case above? 3) What antiretroviral
toxicity manifestations could plausibly be identified from the case above?
Table 2. Terms and strategy for literature search in Ovid Medline

Search 1 (15423 results):
(Abacavir or ABC or Atazanavir or ATV or "ATV/r" or Dolutegravir or DTG or Darunavir or DRV or "DRV/r" or Efavirenz or EFV or Etravirine or ETV or ETR or Lopinavir or LPV or "LPV/r" or Nevirapine or NVP or Raltegravir or RAL or Tenofovir or TDF or Zidovudine or ZDV or AZT).ti.

Search 2 (593315 results):
("Drug-Related Side Effects and Adverse Reactions" or "Acidosis, Lactic" or "Acute kidney injury" or "Bone density" or "Drug Hypersensitivity" or "Drug-Induced Liver Injury" or "Fanconi syndrome" or "Fatty Liver" or "Heart Conduction System/abnormalities" or "Muscular Diseases" or "Renal Insufficiency, Chronic" or "Sleep Initiation and Maintenance Disorders" or Anemia or Anxiety or Confusion or Depression or Diarrhea or Dizziness or Dreams or Dyslipidemias or Gynecomastia or Hepatomegaly or Hyperbilirubinemia or Jaundice or Lipodystrophy or Nephrolithiasis or Neutropenia or Pancreatitis or Rhabdomyolysis or Seizures).sh.

Search 3 (2315763 results):
("adverse drug reaction" or "adverse reaction" or "adverse drug event" or "adverse reaction" or "adverse event" or "toxicity" or allerg$ or "Abnormal Dreams" or "Acute kidney failure" or "Acute kidney injury" or "Acute renal failure" or "An?emia" or "Bone density" or "bone mineral density" or "breast enlargement" or "Central Nervous System Toxicity" or "Chronic Kidney Disease" or "Chronic Kidney Failure" or "Chronic Kidney Insufficiency" or "Chronic Renal Disease" or "Chronic Renal Failure" or "Chronic Renal Insufficiency" or "Drug-Induced Liver Injury" or "Electrocardiographic abnormalities" or "Enlarged Liver" or "Fanconi syndrome" or "Fatty Liver" or "Heart Conduction disorder" or "Hepatic failure" or "Hepatic Injury" or "Hepatic toxicity" or "Hepatomegaly" or "Hyperbilirubin?emia" or "Icterus" or "Jaundice" or "Kidney Stone" or "Kidney Stones" or "Lactic Acidosis" or "Liver Enlargement" or "Liver failure" or "Liver injury" or "Liver toxicity" or "loose bowel movement" or "Mental symptoms" or "Muscular Disease" or "Nephrolithiasis" or "Neutrop?enia" or "PR interval prolongation" or "QRS interval prolongation" or "QT interval prolongation" or "Renal colic" or "Renal Lithiasis" or "Skin reaction" or "Steatosis" or allerg$ or Anxiety or Cholesterol or Cholesterol?emia or Confusion or Convulsion? or Depression or Diarrh?ea or Dizziness or Dyslipidemia or Eruptions or Gyn?ecomastia or Hepatitis or Hepatotoxicity or Hypercholesterol?emia or Hypersensitivity or Hypertriglycerid?emia or Insomnia or Lipoatrophy or Lipodystrophy or Myalgia or Myopathy or Pancreatitis or Rash or reaction or Rhabdomyolysis or Seizure? or Triglycerid?emia or Triglycerides).ti,ab,kw.

Search 4 (4470056 results):
(Didanosine or ddI or Stavudine or d4T or Saquinavir or SQV or Indinavir or IDV or Tipranavir or TPV or Fosamprenavir or FPV or Rilpivirine or RPV or Cobicistat or COBI or Elvitegravir or EVG or Pharmacokinetics or Pregnancy or "Postpartum Period" or Postpartum or Infant or Child or "in vitro" or Prophylaxis or transplant or Transplantation or neonate or "chronic hepatitis B" or Efficacy).ti.

Search 5 (4285612 results):
animal/ not (human/ and animal/)

Search 6 (2137 results):
(1 and (2 or 3)) not (4 or 5)

Search 7 (114 results):
limit 6 to (abstracts and english language and "case reports" and yr="2000 - 2016")

Search 8 (108 results):
remove duplicates from 7
Table 3. Inter-rater reliability between reviewers

Raters           Percent   Kappa
rater1 & rater2  61.1      0.2
rater1 & rater5  70.4      0.4
rater2 & rater5  90.7      0.8
rater3 & rater4  92.6      0.8
rater3 & rater5  92.6      0.8
rater4 & rater5  96.3      0.9
Figure 1. Example of the input data (blue) and output data (green) that constitute a Test Case used in the study.
c) Data Evaluation
The evaluation methodology applied in this study was adapted from Hripcsak et al.'s foundational work on
evaluating the automated detection of clinical conditions from narrative reports using natural language processing 20.
As previously described, the evaluation entailed comparing responses generated by 5 human experts and 1 prototype
for 15 randomly selected test cases. An additional algorithm that randomly guessed responses with 50% chance of
getting the correct answer (based on a majority vote by the experts) was added for comparison. The primary outcome
of the behavioral evaluation in this study was the pairwise inter-subject judgmental dissimilarity quantified by the Jaccard distance. Explicitly, for a given test case i, this distance was defined as one minus the number of response elements in common between the sets of responses by a subject j and a subject k, divided by the total number of distinct response elements given by the two subjects:

d(X_ij, X_ik) = 1 − |X_ij ∩ X_ik| / |X_ij ∪ X_ik|

The Jaccard distance has the range 0 ≤ d(X_ij, X_ik) ≤ 1, with a higher value implying greater dissimilarity.
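The distance above reduces to a few lines of code; the empty-set convention is an assumption made here for completeness, not stated in the paper.

```python
def jaccard_distance(responses_j, responses_k):
    """1 - |intersection| / |union| over two subjects' response sets for one test case."""
    a, b = set(responses_j), set(responses_k)
    if not (a or b):
        return 0.0  # assumed convention: two empty response sets are identical
    return 1 - len(a & b) / len(a | b)

# Two subjects sharing 2 of 4 distinct responses are 0.5 apart.
d = jaccard_distance({"hepatotoxicity", "rash", "anemia"},
                     {"hepatotoxicity", "rash", "diarrhea"})
print(d)
# → 0.5
```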
The average Jaccard distance between each pair of subjects was computed as the mean Jaccard distance across all the
15 test cases in the study. For each expert, the mean Jaccard distance from the other 4 experts was computed. For non-expert subjects, the mean Jaccard distance from all 5 experts was computed. The research hypothesis that the mean
Jaccard distance to the group of experts was different for at least one of the subjects was tested using analysis of
variance.
The secondary outcome of the behavioral evaluation was the proportion of responses by each subject that were correct,
relative to a reference standard based on the majority opinion of the experts. The correctness of responses for a given
test case i was defined as the number of responses in common between a subject j and the reference standard k, divided by the number of responses by the subject:

Correctness = |X_ij ∩ X_ik| / |X_ij|
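The correctness proportion can be computed directly from the two response sets; the empty-set guard is an assumption added here.

```python
def correctness(subject_responses, reference_responses):
    """Proportion of a subject's responses that appear in the reference standard."""
    subject, reference = set(subject_responses), set(reference_responses)
    if not subject:
        return 0.0  # assumed convention for a subject with no responses
    return len(subject & reference) / len(subject)

# 3 of the subject's 4 responses match the majority-vote reference.
c = correctness({"rash", "anemia", "diarrhea", "myopathy"},
                {"rash", "anemia", "diarrhea", "hepatotoxicity"})
print(c)
# → 0.75
```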
Analogous to the dissimilarity evaluation, an expert’s reference standard was based on the majority vote of the
remaining 4 experts, while the prototype’s and random guessing reference standard was based on the majority vote of
all the 5 experts. The research hypothesis that there was a difference in the mean correctness of the prototype and the
mean correctness of the experts was tested using analysis of variance. In addition to the correctness evaluations,
pairwise Kappa statistics were determined to assess consistency in responses between the subjects and the reference
standards used.
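A pairwise Cohen's Kappa of this kind can be computed as follows; the example ratings are illustrative, not study data.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters' parallel ratings."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    categories = set(ratings_a) | set(ratings_b)
    # Expected agreement under independence: product of each rater's category rates.
    expected = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)

# Per-response "correct" (1) / "incorrect" (0) judgments by a subject vs. the reference.
k = cohens_kappa([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```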
Results
The 5 experts and the prototype generated 314 unique responses from the 15 test cases. Of these responses, 66 were
about antiretroviral toxicities, 109 were about risk factors, and 139 were about toxicity observations. Based on the
majority opinion of the experts, 199 responses (63%) were considered correct. Of the 314 responses generated, 70
responses (22%) did not exist as evidence in the prototype’s knowledge base.
The mean Jaccard distances of each subject from the experts are illustrated in Figure 2. The comparisons between the
mean Jaccard distances of each subject and the mean Jaccard distance of the experts are provided in Table 4. Although the experts differed in their interpretation of the test cases at least 25% of the time, the differences in the Jaccard distances of the experts from each other were not statistically significant. This observation was corroborated by a Fleiss's Kappa score of 0.77, indicating substantial agreement among the 5 experts. When all the responses were accounted
for (unrestricted), the mean Jaccard distance of the experts from each other was 0.312 (95% CI, 0.283 to 0.342), while
the mean Jaccard distance of the prototype from the experts was 0.424 (95% CI, 0.382 to 0.466). The difference
between these two distances was 0.112 (0.06 to 0.163, p-value <0.001) suggesting statistically significant differences
between responses by the experts and by the prototype. However, the distance of the prototype from the experts was smaller than the distance of random guessing (at a 50% chance of being correct) from the experts (Figure 2 - unrestricted).
Interestingly, restricting the universe of responses (by ignoring the 70 responses that did not exist in the knowledge
base for all subjects) resulted in the difference between the distance of the prototype from the experts and the average
distance among the experts becoming statistically indiscernible. The removal of the responses did not appear to
significantly affect the distances of the other subjects from the experts (Figure 2 - restricted).
Figure 3 illustrates the means and 95% confidence intervals for the proportion of correct responses by a subject. When
responses were unrestricted, the mean correctness of the prototype across all test cases was 79.5% (95% CI, 71.9 to
87.2). Based on inspection of the confidence interval overlaps and on the one-way ANOVA model using all subjects,
there was insufficient evidence to conclude that the difference between the mean correctness of the prototype and the mean correctness of the human experts was statistically significant (p-value>0.5). Similar conclusions were reached when the responses were restricted. Lastly, a Cohen's Kappa score of 0.68 indicated substantial agreement between the prototype's responses and the reference standard responses derived from the majority opinion of the experts. Collectively, these observations suggest that equivalence between the prototype's accuracy and that of the human experts cannot be ruled out.
Discussion
The behavioral validation conducted in this study involved comparing the detection of antiretroviral toxicities, risk
factors, and observations (signs, symptoms, and laboratory findings) by the prototype and by non-design experts for
a random sample of test cases. The findings of this study suggest that the knowledge base of the prototype developed
in this study behaves like human domain experts, albeit to a moderate degree.
There was sufficient evidence to conclude that there was a statistical difference in the detection of antiretroviral
toxicities, risk factors, and observations (sign, symptoms, and laboratory findings) from structured data between the
prototype and human experts. Nonetheless, the reports generated by the prototype tended to be more similar to human
expert reports than to reports generated through random guessing. The accuracies of the prototype and the human
experts were indistinguishable.
Interestingly, when the universe of responses was restricted to the knowledge that was available in the prototype, the
dissimilarities between the reports generated by the prototype and the human experts became indistinguishable, while the dissimilarities among reports generated by the experts remained unchanged. This observation confirms the well-known assertion that for a knowledge base to be considered functionally complete, it must not only be structured appropriately and contain accurate knowledge, but must also have adequate coverage of the domain knowledge 14.
However, as was the case with the development of the prototype in this study, it is not always possible or reasonable
to ensure complete domain coverage particularly in the early stages of the development of knowledge-based
applications. Furthermore, when using standard guidelines as the basis for the content of the knowledge base of
knowledge-based applications, inadequate domain coverage is likely. This is because care guidelines tend to provide
content about key treatment-limiting conditions that are most impactful in clinical care.
Figure 2. Mean Jaccard Distance (and 95% Confidence Interval) of Subjects from Experts for unrestricted responses
(top) and restricted responses (bottom)
Table 4. Differences between Mean Subject and Mean Expert Jaccard Distances

Category      Subject     Difference (95% CI)        p-value
Unrestricted  Expert 1    -0.025 (-0.081 to 0.03)    0.357
              Expert 2    -0.003 (-0.059 to 0.052)   0.904
              Expert 3    -0.018 (-0.073 to 0.038)   0.52
              Expert 4    0.051 (-0.004 to 0.107)    0.067
              Expert 5    -0.005 (-0.06 to 0.05)     0.851
              Prototype   0.112 (0.06 to 0.163)      <0.01*
              Guessing    0.321 (0.27 to 0.372)      <0.01*
Restricted    Expert 1    -0.01 (-0.059 to 0.039)    0.685
              Expert 2    0.004 (-0.045 to 0.053)    0.88
              Expert 3    -0.03 (-0.078 to 0.019)    0.227
              Expert 4    0.025 (-0.024 to 0.074)    0.312
              Expert 5    0.011 (-0.038 to 0.06)     0.648
              Prototype   0.037 (-0.009 to 0.082)    0.108
              Guessing    0.342 (0.297 to 0.387)     <0.01*
Figure 3. The proportion of Correct Responses (and 95% Confidence Interval) of Subjects from Experts for
unrestricted responses (top) and restricted responses (bottom)
It was also interesting to observe that although the dissimilarities of reports among the experts were statistically indistinguishable, the proportion of time they disagreed with each other was as high as 25%. This suggests variability in the manner in which experts interpret antiretroviral toxicity information, even though no single expert was significantly different from the others. Hripcsak et al. reported a similar observation among expert physicians identifying conditions from radiology reports 20. In our study, it is likely that variability among the human experts, as well as among the original authors of the case reports used, contributed to the generation of the 70 responses that were not available as evidence in the prototype's knowledge base. It is not clear why the experts in this study interpreted the reports differently, but this could be a result of local influences and shared experiences that determine how the experts perceive knowledge about antiretroviral toxicities beyond what is described in standard sources such as treatment guidelines and drug labels.
This study had several limitations. First, only clinical pharmacists served as human expert subjects, and there was no
gold-standard measure for antiretroviral toxicity. Nonetheless, the clinical pharmacists who participated in the study
were carefully selected, and using their majority opinion as the reference for testing the prototype, in the absence of
a gold standard, was deemed credible. Additional research is nevertheless needed to extend the findings of this study
to other health workforce cadres and to non-experts. Second, the evaluated application was an initial prototype; as its
iterative development continues, future conclusions about its behavior may change. Third, the study evaluated
structured data only. Solutions investigating the automated detection of antiretroviral toxicities from unstructured
data may report different findings.
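The majority-opinion reference used in place of a gold standard can be illustrated with a short sketch. This is a hypothetical reconstruction, not the study's actual scoring code, and the example associations are invented.

```python
from collections import Counter


def majority_reference(expert_reports):
    """Reference set: associations reported by more than half of the experts."""
    quorum = len(expert_reports) // 2 + 1
    counts = Counter(assoc for report in expert_reports for assoc in set(report))
    return {assoc for assoc, n in counts.items() if n >= quorum}


def proportion_correct(report, reference):
    """Fraction of the reference associations that a subject's report recovered."""
    if not reference:
        return 1.0
    return len(set(report) & reference) / len(reference)


# Hypothetical reports from three experts for one test case.
experts = [
    {("TDF", "nephrotoxicity"), ("NVP", "rash")},
    {("TDF", "nephrotoxicity")},
    {("TDF", "nephrotoxicity"), ("NVP", "rash")},
]
ref = majority_reference(experts)  # both associations reach the quorum of 2
print(proportion_correct({("TDF", "nephrotoxicity")}, ref))  # 0.5
```

Under this scheme a subject (human or prototype) is scored against the consensus of the remaining experts, which is how the mean proportions of correct responses in Figure 3 can be interpreted.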
Conclusion
Overall, this study suggests that it is possible to successfully implement antiretroviral toxicity domain knowledge in knowledge-based applications and that such applications have the potential to support automated detection of
antiretroviral toxicities from structured patient records. This study also demonstrates a rigorous systematic
methodology for evaluating such applications quantitatively. Future research should delineate novel ways of dealing
with uncertainty and inadequate domain coverage, and of controlling the duration of validation processes, when
developing knowledge-based applications that implement clinical guidelines. Further research is also needed to
investigate the impact of the variability among expert subjects on the outcomes of studies investigating the behavior
of knowledge-based applications. Additionally, the knowledge-based approach applied in this study could be
investigated further to support surveillance and hypothesis generation in large-scale public health and research
datasets.
Acknowledgments
We thank Drs. Dorothy Aywak, Prashant Mandalya, Seema Shah, Jilna Shah, and Lisper Njeri for reviewing the case
reports in this study, and Drs. Imran Manji, Wilson Irungu, David Wanje, Benson Njuguna, and Dennis Thirikwa for
serving as non-design evaluation experts. We also thank Drs. Gilad Kuperman, Lena Mamykina, and Martin Were for
providing feedback on this research, and Dr. George Hripcsak for supporting the completion of the study financially.
References
1. World Health Organization. Consolidated guidelines on the use of antiretroviral drugs for treating and preventing HIV infection: recommendations for a public health approach. 2016.
2. World Health Organization. March 2014 supplement to the 2013 consolidated guidelines on the use of antiretroviral drugs for treating and preventing HIV infection: recommendations for a public health approach. 2014.
3. World Health Organization. Antiretroviral therapy of HIV infection in infants and children: towards universal access: recommendations for a public health approach - 2010 revision: World Health Organization; 2010.
4. King RC, Fomundam HN. Remodeling pharmaceutical care in Sub-Saharan Africa (SSA) amidst human resources challenges and the HIV/AIDS pandemic. The International Journal of Health Planning and Management. 2010;25(1):30-48.
5. Hawthorne N, Anderson C. The global pharmacy workforce: a systematic review of the literature. Human Resources for Health. 2009;7:48.
6. Were MC, Nyandiko WM, Huang KT, Slaven JE, Shen C, Tierney WM, et al. Computer-generated reminders and quality of pediatric HIV care in a resource-limited setting. Pediatrics. 2013;131(3):e789-96.
7. Oluoch T, Santas X, Kwaro D, Were M, Biondich P, Bailey C, et al. The effect of electronic medical record-based clinical decision support on HIV care in resource-constrained settings: a systematic review. International Journal of Medical Informatics. 2012;81(10):e83-92.
8. Shortliffe EH, Barnett GO. Biomedical data: their acquisition, storage, and use. Biomedical Informatics: Springer; 2014. p. 39-66.
9. Musen MA, Middleton B, Greenes RA. Clinical decision-support systems. Biomedical Informatics: Springer; 2014. p. 643-74.
10. Rubin DL, Greenspan H, Brinkley JF. Biomedical imaging informatics. Biomedical Informatics: Springer; 2014. p. 285-327.
11. Payne PR. Chapter 1: Biomedical knowledge integration. PLoS Comput Biol. 2012;8(12):e1002826.
12. Wraith SM, Aikins JS, Buchanan BG, Clancey WJ, Davis R, Fagan LM, et al. Computerized consultation system for selection of antimicrobial therapy. Am J Hosp Pharm. 1976;33(12):1304-8.
13. Guida G, Mauri G. Evaluating performance and quality of knowledge-based systems: foundation and methodology. IEEE Transactions on Knowledge and Data Engineering. 1993;5(2):204-24.
14. Adelman L, Riedel SL. Handbook for evaluating knowledge-based systems: conceptual framework and compendium of methods: Springer Science & Business Media; 2012.
15. Miller RA. Diagnostic decision support systems. Clinical Decision Support Systems: Springer; 2016. p. 181-208.
16. World Health Organization, Uppsala Monitoring Centre. The use of the WHO-UMC system for standardised case causality assessment. 2014.
17. Knauf R, Gonzalez AJ, Abel T. A framework for validation of rule-based systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 2002;32(3):281-95.
18. Jonquet C, Shah N, Youn C, Callendar C, Storey M-A, Musen M, editors. NCBO Annotator: semantic annotation of biomedical data. International Semantic Web Conference, Poster and Demo Session; 2009.
19. Jonquet C, Shah N, Musen M, editors. The Open Biomedical Annotator. AMIA Summit on Translational Bioinformatics; 2009.
20. Hripcsak G, Friedman C, Alderson PO, DuMouchel W, Johnson SB, Clayton PD. Unlocking clinical data from narrative reports: a study of natural language processing. Annals of Internal Medicine. 1995;122(9):681-8.