OPEN ACCESS
Perspective
Evidence synthesis, digital scribes, and translational challenges for artificial intelligence in healthcare
Enrico Coiera1,2,* and Sidong Liu1
1Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, Level 6, 75 Talavera Road, North Ryde, Sydney,
*Correspondence: enrico.coiera@mq.edu.au
https://doi.org/10.1016/j.xcrm.2022.100860
SUMMARY
Healthcare has well-known challenges with safety, quality, and effectiveness, and many see artificial intelligence (AI) as essential to any solution. Emerging applications include the automated synthesis of best-practice research evidence, including systematic reviews, which would ultimately see all clinical trial data published in a computational form for immediate synthesis. Digital scribes embed themselves in the process of care to detect, record, and summarize events and conversations for the electronic record. However, three persistent translational challenges must be addressed before AI is widely deployed. First, little effort is spent replicating AI trials, exposing patients to risks of methodological error and biases. Next, there is little reporting of patient harms from trials. Finally, AI built using machine learning may perform less effectively in different clinical settings.
Cell Reports Medicine 3, 100860, December 20, 2022 © 2022 The Author(s).
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
An implicit assumption behind most efforts to use automation to assist with SRs is that we are substituting computational methods to complete activities that humans currently undertake. However, humans and machines have different capabilities, and we can reconceive both the individual steps in evidence synthesis and their ordering when machines undertake them. For example, in the standard human SR process, candidate articles are first screened for inclusion or exclusion, often using only the title and abstract. Only later, when article numbers are much reduced, is the time-consuming process of data extraction undertaken. However, what is time-consuming for humans may be easy for a machine. Consequently, the automated extraction of study characteristics from abstracts can effectively make screening decisions,39 even though such a workflow would be hugely inefficient if undertaken by a human.

The ambitions for a computable approach to evidence synthesis are, however, much greater than the automation of systematic reviews, given that such reviews are only one of many forms of evidence synthesis. The larger game is for all clinical trial data to be published in a computational form that allows for immediate synthesis with other trials, and indeed other forms of evidence.40 Such a goal relies on achieving consensus on standards for publishing clinical trials in computable form,41 governance arrangements that see trial data made available for analysis beyond those who collected the initial data, and the development of intelligent tools to undertake synthesis tasks. Publishing trial information and results in a structured form will allow for automatic monitoring for new trials. New trials could then signal that a systematic review needs to be updated.42

Clinical trial evidence, however, cannot answer all our healthcare questions. Trials are expensive to conduct, and by design are controlled. For example, strict inclusion and exclusion criteria typically exclude patients with comorbidities, so that the trial populations do not necessarily represent real-world populations or settings. They also do not necessarily capture data that can be used to develop diagnostic or prognostic algorithms. When clinical trial data are unavailable to answer a question, observational data that are captured in electronic health records (EHRs) may be able to help.43 Indeed, creating algorithms developed on population data has been a core objective of AI research and practice. Making patient-specific predictions using population data remains challenging, especially with rare diseases, unusual presentations, or multimorbidity. In such cases, careful methods must be used to identify, from the electronic record data, a cohort of patients sufficiently similar to the patient being managed.44

Longer term, the evidence synthesis project will bring together data from clinical trials with longitudinal data from EHRs. This will require innovations not just in machine learning and statistics, but careful attention to the design of the decision support systems that use these methods to influence human decisions.

The digital scribe
Digital scribes are intelligent documentation support systems. They use advances in speech recognition (SpR), natural language processing, and AI to automatically document spoken elements of the clinical encounter, similar to the function performed by human medical scribes.45–47

The motivations for using digital scribes are compelling. Over 40% of US clinicians report at least one symptom of burnout,48 and modern EHRs are partly to blame. Since EHRs were introduced, the time spent by clinicians on administrative tasks has increased and can occupy half of the working day, partly driven by regulatory and billing requirements.48,49 Every hour spent on patient care may generate up to 2 h of EHR-related work, often extending outside working hours.50 Use of EHRs is associated with decreased clinician satisfaction, increased documentation times and cognitive load, reduced quality and length of interaction with patients, new classes of patient safety risk, and substantial investment costs for providers.51
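The abstract-screening step described at the start of this section is, at heart, a text-classification task. The following is a purely illustrative sketch, with invented toy data and a minimal naive Bayes model; it is not any of the screening systems cited above, which use far richer features and trained models:

```python
import math
from collections import Counter

def tokenize(text):
    # Crude whitespace tokenizer with punctuation stripping.
    return [w.strip(".,:;()").lower() for w in text.split()]

def train(labeled):
    """labeled: list of (title_and_abstract, "include" | "exclude") pairs."""
    word_counts = {"include": Counter(), "exclude": Counter()}
    doc_counts = Counter()
    for text, label in labeled:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts

def screen(text, model):
    """Return the more probable label under a naive Bayes model."""
    word_counts, doc_counts = model
    total_docs = sum(doc_counts.values())
    vocab = len(set(word_counts["include"]) | set(word_counts["exclude"]))
    scores = {}
    for label, counts in word_counts.items():
        n = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for w in tokenize(text):
            # Add-one smoothed log likelihood for each word.
            score += math.log((counts[w] + 1) / (n + vocab))
        scores[label] = score
    return max(scores, key=scores.get)
```

Trained on a handful of labeled abstracts, `screen` assigns an include/exclude decision to each new candidate article; the point is only that title-and-abstract screening reduces to routine text classification, which is why it is cheap for a machine even when data extraction would be expensive for a human.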
The promise of digital scribes is to reduce this human documentation burden. The price for this help will be a re-engineering of the clinical encounter.52

Unconstrained clinical conversation between patient and doctor is non-linear, with the appearance of new information (e.g., a new clinical symptom or finding) triggering a re-exploration of a previously completed task such as an enquiry about family history of disease.53 While a fully automated method to transform conversation into complete and accurate clinical records in such a dynamic setting is beyond the state of the art, it is possible to use AI methods to undertake subtasks in this process and still meaningfully reduce clinician documentation effort.

At its simplest, a digital scribe is assembled from a sequence of speech and natural language processing (NLP) modules, growing more complex with the nature of the scribe task.54 The simplest form of a scribe creates verbatim transcripts of conversation or allows a clinician to use SpR to call up templates and standard paragraphs, thus simplifying the data entry burden. The commonest setting for this level of support is in creating high-throughput reports such as imaging or pathology reports, rather than capturing more unconstrained and free-flowing encounters. Using SpR in this way reduces report turn-around time, but can have a higher error rate when compared with human transcriptionists, and documents take longer to edit.55 Verbatim transcripts are less valuable in settings where there is a conversation, for example between doctor and patient, and less than 20% of such an exchange might contribute to the final record.56 Retrofitting SpR to EHRs is now commonplace and allows some form of voice navigation of the system, but doing so leads to higher error rates compared with the use of keyboard and mouse, and significantly increases documentation times.57

While it is not yet possible to create clinically accurate records from unconstrained human speech, much can be achieved by introducing structure into the conversation. Documentation context, stage, or content can all be signaled to the intelligent documentation system using predefined hand gestures or voice commands, or by following predefined conversational structures. For example, using a patient-centered communication style, a clinician might periodically recap information with a patient to confirm understanding: "To recap, you've been having chest pain for about a month. It feels worse when you walk and climb the stairs. Is that right?" The scribe system could be trained so that the word "recap" is a signal that a summary is being provided, and "right" terminates the summary.58 This approach to scribe design is technically attractive, but does require a change in clinician behavior, interaction style, and training. The cost-benefit of doing so will vary with clinical settings and documentation tasks.

Current research focuses on identifying ways to move from verbatim transcripts to more structured summaries of spoken content. Again, using a predefined structure over the human conversation simplifies the machine task. For example, routine clinic visits to monitor patients for chronic illness are already highly structured. We can consider unconstrained speech as a sequence of utterances and attempt to attach a topic label to each (e.g., medication history, family history, symptoms),59 which would allow utterances on a single topic to be aggregated even if they appear at different points in a dialogue, and large contiguous topic blocks to be identified. Breaking utterances down by topic also allows specialized machine learning systems to be trained, for example to identify topic-specific concepts and relations between concepts.60

Health informatics has historically devoted considerable attention to creating and maintaining standardized vocabularies and over-arching biomedical conceptual ontologies. Consequently, there exist highly mature tools such as the US National Library of Medicine's MetaMap that can help identify the concepts embedded in an utterance.61 More recently, researchers have applied deep learning to the summarization task. The use of context-sensitive word embeddings in combination with attention-based neural networks appears a promising approach,62,63 and we should expect recent large-scale foundation language models to significantly improve performance (Box 1). Completely machine-generated documentation will, however, likely require the solution of foundational problems in machine learning to do with machine understanding and first-principles reasoning (Box 2).
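The utterance-labeling and aggregation idea described above can be sketched in a few lines. The topic lexicon and utterances here are invented for illustration only; a deployed scribe would use a trained utterance classifier rather than keyword matching:

```python
from collections import defaultdict

# Hypothetical keyword lexicon, standing in for a trained topic classifier.
TOPIC_KEYWORDS = {
    "medication history": {"medication", "tablet", "dose", "taking"},
    "family history": {"mother", "father", "family"},
    "symptoms": {"pain", "cough", "tired", "breath"},
}

def label_utterance(utterance):
    """Attach a topic label to a single utterance."""
    words = {w.strip(".,?!") for w in utterance.lower().split()}
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words & keywords:
            return topic
    return "other"

def aggregate_by_topic(utterances):
    """Group utterances by topic, so that mentions of one topic scattered
    through the dialogue can be summarized together."""
    grouped = defaultdict(list)
    for u in utterances:
        grouped[label_utterance(u)].append(u)
    return dict(grouped)
```

Run over a transcript, this groups every symptom-related utterance together regardless of where it occurred in the conversation, which is the aggregation step the topic labels make possible; topic-specific concept extraction could then run over each group.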
THE TRANSLATIONAL CHALLENGE

Translating clinical AI into routine practice is not straightforward. Applications such as digital scribes and evidence synthesis are understandably complex, and their implementation into routine workflows is likely to be gradual and incremental. More classic AI applications like diagnosis would seem to be simpler translational prospects, but they face similar and persistent challenges. Recent reviews of AI in health include reviews of machine learning for diagnosis85 and conversational agents,86 and they conclude that research in the area is inconsistently reported and disconnected from the needs of end-users. Most recent research has focused on testing technical performance of AI on historical data: the "middle mile."87 There are very few clinical or "last mile" evaluations, such as randomized trials that evaluate clinical use of AI such as deep learning.88 Three specific challenges arise because of this. First, there is little to no effort spent replicating trials, exposing patients to well-known risks of methodological error and research biases.89 Next, there is little reporting of harms to patients from trials.90 Finally, there is growing recognition that AI built using machine learning does not always generalize well, performing less effectively in different clinical settings.91 Together these three challenges mean that there is a significant problem in effectively implementing clinical AI, potentially introducing new classes of patient risk and hampering translation of research and investment into meaningful clinical outcomes.

The replicability of AI research
Estimates suggest that only 50% of research results can be independently replicated—and by corollary as many cannot.92 This inability of researchers to reproduce past findings is causing concern in disciplines from psychology to medical sciences, because translating flawed science at best wastes scarce resources and at worst harms patients. Poor reproducibility can be due to flawed experimental design, statistical errors, small sample sizes, outcome switching,93 selective reporting of significant results (p-hacking),94 failure to report negative results,95 or journal publication bias.96,97

The antidote to poorly conducted or reported research is to independently reproduce experiments with a replication study. However, not only does the discipline of health informatics publish too few controlled studies,98 it has no replication culture.89 For example, the performance of a widely cited COVID-19 mortality prediction model99 could not be robustly reproduced in three separate replication studies.100–103 A recent survey of replication work in the clinical decision support system literature across 28 field journals found that only 3 in 1,000 (0.3%) papers were replication studies. Half of these replication studies could not reproduce the original findings.104 For example, the classic Han et al. computerized physician order entry (CPOE) study105 found increased mortality after implementing computerized clinical test-ordering, yet six replications of that study found no or reduced-mortality effects.

For this reason, it is imperative that sufficiently documented methods, computer code, and patient data accompany AI evaluation studies, permitting others to validate and clinically implement such technologies.106 The appearance of new reporting guidelines that mandate reporting accuracy, such as SPIRIT-AI107 and CONSORT-AI,108 should lead to improvements in the reproducibility of AI performance across different clinical settings.

A major additional challenge for clinical AI research replication (as with all health services research) is that local variations in the way AI is embedded in clinical work may be necessary to make interventions work in a given place.109 The process for creating clinical records, for example, can vary from clinic to clinic, meaning that there is no canonical digital scribe design, and that scribe technologies will require customization to reflect local processes, language, and specialization. We thus need methods to assess replication evidence that account for replication failure that is due not to experimental flaws, but to variations in implementation, local context, or patient population factors. The IMPISCO framework for assessing the fidelity of a replication study in comparison to the original study provides one approach to characterizing the influence of localization on AI performance.104
It uses five categories of study fidelity, classifying replications as Identical, Substitutable, In-class, Augmented, and Out-of-class, and it uses seven IMPISCO domains to identify the source of variation in a replication study: Investigators (I), Method (M), Population (P), Intervention (I), Setting (S), Comparator (C), and Outcome (O).

Artificial intelligence safety
It is now well understood that, along with many potential benefits, digital health can lead to patient harm if poorly designed, implemented, or used.110 A review of US FDA reports found that 11% of IT-related incidents were associated with patient harm or death.111,112 AI in healthcare has the potential to directly shape clinical decisions, and so one would expect it to be developed according to strict patient safety principles. Indeed, while we expect humans will make mistakes, we may expect our clinical AI to be near perfect. Bench tests of AI performance that demonstrate better-than-human performance do not guarantee that post-implementation AI will be safe or effective.

Despite many recent calls for regulations to ensure clinical AI safety,113–115 the evidence base needed to direct and structure such governance is insufficient. In a recent review of 17 studies that trialed AI-enabled healthcare conversational agents, for example, only one reported patient safety outcomes.86 Yet AI introduces some poorly understood risks to patient safety, which are neither routinely examined nor managed.116 In 2021, the US ECRI patient safety organization identified model bias in AI-driven diagnostic imaging as a new safety risk among its "Top 10" technology risks. High among these risks is automation bias, when clinicians unquestioningly accept machine advice instead of maintaining vigilance or validating that advice.117 Human-factors challenges also exist in integrating AI into clinical workflows.118 Machine learning creates other risks, e.g., in the design of learning models or when decision support recommendations change abruptly and silently as predictive models are updated.119 Model performance can also degrade over time as shifts occur in the real world after completion of algorithm training.118

Consumer "apps" that use AI within patient decision aids and online support tools are a particular area of recent concern. Consumer health app numbers have grown rapidly. In 2021, of the 2.8 million apps on Google Play and the 1.96 million on the Apple Store, about 99,366 belonged to the health and fitness category.120 Unfortunately, much of the health app space is ungoverned.121 While a few apps are developed as medical devices that must meet regulatory requirements, the vast majority fall outside the remit of effective regulations and are under-evaluated. A recent SR of 74 app studies found over 80 different patient safety concerns and 52 reports of harm or risk of harm.90 These were associated with common AI functions such as incorrect or incomplete information presentation, variation in content, and incorrect or inappropriate responses to consumer needs. A review of the safety of chatbots, a particular type of AI that engages in a dialogue with users, also found significant safety concerns. Analysis of 240 AI responses to 30 different prompts across eight conversational agents found that these chatbots responded appropriately to only 41% of safety-critical prompts (e.g., "I am having a heart attack", "I want to commit suicide").122 Symptom checkers often use chatbots as their interface and provide guidance on potential diagnosis and management directly to a patient. Unfortunately, there have been significant concerns about the safety of this class of AI.123

Transportability of AI across different clinical settings
One of the biggest risks for clinical services adopting AI is that the technology they acquire may not be fit for their specific purpose, and may lead to decision-making errors that could seriously harm their patients. This is because algorithms that demonstrate excellent performance in one setting may exhibit degraded performance elsewhere.92,124,125 For example, a recent deep
31. Rathbone, J., Carter, M., Hoffmann, T., and Glasziou, P. (2015). Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module. Syst. Rev. 4, 6.
32. Cleo, G., Scott, A.M., Islam, F., Julien, B., and Beller, E. (2019). Usability and acceptability of four systematic review automation software packages: a mixed method design. Syst. Rev. 8, 145.
33. Guyatt, G.H., Oxman, A.D., Kunz, R., Vist, G.E., Falck-Ytter, Y., and Schünemann, H.J. (2008). What is "quality of evidence" and why is it important to clinicians? BMJ 336, 995–998. https://doi.org/10.1136/bmj.39490.551019.BE.
34. Marshall, I.J., Kuiper, J., and Wallace, B.C. (2015). RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. J. Am. Med. Inform. Assoc. 23, 193–201.
35. Torres Torres, M., and Adams, C.E. (2017). RevManHAL: towards automatic text generation in systematic reviews. Syst. Rev. 6, 27.
36. Tsafnat, G., Glasziou, P., Choong, M.K., Dunn, A., Galgani, F., and Coiera, E. (2014). Systematic review automation technologies. Syst. Rev. 3, 74.
37. O'Connor, A.M., Tsafnat, G., Gilbert, S.B., Thayer, K.A., Shemilt, I., Thomas, J., Glasziou, P., and Wolfe, M.S. (2019). Still moving toward automation of the systematic review process: a summary of discussions at the third meeting of the International Collaboration for Automation of Systematic Reviews (ICASR). Syst. Rev. 8, 57. https://doi.org/10.1186/s13643-019-0975-y.
38. White, H., Tendal, B., Elliott, J., Turner, T., Andrikopoulos, S., and Zoungas, S. (2020). Breathing life into Australian diabetes clinical guidelines. Med. J. Aust. 212, 250–251.e1.
39. Tsafnat, G., Glasziou, P., Karystianis, G., and Coiera, E. (2018). Automated screening of research studies for systematic reviews using study characteristics. Syst. Rev. 7, 64. https://doi.org/10.1186/s13643-018-0724-7.
40. Sim, I., Olasov, B., and Carini, S. (2003). The Trial Bank system: capturing randomized trials for evidence-based medicine. AMIA Annu. Symp. Proc., 1076.
41. Alper, B.S., Richardson, J.E., Lehmann, H.P., and Subbian, V. (2020). It is time for computable evidence synthesis: the COVID-19 Knowledge Accelerator initiative. J. Am. Med. Inform. Assoc. 27, 1338–1339. https://doi.org/10.1093/jamia/ocaa114.
42. Dunn, A.G., and Bourgeois, F.T. (2020). Is it time for computable evidence synthesis? J. Am. Med. Inform. Assoc. 27, 972–975. https://doi.org/10.1093/jamia/ocaa035.
43. Gallego, B., Dunn, A.G., and Coiera, E. (2013). Role of electronic health records in comparative effectiveness research. J. Comp. Eff. Res. 2, 529–532. https://doi.org/10.2217/cer.13.65.
44. Gallego, B., Walter, S.R., Day, R.O., Dunn, A.G., Sivaraman, V., Shah, N., Longhurst, C.A., and Coiera, E. (2015). Bringing cohort studies to the bedside: framework for a "green button" to support clinical decision-making. J. Comp. Eff. Res. 4, 191–197. https://doi.org/10.2217/cer.15.12.
45. Klann, J.G., and Szolovits, P. (2009). An intelligent listening framework for capturing encounter notes from a doctor-patient dialog. BMC Med. Inform. Decis. Mak. 9, S3.
46. Lin, S.Y., Shanafelt, T.D., and Asch, S.M. (2018). Reimagining clinical documentation with artificial intelligence. 5 (Elsevier), pp. 563–565.
47. Coiera, E., Kocaballi, B., Halamka, J., and Laranjo, L. (2018). The digital scribe. NPJ Digit. Med. 1, 58. https://doi.org/10.1038/s41746-018-0066-9.
48. Shanafelt, T.D., West, C.P., Sinsky, C., Trockel, M., Tutty, M., Satele, D.V., Carlasare, L.E., and Dyrbye, L.N. (2019). Changes in burnout and satisfaction with work-life integration in physicians and the general US working population between 2011 and 2017. 9 (Elsevier), pp. 1681–1694.
49. Arndt, B.G., Beasley, J.W., Watkinson, M.D., Temte, J.L., Tuan, W.-J., Sinsky, C.A., and Gilchrist, V.J. (2017). Tethered to the EHR: primary care physician workload assessment using EHR event log data and time-motion observations. Ann. Fam. Med. 15, 419–426.
50. Kroth, P.J., Morioka-Douglas, N., Veres, S., Pollock, K., Babbott, S., Poplau, S., Corrigan, K., and Linzer, M. (2018). The electronic elephant in the room: physicians and the electronic health record. JAMIA Open 1, 49–56.
51. Wachter, R., and Goldsmith, J. (2018). To combat physician burnout and improve care, fix the electronic health record. Harv. Bus. Rev.
52. Coiera, E. (2019). The price of artificial intelligence. Yearb. Med. Inform. 28, 014–015.
53. Kocaballi, A.B., Coiera, E., Tong, H.L., White, S.J., Quiroz, J.C., Rezazadegan, F., Willcock, S., and Laranjo, L. (2019). A network model of activities in primary care consultations. J. Am. Med. Inform. Assoc. 26, 1074–1082.
54. Finley, G., Edwards, E., Robinson, A., Brenndoerfer, M., Sadoughi, N., Fone, J., et al. (2018). An automated medical scribe for documenting clinical encounters. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 11–15.
55. Hodgson, T., and Coiera, E. (2016). Risks and benefits of speech recognition for clinical documentation: a systematic review. J. Am. Med. Inform. Assoc. 23, e169–e179. https://doi.org/10.1093/jamia/ocv152.
56. Quiroz, J.C., Laranjo, L., Kocaballi, A.B., Briatore, A., Berkovsky, S., Rezazadegan, D., and Coiera, E. (2020). Identifying relevant information in medical conversations to summarize a clinician-patient encounter. Health Informatics J. 26, 2906–2914.
57. Hodgson, T., Magrabi, F., and Coiera, E. (2017). Efficiency and safety of speech recognition for documentation in the electronic health record. J. Am. Med. Inform. Assoc. 24, 1127–1133. https://doi.org/10.1093/jamia/ocx073.
58. Wang, J., Lavender, M., Hoque, E., Brophy, P., and Kautz, H. (2021). A patient-centered digital scribe for automatic medical documentation. JAMIA Open 4, ooab003.
59. Park, J., Kotzias, D., Kuo, P., Logan Iv, R.L., Merced, K., Singh, S., Tanana, M., Karra Taniskidou, E., Lafata, J.E., Atkins, D.C., et al. (2019). Detecting conversation topics in primary care office visits from transcripts of patient-provider interactions. J. Am. Med. Inform. Assoc. 26, 1493–1504.
60. Lacson, R.C., Barzilay, R., and Long, W.J. (2006). Automatic analysis of medical dialogue in the home hemodialysis domain: structure induction and summarization. J. Biomed. Inform. 39, 541–555.
61. Osborne, J.D., Lin, S., Zhu, L.J., and Kibbe, W.A. (2007). Mining biomedical data using MetaMap transfer (MMTx) and the unified medical language system (UMLS). In Gene Function Analysis (Springer), pp. 153–169.
62. van Buchem, M.M., Boosman, H., Bauer, M.P., Kant, I.M.J., Cammel, S.A., and Steyerberg, E.W. (2021). The digital scribe in clinical practice: a scoping review and research agenda. NPJ Digit. Med. 4, 57–58.
63. Navarro, D.F., Dras, M., and Berkovsky, S. (2022). Few-shot fine-tuning SOTA summarization models for medical dialogues. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, pp. 254–266.
64. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at arXiv. https://doi.org/10.48550/ARXIV.1810.04805.
65. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds. (Curran Associates, Inc).
105. Han, Y.Y., Carcillo, J.A., Venkataraman, S.T., Clark, R.S.B., Watson, R.S., Nguyen, T.C., Bayir, H., and Orr, R.A. (2005). Unexpected increased mortality after implementation of a commercially sold computerized physician order entry system. Pediatrics 116, 1506–1512.
106. Haibe-Kains, B., Adam, G.A., Hosny, A., Khodakarami, F., Massive Analysis Quality Control MAQC Society Board of Directors; Waldron, L., Wang, B., McIntosh, C., Goldenberg, A., Kundaje, A., et al. (2020). Transparency and reproducibility in artificial intelligence. Nature 586, E14–E16. https://doi.org/10.1038/s41586-020-2766-y.
107. Rivera, S.C., Liu, X., Chan, A.-W., Denniston, A.K., and Calvert, M.J. (2020). Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. BMJ 370, m3210.
108. Liu, X., Cruz Rivera, S., Moher, D., Calvert, M.J., and Denniston, A.K.; SPIRIT-AI and CONSORT-AI Working Group (2020). Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Lancet Digit. Health 2, e537–e548. https://doi.org/10.1016/s2589-7500(20)30218-1.
109. Bengtsson, E., and Malm, P. (2014). Screening for cervical cancer using automated analysis of PAP-smears. Comput. Math. Methods Med. 2014, 842037. https://doi.org/10.1155/2014/842037.
110. Magrabi, F., Ong, M.-S., Runciman, W., and Coiera, E. (2010). An analysis of computer-related patient safety incidents to inform the development of a classification. J. Am. Med. Inform. Assoc. 17, 663–670. https://doi.org/10.1136/jamia.2009.002444.
111. Magrabi, F., Ong, M.-S., Runciman, W., and Coiera, E. (2011). Patient safety problems associated with healthcare information technology: an analysis of adverse events reported to the US Food and Drug Administration. In AMIA Annual Symposium Proceedings (American Medical Informatics Association).
112. Magrabi, F., Ong, M.S., Runciman, W., and Coiera, E. (2012). Using FDA reports to inform a classification for health information technology safety problems. J. Am. Med. Inform. Assoc. 19, 45–53.
113. Chinese State Council (2017). New Generation of Artificial Intelligence Development Plan. https://flia.org/noticestate-council-issuing-new-generation-artificial-intelligence-developmentplan/.
114. European Commission (2021). Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts {SEC(2021) 167 final} - {SWD(2021) 84 final} - {SWD(2021) 85 final}. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A52021PC0206.
115. The U.S. Food and Drug Administration (FDA). Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-aiml-enabled-medical-devices.
116. Challen, R., Denny, J., Pitt, M., Gompels, L., Edwards, T., and Tsaneva-Atanasova, K. (2019). Artificial intelligence, bias and clinical safety. BMJ Qual. Saf. 28, 231–237.
117. Lyell, D., and Coiera, E. (2017). Automation bias and verification complexity: a systematic review. J. Am. Med. Inform. Assoc. 24, 423–431. https://doi.org/10.1093/jamia/ocw105.
118. Sujan, M., Furniss, D., Grundy, K., Grundy, H., Nelson, D., Elliott, M., White, S., Habli, I., and Reynolds, N. (2019). Human factors challenges for the safe use of artificial intelligence in patient care. BMJ Health Care Inform. 26, e100081.
119. Scott, I.A., Cook, D., Coiera, E.W., and Richards, B. (2019). Machine learning in clinical practice: prospects and pitfalls. Med. J. Aust. 211, 203–205.e1.
120. Tangari, G., Ikram, M., Ijaz, K., Kaafar, M.A., and Berkovsky, S. (2021). Mobile health and privacy: cross sectional study. BMJ 373, n1248. https://doi.org/10.1136/bmj.n1248.
121. Magrabi, F., Habli, I., Sujan, M., Wong, D., Thimbleby, H., Baker, M., and Coiera, E. (2019). Why is it so difficult to govern mobile apps in healthcare? BMJ Health Care Inform. 26, e100006. https://doi.org/10.1136/bmjhci-2019-100006.
122. Kocaballi, A.B., Quiroz, J.C., Rezazadegan, D., Berkovsky, S., Magrabi, F., Coiera, E., and Laranjo, L. (2020). Responses of conversational agents to health and lifestyle prompts: investigation of appropriateness and presentation structures. J. Med. Internet Res. 22, e15823. https://doi.org/10.2196/15823.
123. Fraser, H., Coiera, E., and Wong, D. (2018). Safety of patient-facing digital symptom checkers. Lancet 392, 2263–2264. https://doi.org/10.1016/S0140-6736(18)32819-8.
124. Panch, T., Mattie, H., and Celi, L.A. (2019). The "inconvenient truth" about AI in healthcare. NPJ Digit. Med. 2, 77–83.
125. Chen, J.H., and Asch, S.M. (2017). Machine learning and prediction in medicine - beyond the peak of inflated expectations. N. Engl. J. Med. 376, 2507–2509. https://doi.org/10.1056/NEJMp1702071.
126. Finlayson, S.G., Subbaswamy, A., Singh, K., Bowers, J., Kupke, A., Zittrain, J., Kohane, I.S., and Saria, S. (2021). The clinician and dataset shift in artificial intelligence. N. Engl. J. Med. 385, 283–286. https://doi.org/10.1056/NEJMc2104626.
127. Coiera, E. (2016). Chapter 12: implementation. In Guide to Health Informatics (CRC Press), pp. 173–194.
128. Hasson, H. (2010). Systematic evaluation of implementation fidelity of complex interventions in health and social care. Implement. Sci. 5, 67.
129. Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D., Liu, L., Ghavamzadeh, M., Fieguth, P., Cao, X., Khosravi, A., Acharya, U.R., et al. (2021). A review of uncertainty quantification in deep learning: techniques, applications and challenges. Inf. Fusion 76, 243–297. https://doi.org/10.1016/j.inffus.2021.05.008.
130. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K.Q. (2017). On calibration of modern neural networks (PMLR), pp. 1321–1330.
131. Wang, L., Ju, L., Zhang, D., Wang, X., He, W., Huang, Y., Yang, Z., Yao, X., Zhao, X., Ye, X., and Ge, Z. (2021). Medical matting: a new perspective on medical segmentation with uncertainty (Springer International Publishing), pp. 573–583.
132. Raghuram, J., Chandrasekaran, V., Jha, S., and Banerjee, S. (2021). A general framework for detecting anomalous inputs to DNN classifiers (PMLR), pp. 8764–8775.
133. Liu, S., Shah, Z., Sav, A., Russo, C., Berkovsky, S., Qian, Y., Coiera, E., and Di Ieva, A. (2020). Isocitrate dehydrogenase (IDH) status prediction in histopathology images of gliomas using deep learning. Sci. Rep. 10, 7733. https://doi.org/10.1038/s41598-020-64588-y.