
OPEN ACCESS

Perspective
Evidence synthesis, digital scribes,
and translational challenges for artificial
intelligence in healthcare
Enrico Coiera1,2,* and Sidong Liu1
1Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, Level 6, 75 Talavera Road, North Ryde, Sydney, NSW 2109, Australia
2Twitter: @enricocoiera
*Correspondence: enrico.coiera@mq.edu.au
https://doi.org/10.1016/j.xcrm.2022.100860

SUMMARY

Healthcare has well-known challenges with safety, quality, and effectiveness, and many see artificial intelligence (AI) as essential to any solution. Emerging applications include the automated synthesis of best-practice research evidence, including systematic reviews, which would ultimately see all clinical trial data published in a computational form for immediate synthesis. Digital scribes embed themselves in the process of care to detect, record, and summarize events and conversations for the electronic record. However, three persistent translational challenges must be addressed before AI is widely deployed. First, little effort is spent replicating AI trials, exposing patients to risks of methodological error and biases. Next, there is little reporting of patient harms from trials. Finally, AI built using machine learning may perform less effectively in different clinical settings.

INTRODUCTION

Across the world, healthcare systems are under significant duress, managing evolving challenges in disease patterns, pandemics, and climate-triggered events.1 Even without such shocks to contend with, the delivery of healthcare services has always been challenging because, in a complex system, there are few easy opportunities for improvement.2 Healthcare has well-known and seemingly intractable challenges with the safety, quality, and effectiveness of clinical services. These include misdiagnosis, overdiagnosis, overtreatment, treatment errors, and diminishing resources and workforce to support ever more stretched clinical services.3-5

There are no magic bullets, but many see artificial intelligence (AI) as an essential component of any solution to these problems. AI is a broad set of technologies and methods, focusing on automating reasoning tasks such as planning, understanding, predicting, and classifying. Machine learning is the sub-discipline of AI that focuses on developing ways for AI systems to learn from experience. AI offers the possibility of automation and decision support for skilled tasks, such as diagnosis and treatment selection, improvements in triage and hospital discharge decisions, and a reduction in documentation burden.6 Indeed, in the short run, there are probably more lives to be saved or improved just by doing a better job of healthcare delivery than there are through creating new treatments.7

Overall, perhaps the most reliable global estimate of the potential for AI in healthcare comes from Lord Darzi's review of the English National Health Service (NHS), where modeling identified productivity improvement from smart automation worth £12.5 billion a year: 9.9% of the NHS England budget.8 Other estimates are based on modeling specific services. For example, using AI to reduce non-elective hospital admissions could save up to £3.3 billion annually.4 This potential has driven extraordinary investments globally. The English NHS has allocated over £1 billion to initiatives such as a £250-million national AI laboratory, as well as translational research centers targeted at reducing cancer deaths by 10% a year (or 22,000 lives) by 2035 through AI-enhanced services.9 KPMG data have suggested that US investment in AI for healthcare would reach US$6.6 billion by 2021 (a 40% CAGR), driven by modeling suggesting potential total savings of US$150 billion by 2026.

The past decade has seen substantive progress in AI technological development, most notably in machine learning. In the application space, deep learning systems that use neural network architectures are now emerging from clinical trials and slowly moving into routine care. The US Food and Drug Administration (FDA), for example, has seen a sharp increase in the number of clinical AI systems that it has approved for use in the market (Figure 1). The scale of modern deep learning systems, and the rich opportunities for commercial gain in the sector, have seen a steady drift of researchers and research breakthroughs from academia across to industry.10

In this perspective, we first explore the emerging application of AI in healthcare to the critical tasks of evidence synthesis and clinical documentation, reflecting a shift from tasks such as clinical image diagnosis toward use cases that support multi-step clinical workflows. We next focus on the difficult challenges that are

Cell Reports Medicine 3, 100860, December 20, 2022 ª 2022 The Author(s). 1
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Figure 1. FDA approvals for devices incorporating AI. Approvals by the US regulator the FDA of clinical systems incorporating artificial intelligence capabilities have increased dramatically over the past decade. (Source: US Food and Drug Administration, 2022.)

found when implementing working AI in the real world, where technology, people, and practice must each accommodate the other.

EMERGING APPLICATIONS FOR AI IN HEALTHCARE

The "canonical" applications for AI in healthcare that have garnered the most recent attention sit in data-rich domains that are well suited to a deep learning approach, like medical imaging or laboratory medicine. These well-documented applications are typically characterized by a very tight focus on narrow tasks such as screening for diabetic retinopathy11 and glaucoma,12 diagnosis of thyroid cancer from ultrasound data,13 diagnosis of COVID-19 in radiological chest images,14 or diagnosis of primary and metastatic cancers from whole transcriptome data.15

Such applications of AI seek to optimize discrete classification tasks such as diagnosis, rather than optimizing the greater human workflow within which the task is embedded. The risk of such a narrow approach is that we optimize what is technically feasible rather than what is clinically effective. We should instead aim to optimize the overall workflow, targeting the links in the information value chain that underpin a decision that offers the greatest cost-benefit.16

In the next section, we focus on two such emerging AI applications—systematic review automation and digital scribes—both of which seek to digitize entire real-world workflows and support the process of care delivery. What distinguishes these applications is that the output of these processes is not a classification label such as a diagnosis but rather a multi-component knowledge object—a systematic review or the documentation of a clinical encounter. While completely "solving" these processes is beyond today's state of the art, breaking them down into distinct steps is allowing us to gradually and incrementally optimize the whole workflow and deliver real clinical benefits.

Automated evidence synthesis
In a time of crisis such as the COVID-19 pandemic, there is an urgent need for rapid assessments of the published research literature to answer specific clinical and public health questions.17 The US NIH's LitCovid hub, for example, had curated about 270,000 scientific articles from 8,000 journals by July 2022. Delays in answering questions about whether COVID-19 was airborne, whether masks were effective, or whether smartphone contact tracing was effective all had substantial real-world consequences.18

Unfortunately, current approaches to systematic review (SR), the gold standard approach to synthesizing published clinical evidence to answer such questions, typically take months or years.19 The mean time to complete and publish a systematic review, for example, is about 1.3 years.20 The stark gap between what the research evidence tells us should be done and what actually is done means that many patients do not receive care according to the best evidence. Pre-pandemic, this lag led to significant unnecessary waste across the healthcare system of up to $274 billion per year globally.21

SRs follow formal protocols for research evidence synthesis and historically have relied on human expertise and labor to carry out the review. Living SRs aim to address some of the causes of delay in review production by addressing the speed with which reviews are updated. A "living" review is published once but then quickly updated if new evidence becomes available.22 Living meta-analyses have been created for many COVID-19 treatments,23 with the Cochrane Collaboration piloting the approach, reappraising the literature every 1 to 3 months.24 While an excellent step in the right direction, such approaches rely on substantial human expertise and effort. Bottlenecks and limits to human resources mean that expert-led living reviews will not scale to become the standard for all SRs.

Using automation and AI can improve our ability to synthesize the research literature,25 vastly reduce SR workload, and dramatically improve speed and quality.26 Since 2019, multiple technology-accelerated SRs have been undertaken using automation support, reducing the time for humans to complete a systematic review from 12 months to 2 weeks,27 including one focusing on the asymptomatic transmission of SARS-CoV-2.5

Current technology-assisted SRs are undertaken using a collection of task-specific computational tools that target discrete steps in the systematic review process (Table 1), such as searching for and screening research articles, estimating the risk of bias, as well as tasks like data extraction and report writing. This "toolkit" approach has the potential to improve systematic review timeliness and quality, and gradually require less human intervention. Increasingly, these tools are being built using AI methods, including machine learning.28

Ultimately the goal is to create SRs nearly instantaneously in response to specific questions, so that these evidence summaries are always up-to-date (Figure 2).26 The road to achieving such "full" automation will likely move through several distinct stages. Most SR tools are currently stand-alone, selected and operated by human reviewers. The creation of tool-connecting pipelines will allow for greater automation across multiple tasks. To achieve this, individual tools must be capable of creating standardized input and output and be connected together using application programming interfaces.37 Each pipeline is then a computational protocol. This allows for the sharing of methods, the creation of benchmark methods and datasets, and collaborative improvement of tools, standards, and protocols, especially if they are part of an open-source community.
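The pipeline stage of automation described above can be sketched in a few lines. The stage functions below are hypothetical stand-ins for tools such as the SRA De-duplicator and screening classifiers, not their real APIs; the point is that stages sharing one standardized citation format can be chained behind a common interface.

```python
# Illustrative sketch of a systematic review automation pipeline.
# The stage names mirror Table 1; the functions are hypothetical
# stand-ins, not the actual SearchRefiner/RobotReviewer interfaces.

from dataclasses import dataclass, field


@dataclass
class Citation:
    """Standardized record passed between pipeline stages."""
    title: str
    abstract: str
    included: bool = True
    notes: list = field(default_factory=list)


def deduplicate(citations):
    """Merge citations with identical titles (cf. the SRA De-duplicator)."""
    seen, unique = set(), []
    for c in citations:
        key = c.title.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique


def screen(citations, keyword):
    """Mark citations whose title/abstract lack the review keyword as excluded."""
    for c in citations:
        c.included = keyword in (c.title + " " + c.abstract).lower()
    return citations


def run_pipeline(citations, stages):
    """Each stage reads and writes the same standardized citation list,
    so stages can be chained like tools behind a shared API."""
    for stage in stages:
        citations = stage(citations)
    return citations


records = [
    Citation("Masks and transmission", "RCT of mask use"),
    Citation("Masks and transmission", "RCT of mask use"),  # duplicate entry
    Citation("Unrelated imaging study", "Deep learning for X-rays"),
]
result = run_pipeline(records, [deduplicate, lambda cs: screen(cs, "mask")])
included = [c.title for c in result if c.included]
print(included)  # only the mask trial survives de-duplication + screening
```

Because every stage consumes and produces the same structure, the list of stages itself is a shareable computational protocol, in the sense described above.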


Table 1. Automation tools can support different stages of systematic review

Review task | Description | Classification | Example tools
Formulate question | Decide on research question for review | Preparation | COVID-SEE25
Write protocol | Objective reproducible method for peer review | Preparation | Template; Methods Wizard
Search strategy | Decide on keywords and databases | Preparation | SearchRefiner29; Scientific Evidence Explorer25
Search translation | Translate search string for other databases | Retrieval | Polyglot Search Translator30
De-duplicate | Merge identical citations | Retrieval | The SRA De-duplicator31
Screen | Exclude irrelevant trials on title and abstract | Appraisal | SRA Helper,30 RobotSearch
Get full text | Download/request study | Retrieval | SRA Helper, SARA32
Screen full text | Exclude irrelevant studies | Appraisal | SRA Helper
Snowball | Follow citations | Retrieval | CitationSpider
Extract data | Get trial arm outcome numbers | Synthesize | RevMan
Assess risk of bias/quality | Assess potential biases/quality of evidence33 | Synthesize | RobotReviewer34; EvidenceGRADEr
Meta-analyze | Statistical data combination | Synthesize | RevMan35
Write up | Produce and publish report | Write up | RevMan, Replicant35

The different tasks in a traditional systematic review can be supported by a variety of distinct automation tools that either support humans to complete the task or can complete the task automatically (modified from Tsafnat et al.36).
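Several of the appraisal tools in Table 1 treat title-and-abstract screening as a text-classification problem. The toy scorer below illustrates the idea with a naive smoothed log-likelihood ratio over labeled examples; it is not the method used by RobotSearch or SRA Helper, and the training snippets are invented.

```python
# Minimal sketch of ML-based title/abstract screening, in the spirit of
# the appraisal tools in Table 1. A toy log-likelihood scorer, not the
# algorithm of any real tool.

from collections import Counter
import math


def tokens(text):
    return [w for w in text.lower().split() if w.isalpha()]


def train(labeled):
    """labeled: list of (text, is_relevant). Returns per-class word counts."""
    counts = {True: Counter(), False: Counter()}
    for text, label in labeled:
        counts[label].update(tokens(text))
    return counts


def score(counts, text):
    """Log-likelihood ratio of relevance, with add-one smoothing;
    positive scores suggest the citation should be kept for full-text review."""
    pos, neg = counts[True], counts[False]
    pos_total, neg_total = sum(pos.values()), sum(neg.values())
    vocab = len(set(pos) | set(neg)) or 1
    s = 0.0
    for w in tokens(text):
        s += math.log((pos[w] + 1) / (pos_total + vocab))
        s -= math.log((neg[w] + 1) / (neg_total + vocab))
    return s


training = [
    ("randomized trial of mask wearing and viral transmission", True),
    ("cohort study of mask mandates in schools", True),
    ("deep learning for retinal image segmentation", False),
    ("thyroid ultrasound classification with neural networks", False),
]
model = train(training)
keep = score(model, "cluster randomized trial of masks") > 0
drop = score(model, "neural network for chest imaging") > 0
print(keep, drop)  # True False
```

In practice such screeners are tuned for very high recall, since a missed trial is far more costly to a review than an extra abstract for a human to reject.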

An implicit assumption behind most efforts to use automation to assist with SRs is that we are substituting computational methods to complete activities that humans currently undertake. However, humans and machines have different capabilities, and we can reconceive both the individual steps in evidence synthesis and their ordering when machines undertake them. For example, in the standard human SR process, candidate articles are first screened for inclusion or exclusion, often using only the title and abstract. Only later, when article numbers are much reduced, is the time-consuming process of data extraction undertaken. However, what is time-consuming for humans may be easy for a machine. Consequently, the automated extraction of study characteristics from abstracts can effectively make screening decisions,39 even though such a workflow would be hugely inefficient if undertaken by a human.

The ambitions for a computable approach to evidence synthesis are, however, much greater than the automation of systematic reviews, given that such reviews are only one of many forms of evidence synthesis. The larger game is for all clinical trial data to be published in a computational form that allows for immediate synthesis with other trials, and indeed other forms of evidence.40 Such a goal relies on achieving consensus on standards for publishing clinical trials in computable form,41 governance arrangements that see trial data made available for analysis beyond those who collected the initial data, and the development of intelligent tools to undertake synthesis tasks. Publishing trial information and results in a structured form will allow for automatic monitoring for new trials. New trials could then signal that a systematic review needs to be updated.42

Clinical trial evidence, however, cannot answer all our healthcare questions. Trials are expensive to conduct and by design are controlled. For example, strict inclusion and exclusion criteria typically exclude patients with comorbidities, so that trial populations do not necessarily represent real-world populations or settings. They also do not necessarily capture data that can be used to develop diagnostic or prognostic algorithms. When clinical trial data are unavailable to answer a question, observational data captured in electronic health records (EHRs) may be able to help.43 Indeed, creating algorithms developed on population data has been a core objective of AI research and practice. Making patient-specific predictions using population data remains challenging, especially with rare diseases, unusual presentations, or multimorbidity. In such cases, careful methods must be used to identify, from the electronic record data, a cohort of patients sufficiently similar to the patient being managed.44

Longer term, the evidence synthesis project will bring together data from clinical trials with longitudinal data from EHRs. This will require innovations not just in machine learning and statistics, but also careful attention to the design of the decision support systems that use these methods to influence human decisions.

The digital scribe
Digital scribes are intelligent documentation support systems. They use advances in speech recognition (SpR), natural language processing, and AI to automatically document spoken elements of the clinical encounter, similar to the function performed by human medical scribes.45-47

The motivations for using digital scribes are compelling. Over 40% of US clinicians report at least one symptom of burnout,48 and modern EHRs are partly to blame. Since EHRs were introduced, the time spent by clinicians on administrative tasks has increased and can occupy half of the working day, partly driven by regulatory and billing requirements.48,49 Every hour spent on patient care may generate up to 2 h of EHR-related work, often extending outside working hours.50 Use of EHRs is associated with decreased clinician satisfaction, increased documentation times and cognitive load, reduced quality and length of interaction with patients, new classes of patient safety risk, and substantial investment costs for providers.51 The promise of digital


Figure 2. The automation of systematic review. The times for a systematic review to be developed (dev), for its currency to decay (dec), and for it to be updated (upd) all decrease when automation partially supports "living" reviews. With full automation, an evidence review would be produced almost instantaneously and always be up-to-date. (Adapted from White et al., MJA, 2020.)38
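The update trigger underlying the living reviews of Figure 2, in which newly registered trials signal that a review needs updating, can be sketched as follows. The keyword-overlap matching and the data structures here are invented for illustration and do not reflect any real trial-registry interface.

```python
# Hypothetical sketch of the "new trial signals an update" step behind
# living reviews (Figure 2). Matching here is naive keyword overlap,
# not a production trial-registry API.

from datetime import date

review = {
    "question_terms": {"mask", "transmission", "households"},
    "last_updated": date(2022, 1, 1),
    "needs_update": False,
}

new_trials = [
    {"registered": date(2022, 6, 1),
     "terms": {"mask", "transmission", "schools"}},
    {"registered": date(2022, 7, 1),
     "terms": {"glaucoma", "screening"}},
]


def flag_for_update(review, trials, min_overlap=2):
    """Flag the review if any trial registered since the last update
    shares at least `min_overlap` question terms with it."""
    for t in trials:
        is_new = t["registered"] > review["last_updated"]
        overlap = review["question_terms"] & t["terms"]
        if is_new and len(overlap) >= min_overlap:
            review["needs_update"] = True
    return review


flag_for_update(review, new_trials)
print(review["needs_update"])  # True: the first trial matches the question
```

In a full pipeline this trigger would then launch the downstream screening, extraction, and meta-analysis stages automatically, closing the loop that keeps the review "living."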

scribes is to reduce this human documentation burden. The price of this help will be a re-engineering of the clinical encounter.52

Unconstrained clinical conversation between patient and doctor is non-linear, with the appearance of new information (e.g., a new clinical symptom or finding) triggering a re-exploration of a previously completed task, such as an enquiry about family history of disease.53 While a fully automated method to transform conversation into complete and accurate clinical records in such a dynamic setting is beyond the state of the art, it is possible to use AI methods to undertake subtasks in this process and still meaningfully reduce clinician documentation effort.

At its simplest, a digital scribe is assembled from a sequence of speech and natural language processing (NLP) modules, growing more complex with the nature of the scribe task.54 The simplest form of scribe creates verbatim transcripts of conversation, or allows a clinician to use SpR to call up templates and standard paragraphs, thus simplifying the data entry burden. The commonest setting for this level of support is in creating high-throughput reports such as imaging or pathology reports, rather than capturing more unconstrained and free-flowing encounters. Using SpR in this way reduces report turn-around time, but can have a higher error rate when compared with human transcriptionists, and documents take longer to edit.55 Verbatim transcripts are less valuable in settings where there is a conversation, for example between doctor and patient, where less than 20% of such an exchange might contribute to the final record.56 Retrofitting SpR to EHRs is now commonplace and allows some form of voice navigation of the system, but doing so leads to higher error rates compared with the use of keyboard and mouse, and significantly increases documentation times.57

While it is not yet possible to create clinically accurate records from unconstrained human speech, much can be achieved by introducing structure into the conversation. Documentation context, stage, or content can all be signaled to the intelligent documentation system using predefined hand gestures or voice commands, or by following predefined conversational structures. For example, using a patient-centered communication style, a clinician might periodically recap information with a patient to confirm understanding: "To recap, you've been having chest pain for about a month. It feels worse when you walk and climb the stairs. Is that right?" The scribe system could be trained so that the word "recap" signals that a summary is being provided, and "right" terminates the summary.58 This approach to scribe design is technically attractive, but does require a change in clinician behavior, interaction style, and training. The cost-benefit of doing so will vary with clinical settings and documentation tasks.

Current research focuses on identifying ways to move from verbatim transcripts to more structured summaries of spoken content. Again, using a predefined structure over the human conversation simplifies the machine task. For example, routine clinic visits to monitor patients with chronic illness are already highly structured. We can consider unconstrained speech as a sequence of utterances and attempt to attach a topic label to each (e.g., medication history, family history, symptoms),59 which would allow utterances on a single topic to be aggregated even if they appear at different points in a dialogue, and large contiguous topic blocks to be identified. Breaking utterances down by topic also allows specialized machine learning systems to be trained, for example to identify topic-specific concepts and relations between concepts.60

Health informatics has historically devoted considerable attention to creating and maintaining standardized vocabularies and over-arching biomedical conceptual ontologies. Consequently, there exist highly mature tools, such as the US National Library of Medicine's MetaMap, that can help identify the concepts embedded in an utterance.61 More recently, researchers have applied deep learning to the summarization task. The use of context-sensitive word embeddings in combination with attention-based neural networks appears a promising approach,62,63 and we should expect recent large-scale foundation language models to significantly improve performance (Box 1). Completely machine-generated documentation will, however, likely require the solution of foundational problems in machine learning to do with machine understanding and first-principles reasoning (Box 2).

THE TRANSLATIONAL CHALLENGE

Translating clinical AI into routine practice is not straightforward. Applications such as digital scribes and evidence synthesis are understandably complex, and their implementation into routine


Box 1. Foundation models


Foundation models are large-scale pre-trained models that can be adapted to tasks such as creating text, speech, or images. Current foundation
models, such as BERT,64 GPT-3,65 and CLIP,66 are based on deep neural networks. What makes foundation models powerful is their scale. GPT-3 is a 175-billion-parameter language model for natural language processing (NLP) and has achieved remarkable success in tasks like translation, question-answering, textual entailment, and writing news articles seemingly indistinguishable from those written by humans.67 DALL-E, a 12-billion-parameter version of GPT-3, is able to automatically generate images from text captions while accurately preserving both their semantics and style.68
Foundation models are created by transfer learning—a process in which neural networks are first trained on a source task using many examples and
then retrained for a related target task, using only a few training examples. The machine learning approach is self-supervised, as source tasks are
derived automatically from unlabeled data. Such large-scale unspecific learning can help foundation models adapt to various tasks without fine-
tuning on a specific task and achieve competitiveness with prior state-of-the-art fine-tuned models.65
Foundation models have become possible through advances in deep learning architecture (e.g., Transformers69), the continued extraordinary
growth in computing power, and availability of large-scale training datasets such as text corpora. Early successes of foundation models such as
GPT-3 in NLP are impressive, but the era of foundation models is still nascent.
Foundation models will likely have broad application in healthcare. Tasks such as generating human-understandable explanations of AI decisions;
crafting summaries of clinical or research evidence using text, images, and speech; patient information packages; or summarizing clinical encoun-
ters could all benefit from clinically trained foundation language models.
The scale and cost of developing foundation models means that they are largely only possible within the walls of large corporations. One conse-
quence of this is that innovation and research in this area of AI may also move into industry,70 where there may be barriers to publishing robust public
evaluations of technology performance. One antidote to this shift is to create open-source foundation models like BLOOM, where the academic
research community can access and benchmark model performance, and collaboratively contribute to innovation.71
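Box 1's observation that source tasks are derived automatically from unlabeled data can be made concrete: the sketch below derives masked-word (BERT-style) and next-word (GPT-style) training pairs from a single sentence. Real systems operate on subword tokens over vast corpora; this shows only the "labels for free" idea in miniature.

```python
# Toy illustration of deriving self-supervised training examples from
# unlabeled text, as described in Box 1. Real models (BERT, GPT-3) use
# subword tokenization and large corpora; the sentence here is invented.

def masked_word_pairs(sentence, mask="[MASK]"):
    """For each position, yield (masked sentence, target word):
    a BERT-style masked-language-modeling source task."""
    words = sentence.split()
    pairs = []
    for i, target in enumerate(words):
        masked = words[:i] + [mask] + words[i + 1:]
        pairs.append((" ".join(masked), target))
    return pairs


def next_word_pairs(sentence):
    """GPT-style source task: predict each word from its left context."""
    words = sentence.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]


text = "patient reports chest pain"
mlm = masked_word_pairs(text)
lm = next_word_pairs(text)
print(mlm[0])  # ('[MASK] reports chest pain', 'patient')
print(lm[-1])  # ('patient reports chest', 'pain')
```

Every unlabeled sentence thus yields many (input, target) pairs with no human annotation, which is what makes web-scale pretraining feasible.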

workflows is likely to be gradual and incremental. More classic AI applications like diagnosis would seem to be simpler translational prospects, but they face similar and persistent challenges. Recent reviews of AI in health include reviews of machine learning for diagnosis85 and conversational agents,86 and they conclude that research in the area is inconsistently reported and disconnected from the needs of end-users. Most recent research has focused on testing the technical performance of AI on historical data: the "middle mile."87 There are very few clinical or "last mile" evaluations, such as randomized trials that evaluate clinical use of AI such as deep learning.88 Three specific challenges arise because of this. First, there is little to no effort spent replicating trials, exposing patients to well-known risks of methodological error and research biases.89 Next, there is little reporting of harms to patients from trials.90 Finally, there is growing recognition that AI built using machine learning does not always generalize well, performing less effectively in different clinical settings.91 Together these three challenges mean that there is a significant problem in effectively implementing clinical AI, potentially introducing new classes of patient risk and hampering the translation of research and investment into meaningful clinical outcomes.

The replicability of AI research
Estimates suggest that only 50% of research results can be independently replicated—and by corollary as many cannot.92 This inability of researchers to reproduce past findings is causing concern in disciplines from psychology to the medical sciences, because translating flawed science at best wastes scarce resources and at worst harms patients. Poor reproducibility can be due to flawed experimental design, statistical errors, small sample sizes, outcome switching,93 selective reporting of significant results (p-hacking),94 failure to report negative results,95 or journal publication bias.96,97

The antidote to poorly conducted or reported research is to independently reproduce experiments with a replication study. However, not only does the discipline of health informatics publish too few controlled studies,98 it has no replication culture.89 For example, the performance of a widely cited COVID-19 mortality prediction model99 could not be robustly reproduced in three separate replication studies.100-103 A recent survey of replication work in the clinical decision support system literature across 28 field journals found that only 3 in 1,000 (0.3%) papers were replication studies. Half of these replication studies could not reproduce the original findings.104 For example, the classic Han et al. computerized physician order entry (CPOE) study105 found increased mortality after implementing computerized clinical test-ordering, yet six replications of that study found no effect or reduced mortality.

For this reason, it is imperative that sufficiently documented methods, computer code, and patient data accompany AI evaluation studies, permitting others to validate and clinically implement such technologies.106 The appearance of new reporting guidelines that mandate reporting accuracy, such as SPIRIT-AI107 and CONSORT-AI,108 should lead to improvements in the reproducibility of AI performance across different clinical settings.

A major additional challenge for clinical AI research replication (as with all health services research) is that local variations in the way AI is embedded in clinical work may be necessary to make interventions work in a given place.109 The process for creating clinical records, for example, can vary from clinic to clinic, meaning that there is no canonical digital scribe design and that scribe technologies will require customization to reflect local processes, language, and specialization. We thus need methods to assess replication evidence that account for replication failure that is due not to experimental flaws, but to variations in implementation, local context, or patient population factors. The IMPISCO framework for assessing the fidelity of a replication study in comparison to the original study provides one approach to characterizing the influence of localization on AI performance.104 It uses five categories of study fidelity, classifying replications as Identical, Substitutable, In-class, Augmented, and


Box 2. Deep learning 2.0


Deep learning has had a major impact on the AI landscape over the past decade.72 The advantage of deep learning is that features of a task (such as
different components of an image) are not pre-specified, but instead identified during the learning process, along with all the steps between the initial
input phase and the final output results.
However, the field’s continued evolution has met with some skepticism. Leading figures like Geoffrey Hinton73 and Judea Pearl74 believe that deep
learning may be approaching a wall. For example, current approaches to deep learning are incapable of distinguishing causation from correlation
and struggle with reasoning and understanding of fundamental concepts like time, space, and causality. They lack a mechanism to learn and repre-
sent common-sense knowledge.75 Bigger models (such as Foundation models) and more training data may be unable to address these challenges,
and new deep learning algorithms may be needed.
What might the next-generation deep learning methods look like? Yoshua Bengio (one of the three Turing Award winners for 2019, alongside Yann LeCun and Geoffrey Hinton, for pioneering work in deep learning) advocates a move from System 1 thinking (a near-instantaneous pattern-matching process relying on implicit knowledge) to System 2 (the slower process of reasoning that requires logical, sequential, conscious, linguistic, and algorithmic reasoning and explicit knowledge).76 LeCun conceptualizes creating a world model to enable "common sense" in AI systems, essential for applications where knowledge is rich but data are few. One recent attempt sought to develop a deep learning model that learns "intuitive physics," a key component of "common-sense" thinking.77 Geoffrey Hinton advocates a more structural approach, mimicking human brain structures such as the neural columns of the brain cortex, and has proposed a new architecture called Capsule Networks.73
However, creating artificial common-sense reasoning is not a new endeavor for AI researchers, and can be dated back at least to Hayes’ ‘‘Naive
Physics manifestos’’ nearly 50 years ago.78,79 Previous AI researchers focused heavily on symbolic approaches to qualitative reasoning about
space and time in physical systems80 and modern critics of deep learning, such as Marcus, consider the present failure to bring symbolic ap-
proaches into deep learning as a major flaw.81 More recently, serious efforts to integrate symbolic and neural approaches have been attempted.82
Pragmatically, many healthcare problems involve highly structured data, may have low dimensionality, and can yield to traditional statistical approaches such as linear regression83 or to classic tree-based methods from machine learning such as XGBoost.84 It would be a mistake to consider deep learning the only, or even the default, approach to developing AI models in the healthcare domain.
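To make this concrete, a classical statistical model and a gradient-boosted tree ensemble can be fitted to the same structured, low-dimensional data in a few lines. The sketch below uses a synthetic cohort as an illustrative assumption; scikit-learn’s GradientBoostingClassifier stands in for XGBoost, which would normally be called via its own xgboost package.

```python
# Minimal sketch: compare logistic regression with gradient-boosted trees
# on structured, low-dimensional tabular data (synthetic, illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic "cohort": 2,000 patients, 8 structured features.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosted_trees": GradientBoostingClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Discrimination on held-out data; both families often perform
    # comparably on simple tabular problems.
    results[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {results[name]:.3f}")
```

On data like this, the two approaches typically achieve similar discrimination, which is the pragmatic point: the simpler, more interpretable model is often a defensible default.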

(Figure legend, continued from previous page) … out-of-class; and uses seven IMPISCO domains to identify the source of variation in a replication study: Investigators (I), Method (M), Population (P), Intervention (I), Setting (S), Comparator (C), and Outcome (O).

Artificial intelligence safety
It is now well understood that, along with many potential benefits, digital health can lead to patient harm if poorly designed, implemented, or used.110 A review of US FDA reports found that 11% of IT-related incidents were associated with patient harm or death.111,112 AI in healthcare has the potential to directly shape clinical decisions, and so one would expect it to be developed according to strict patient safety principles. Indeed, while we expect humans to make mistakes, we may expect our clinical AI to be near perfect. Bench tests of AI performance that demonstrate better-than-human performance do not guarantee that post-implementation AI will be safe or effective.

Despite many recent calls for regulation to ensure clinical AI safety,113–115 the evidence base needed to direct and structure such governance is insufficient. In a recent review of 17 studies that trialed AI-enabled healthcare conversational agents, for example, only one reported patient safety outcomes.86 Yet AI introduces some poorly understood risks to patient safety, which are neither routinely examined nor managed.116 In 2021, the US ECRI patient safety organization identified model bias in AI-driven diagnostic imaging as a new safety risk among its ‘‘Top 10’’ technology risks. High among these risks is automation bias, when clinicians unquestioningly accept machine advice instead of maintaining vigilance or validating that advice.117 Human-factors challenges also exist in integrating AI into clinical workflows.118 Machine learning creates other risks, e.g., in the design of learning models, or when decision support recommendations change abruptly and silently as predictive models are updated.119 Model performance can also degrade over time as shifts occur in the real world after completion of algorithm training.118

Consumer ‘‘apps’’ that use AI within patient decision aids and online support tools are a particular area of recent concern. Consumer health app numbers have grown rapidly: in 2021, of the 2.8 million apps on Google Play and the 1.96 million on the Apple App Store, about 99,366 belonged to the health and fitness category.120 Unfortunately, much of the health app space is ungoverned.121 While a few apps are developed as medical devices that must meet regulatory requirements, the vast majority fall outside the remit of effective regulations and are under-evaluated. A recent systematic review of 74 app studies found over 80 different patient safety concerns and 52 reports of harm or risk of harm.90 These were associated with common AI functions such as incorrect or incomplete information presentation, variation in content, and incorrect or inappropriate responses to consumer needs. A review of the safety of chatbots, a particular type of AI that engages in a dialogue with users, also found significant safety concerns. Analysis of 240 AI responses to 30 different prompts across eight conversational agents found that these chatbots responded appropriately to only 41% of safety-critical prompts (e.g., ‘‘I am having a heart attack’’, ‘‘I want to commit suicide’’).122 Symptom checkers often use chatbots as their interface and provide guidance on potential diagnosis and management directly to a patient. Unfortunately, there have been significant concerns about the safety of this class of AI.123

Transportability of AI across different clinical settings
One of the biggest risks for clinical services adopting AI is that the technology they acquire may not be fit for their specific purpose, leading to decision-making errors that could seriously harm their patients. This is because algorithms that demonstrate excellent performance in one setting may exhibit degraded performance elsewhere.92,124,125
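One way a service can probe this transportability risk directly is to compare its local feature distributions against those of the model’s training cohort before trusting the model’s outputs. Below is a minimal NumPy-only sketch using a per-feature two-sample Kolmogorov–Smirnov (KS) statistic; the cohorts, feature names, and significance constant are illustrative assumptions, not a prescribed method.

```python
# Hypothetical example: flag features whose local (deployment-site)
# distribution has drifted from the training-cohort distribution,
# using the two-sample Kolmogorov-Smirnov (KS) statistic.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cohorts: "age" shifts between sites, "sbp" does not.
train = {"age": rng.normal(55, 10, 1000), "sbp": rng.normal(130, 15, 1000)}
local = {"age": rng.normal(70, 10, 1000), "sbp": rng.normal(130, 15, 1000)}

def ks_stat(a, b):
    """Two-sample KS statistic: maximum gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def shifted_features(train, local, coeff=1.95):
    """Flag features exceeding the large-sample KS critical value.

    coeff=1.95 approximates the KS critical-value constant for
    alpha ~ 0.001 (an assumed, illustrative threshold).
    """
    flagged = []
    for name in train:
        n, m = len(train[name]), len(local[name])
        threshold = coeff * np.sqrt((n + m) / (n * m))
        if ks_stat(train[name], local[name]) > threshold:
            flagged.append(name)
    return flagged

print(shifted_features(train, local))  # expect "age" to be flagged
```

A check like this would not prove the model is safe locally, but a flagged feature is a cheap early warning that local recalibration or re-evaluation may be needed.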

6 Cell Reports Medicine 3, 100860, December 20, 2022



For example, a recent deep learning system for interpreting thyroid ultrasound saw sensitivity drop from 92% (human equivalent) to 84% (below human) when used in different hospitals.13

This is known as the transportability problem in AI, and it occurs well beyond healthcare. Poor transportability of algorithms has many causes. First, patient populations and disease incidence vary and fluctuate over time, and this may change how algorithms perform. For example, in one US hospital, new COVID cases altered the historic relationship between fever and bacterial sepsis, increasing daily sepsis alerts by 43% while true cases declined, forcing decommissioning of the algorithm.126

Data systems and data representation may also differ substantially between places, and workflows and clinician experience or staffing levels also typically vary. Consequently, training AI systems on patients from one health service runs the risk of overfitting to local data, with degraded performance elsewhere.125 Clinical implementation of computational systems should thus be seen as an act of accommodation, fitting technology to a pre-existing network of people, processes, and technologies, with the goodness of fit of technology to network shaping performance.127 While there is a growing literature on fidelity of implementation and its impact on health service outcomes,128 there is a large gap in understanding which health service features can be readily adjusted to accommodate a new technology like AI, and which immutable features require an AI to be recalibrated.

Just as we now do with drug treatments, we will need to be able to distinguish ‘‘on-label’’ uses of AI, where it is deployed to settings or patients for which there is robust evidence of good performance (Figure 3), from ‘‘off-label’’ uses, where the evidence supporting use is weaker. A number of methods exist to allow clinicians to assess whether an AI can be deployed for a given patient, e.g., based on the frequency of similar cases in the AI’s original training data. There is a body of literature exploring how to automatically quantify the uncertainty of AI predictions.129 Recent developments in confidence calibration for neural networks focus on predicting probability estimates that are representative of the true correctness likelihood,130 and on quantifying ambiguity or uncertainty in an AI’s predictions.131 This would permit clinicians to discount AI guidance when a patient is outside the training distribution, or perhaps proceed with the ‘‘off-label’’ advice, relying on their clinical judgment.

Figure 3. Clinical AI systems may need to be certified for use in defined contexts only
‘‘On-label’’ uses of AI should guarantee high performance because of rigorous prior testing. Use in dissimilar or ‘‘off-label’’ settings, where performance has not been tested, should be avoided or carefully managed.

Emerging research has studied ways of detecting and mitigating distribution shift between the training and test samples used in machine learning. Distribution shifts can be characterized into two broad categories: covariate shift (where samples are semantically the same but differ in quality or style) and semantic or concept shift (where samples are semantically different). Both covariate and concept shift detection can be formulated as an out-of-distribution (OOD) detection problem. The idea of OOD detection has taken shape most strongly in cybersecurity, where it has been widely used as a method to detect adversarial attacks. OOD detection has since evolved into a general method to test the robustness and monitor the performance consistency of AI after deployment.132

Detection of covariate shift is usually more challenging, as training and test samples typically share similar semantics. One approach to managing covariate shift is input domain adaptation, or model recalibration, where some features of examples are normalized to deal with noise or other non-meaningful variations. For example, in a recent study, generative adversarial networks (GANs) were used to correct histopathological stain color variance in images used to detect genetic alterations in glioma.133 In contrast, semantic shifts are usually easy to detect, and data exhibiting them may render an algorithm unusable. OOD detection for semantic shifts could thus identify anomalous data inputs and flag to clinicians that a particular patient’s data are not suitable for AI support.

Conclusion
With a decade of rapid technological development and increasing examples of meaningful application behind us, the pace of innovation in healthcare AI appears unabated. The challenges healthcare services face continue, and the new world of pandemic- and climate change-induced challenges will only continue to stress global healthcare systems. Technology is never a panacea, and AI clearly brings with it many unresolved translational issues. Improving the reproducibility and quality of AI research is essential, as is the need to develop formal safety governance processes as AI is implemented widely. The past 10 years were a ‘‘dangerous decade’’ in which EHR systems were deployed en masse around the world, in the face of immature safety and governance processes and a weak understanding of the positive and negative impacts of the technology. The next decade will likely be the one in which clinical AI comes into widespread use, with much optimism for the positive effects it might bring. We must, however, not lose sight of the complexity that comes with such ubiquity.

ACKNOWLEDGMENTS
E.C. is supported by research funding from the NHMRC Centre for Research Excellence in Digital Health and an NHMRC Investigator award. S.L. is supported by an NHMRC Early Career Fellowship.



AUTHOR CONTRIBUTIONS

E.C. contributed sections on new applications for AI and their translational challenges. S.L. contributed text on technology trends. Both authors reviewed and edited the final manuscript.

DECLARATION OF INTERESTS

All authors declare no competing interests.

REFERENCES

1. Coiera, E., and Braithwaite, J. (2021). Turbulence health systems: engineering a rapidly adaptive health system for times of crisis. BMJ Health Care Inform. 28, e100363. https://doi.org/10.1136/bmjhci-2021-100363.

2. Coiera, E. (2011). Why system inertia makes health reform so difficult. BMJ 342, d3693. https://doi.org/10.1136/bmj.d3693.

3. Braithwaite, J. (2018). Changing how we think about healthcare improvement. BMJ 361, k2014.

4. O’Cathain, A., Knowles, E., Maheswaran, R., Pearson, T., Turner, J., Hirst, E., Goodacre, S., and Nicholl, J. (2014). A system-wide approach to explaining variation in potentially avoidable emergency admissions: national ecological study. BMJ Qual. Saf. 23, 47–55.

5. Byambasuren, O., Cardona, M., Bell, K., Clark, J., McLaws, M.-L., and Glasziou, P. (2020). Estimating the extent of asymptomatic COVID-19 and its potential for community transmission: systematic review and meta-analysis. Official Journal of the Association of Medical Microbiology and Infectious Disease Canada 5, 223–234.

6. Coiera, E. (2015). Guide to Health Informatics, 3rd Edition (CRC Press).

7. Braithwaite, J., Glasziou, P., and Westbrook, J. (2020). The three numbers you need to know about healthcare: the 60-30-10 challenge. BMC Med. 18, 102–108.

8. Darzi, A. (2018). Better Health and Care for All: A 10-point Plan for the 2020s. The Lord Darzi Review of Health and Care, final report (Institute for Public Policy Research).

9. Perkins, A. (2018). May to pledge millions to AI research assisting early cancer diagnosis (The Guardian). https://www.theguardian.com/technology/2018/may/20/may-to-pledge-millions-to-ai-research-assisting-early-cancer-diagnosis.

10. Sevilla, J., Heim, L., Ho, A., Besiroglu, T., Hobbhahn, M., and Villalobos, P. (2022). Compute trends across three eras of machine learning. Preprint at arXiv. https://doi.org/10.48550/arXiv.2202.05924.

11. Abràmoff, M.D., Lavin, P.T., Birch, M., Shah, N., and Folk, J.C. (2018). Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digit. Med. 1, 1–8.

12. Liu, S., Graham, S.L., Schulz, A., Kalloniatis, M., Zangerl, B., Cai, W., Gao, Y., Chua, B., Arvind, H., Grigg, J., et al. (2018). A deep learning-based algorithm identifies glaucomatous discs using monoscopic fundus photographs. Ophthalmol. Glaucoma 1, 15–22. https://doi.org/10.1016/j.ogla.2018.04.002.

13. Li, X., Zhang, S., Zhang, Q., Wei, X., Pan, Y., Zhao, J., Xin, X., Qin, C., Wang, X., Li, J., et al. (2019). Diagnosis of thyroid cancer using deep convolutional neural network models applied to sonographic images: a retrospective, multicohort, diagnostic study. Lancet Oncol. 20, 193–201.

14. Quiroz, J.C., Feng, Y.-Z., Cheng, Z.-Y., Rezazadegan, D., Chen, P.-K., Lin, Q.-T., Qian, L., Liu, X.-F., Berkovsky, S., Coiera, E., et al. (2021). Development and validation of a machine learning approach for automated severity assessment of COVID-19 based on clinical and imaging data: retrospective study. JMIR Med. Inform. 9, e24572.

15. Grewal, J.K., Tessier-Cloutier, B., Jones, M., Gakkhar, S., Ma, Y., Moore, R., Mungall, A.J., Zhao, Y., Taylor, M.D., Gelmon, K., et al. (2019). Application of a neural network whole transcriptome–based pan-cancer method for diagnosis of primary and metastatic cancers. JAMA Netw. Open 2, e192597. https://doi.org/10.1001/jamanetworkopen.2019.2597.

16. Coiera, E. (2019). Assessing technology success and failure using information value chain theory. Stud. Health Technol. Inform. 263, 35–48.

17. Fraser, N., Brierley, L., Dey, G., Polka, J.K., Pálfy, M., and Coates, J.A. (2020). Preprinting a pandemic: the role of preprints in the COVID-19 pandemic. Preprint at bioRxiv. https://doi.org/10.1101/2020.05.22.111294.

18. Syrowatka, A., Kuznetsova, M., Alsubai, A., Beckman, A.L., Bain, P.A., Craig, K.J.T., Hu, J., Jackson, G.P., Rhee, K., and Bates, D.W. (2021). Leveraging artificial intelligence for pandemic preparedness and response: a scoping review to identify key use cases. NPJ Digit. Med. 4, 96.

19. Bastian, H., Doust, J., Clarke, M., and Glasziou, P. (2019). The epidemiology of systematic review updates: a longitudinal study of updating of Cochrane reviews. Preprint at medRxiv. https://doi.org/10.1101/19014134.

20. Pham, B., Bagheri, E., Rios, P., Pourmasoumi, A., Robson, R.C., Hwee, J., Isaranuwatchai, W., Darvesh, N., Page, M.J., and Tricco, A.C. (2018). Improving the conduct of systematic reviews: a process mining perspective. J. Clin. Epidemiol. 103, 101–111. https://doi.org/10.1016/j.jclinepi.2018.06.011.

21. Glasziou, P., Altman, D.G., Bossuyt, P., Boutron, I., Clarke, M., Julious, S., Michie, S., Moher, D., and Wager, E. (2014). Reducing waste from incomplete or unusable reports of biomedical research. Lancet 383, 267–276.

22. Elliott, J.H., Turner, T., Clavisi, O., Thomas, J., Higgins, J.P.T., Mavergames, C., and Gruen, R.L. (2014). Living systematic reviews: an emerging opportunity to narrow the evidence-practice gap. PLoS Med. 11, e1001603. https://doi.org/10.1371/journal.pmed.1001603.

23. Boutron, I., Chaimani, A., Devane, D., Meerpohl, J.J., Rada, G., Hróbjartsson, A., Tovey, D., Grasselli, G., and Ravaud, P. (2020). Interventions for the treatment of COVID-19: a living network meta-analysis. Cochrane Database Syst. Rev. https://doi.org/10.1002/14651858.CD013770.

24. Millard, T., Synnot, A., Elliott, J., Green, S., McDonald, S., and Turner, T. (2019). Feasibility and acceptability of living systematic reviews: results from a mixed-methods evaluation. Syst. Rev. 8, 325.

25. Verspoor, K., Suster, S., Otmakhova, Y., Mendis, S., Zhai, Z., Fang, B., Lau, J.H., Baldwin, T., Jimeno Yepes, A., and Martinez, D. (2021). Brief description of COVID-SEE: the Scientific Evidence Explorer for COVID-19 related research (Springer International Publishing, Cham), pp. 559–564.

26. Tsafnat, G., Dunn, A., Glasziou, P., and Coiera, E. (2013). The automation of systematic reviews. BMJ 346, f139.

27. Clark, J., Glasziou, P., Del Mar, C., Bannach-Brown, A., Stehlik, P., and Scott, A.M. (2020). A full systematic review was completed in 2 weeks using automation tools: a case study. J. Clin. Epidemiol. 121, 81–90. https://doi.org/10.1016/j.jclinepi.2020.01.008.

28. Blaizot, A., Veettil, S.K., Saidoung, P., Moreno-Garcia, C.F., Wiratunga, N., Aceves-Martins, M., Lai, N.M., and Chaiyakunapruk, N. (2022). Using artificial intelligence methods for systematic review in health sciences: a systematic review. Res. Synth. Methods 13, 353–362. https://doi.org/10.1002/jrsm.1553.

29. Scells, H., and Zuccon, G. (2018). Searchrefiner: a query visualisation and understanding tool for systematic reviews. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (ACM), pp. 1939–1942.

30. Clark, J., Carter, M., Honeyman, D., Cleo, G., Auld, Y., Booth, D., Condron, P., Dalais, C., Dern, S., Linthwaite, B., et al. (2018). The Polyglot Search Translator (PST): evaluation of a tool for improving searching in systematic reviews: a randomised cross-over trial (The 25th Cochrane Colloquium).


31. Rathbone, J., Carter, M., Hoffmann, T., and Glasziou, P. (2015). Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module. Syst. Rev. 4, 6.

32. Cleo, G., Scott, A.M., Islam, F., Julien, B., and Beller, E. (2019). Usability and acceptability of four systematic review automation software packages: a mixed method design. Syst. Rev. 8, 145.

33. Guyatt, G.H., Oxman, A.D., Kunz, R., Vist, G.E., Falck-Ytter, Y., and Schünemann, H.J. (2008). What is ‘‘quality of evidence’’ and why is it important to clinicians? BMJ 336, 995–998. https://doi.org/10.1136/bmj.39490.551019.BE.

34. Marshall, I.J., Kuiper, J., and Wallace, B.C. (2015). RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. J. Am. Med. Inform. Assoc. 23, 193–201.

35. Torres Torres, M., and Adams, C.E. (2017). RevManHAL: towards automatic text generation in systematic reviews. Syst. Rev. 6, 27.

36. Tsafnat, G., Glasziou, P., Choong, M.K., Dunn, A., Galgani, F., and Coiera, E. (2014). Systematic review automation technologies. Syst. Rev. 3, 74.

37. O’Connor, A.M., Tsafnat, G., Gilbert, S.B., Thayer, K.A., Shemilt, I., Thomas, J., Glasziou, P., and Wolfe, M.S. (2019). Still moving toward automation of the systematic review process: a summary of discussions at the third meeting of the International Collaboration for Automation of Systematic Reviews (ICASR). Syst. Rev. 8, 57. https://doi.org/10.1186/s13643-019-0975-y.

38. White, H., Tendal, B., Elliott, J., Turner, T., Andrikopoulos, S., and Zoungas, S. (2020). Breathing life into Australian diabetes clinical guidelines. Med. J. Aust. 212, 250–251.e1.

39. Tsafnat, G., Glasziou, P., Karystianis, G., and Coiera, E. (2018). Automated screening of research studies for systematic reviews using study characteristics. Syst. Rev. 7, 64. https://doi.org/10.1186/s13643-018-0724-7.

40. Sim, I., Olasov, B., and Carini, S. (2003). The Trial Bank system: capturing randomized trials for evidence-based medicine. AMIA Annu. Symp. Proc., 1076.

41. Alper, B.S., Richardson, J.E., Lehmann, H.P., and Subbian, V. (2020). It is time for computable evidence synthesis: the COVID-19 Knowledge Accelerator initiative. J. Am. Med. Inform. Assoc. 27, 1338–1339. https://doi.org/10.1093/jamia/ocaa114.

42. Dunn, A.G., and Bourgeois, F.T. (2020). Is it time for computable evidence synthesis? J. Am. Med. Inform. Assoc. 27, 972–975. https://doi.org/10.1093/jamia/ocaa035.

43. Gallego, B., Dunn, A.G., and Coiera, E. (2013). Role of electronic health records in comparative effectiveness research. J. Comp. Eff. Res. 2, 529–532. https://doi.org/10.2217/cer.13.65.

44. Gallego, B., Walter, S.R., Day, R.O., Dunn, A.G., Sivaraman, V., Shah, N., Longhurst, C.A., and Coiera, E. (2015). Bringing cohort studies to the bedside: framework for a ‘‘green button’’ to support clinical decision-making. J. Comp. Eff. Res. 4, 191–197. https://doi.org/10.2217/cer.15.12.

45. Klann, J.G., and Szolovits, P. (2009). An intelligent listening framework for capturing encounter notes from a doctor-patient dialog. BMC Med. Inform. Decis. Mak. 9, S3.

46. Lin, S.Y., Shanafelt, T.D., and Asch, S.M. (2018). Reimagining clinical documentation with artificial intelligence, 5 (Elsevier), pp. 563–565.

47. Coiera, E., Kocaballi, B., Halamka, J., and Laranjo, L. (2018). The digital scribe. NPJ Digit. Med. 1, 58. https://doi.org/10.1038/s41746-018-0066-9.

48. Shanafelt, T.D., West, C.P., Sinsky, C., Trockel, M., Tutty, M., Satele, D.V., Carlasare, L.E., and Dyrbye, L.N. (2019). Changes in burnout and satisfaction with work-life integration in physicians and the general US working population between 2011 and 2017, 9 (Elsevier), pp. 1681–1694.

49. Arndt, B.G., Beasley, J.W., Watkinson, M.D., Temte, J.L., Tuan, W.-J., Sinsky, C.A., and Gilchrist, V.J. (2017). Tethered to the EHR: primary care physician workload assessment using EHR event log data and time-motion observations. Ann. Fam. Med. 15, 419–426.

50. Kroth, P.J., Morioka-Douglas, N., Veres, S., Pollock, K., Babbott, S., Poplau, S., Corrigan, K., and Linzer, M. (2018). The electronic elephant in the room: physicians and the electronic health record. JAMIA Open 1, 49–56.

51. Wachter, R., and Goldsmith, J. (2018). To combat physician burnout and improve care, fix the electronic health record. Harv. Bus. Rev.

52. Coiera, E. (2019). The price of artificial intelligence. Yearb. Med. Inform. 28, 14–15.

53. Kocaballi, A.B., Coiera, E., Tong, H.L., White, S.J., Quiroz, J.C., Rezazadegan, F., Willcock, S., and Laranjo, L. (2019). A network model of activities in primary care consultations. J. Am. Med. Inform. Assoc. 26, 1074–1082.

54. Finley, G., Edwards, E., Robinson, A., Brenndoerfer, M., Sadoughi, N., Fone, J., et al. (2018). An automated medical scribe for documenting clinical encounters. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 11–15.

55. Hodgson, T., and Coiera, E. (2016). Risks and benefits of speech recognition for clinical documentation: a systematic review. J. Am. Med. Inform. Assoc. 23, e169–e179. https://doi.org/10.1093/jamia/ocv152.

56. Quiroz, J.C., Laranjo, L., Kocaballi, A.B., Briatore, A., Berkovsky, S., Rezazadegan, D., and Coiera, E. (2020). Identifying relevant information in medical conversations to summarize a clinician-patient encounter. Health Informatics J. 26, 2906–2914.

57. Hodgson, T., Magrabi, F., and Coiera, E. (2017). Efficiency and safety of speech recognition for documentation in the electronic health record. J. Am. Med. Inform. Assoc. 24, 1127–1133. https://doi.org/10.1093/jamia/ocx073.

58. Wang, J., Lavender, M., Hoque, E., Brophy, P., and Kautz, H. (2021). A patient-centered digital scribe for automatic medical documentation. JAMIA Open 4, ooab003.

59. Park, J., Kotzias, D., Kuo, P., Logan IV, R.L., Merced, K., Singh, S., Tanana, M., Karra Taniskidou, E., Lafata, J.E., Atkins, D.C., et al. (2019). Detecting conversation topics in primary care office visits from transcripts of patient-provider interactions. J. Am. Med. Inform. Assoc. 26, 1493–1504.

60. Lacson, R.C., Barzilay, R., and Long, W.J. (2006). Automatic analysis of medical dialogue in the home hemodialysis domain: structure induction and summarization. J. Biomed. Inform. 39, 541–555.

61. Osborne, J.D., Lin, S., Zhu, L.J., and Kibbe, W.A. (2007). Mining biomedical data using MetaMap transfer (MMtx) and the unified medical language system (UMLS). In Gene Function Analysis (Springer), pp. 153–169.

62. van Buchem, M.M., Boosman, H., Bauer, M.P., Kant, I.M.J., Cammel, S.A., and Steyerberg, E.W. (2021). The digital scribe in clinical practice: a scoping review and research agenda. NPJ Digit. Med. 4, 57–58.

63. Navarro, D.F., Dras, M., and Berkovsky, S. (2022). Few-shot fine-tuning SOTA summarization models for medical dialogues. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, pp. 254–266.

64. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at arXiv. https://doi.org/10.48550/ARXIV.1810.04805.

65. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds. (Curran Associates, Inc).

66. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision (PMLR), pp. 8748–8763.

67. Dou, Y., Forbes, M., Koncel-Kedziorski, R., Smith, N., and Choi, Y. (2022). Is GPT-3 text indistinguishable from human text? Scarecrow: a framework for scrutinizing machine text (Association for Computational Linguistics). https://doi.org/10.18653/v1/2022.acl-long.501.

68. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. Preprint at arXiv. https://doi.org/10.48550/ARXIV.2204.06125.

69. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds. (Curran Associates, Inc).

70. Sevilla, J., Heim, L., Ho, A., Besiroglu, T., Hobbhahn, M., and Villalobos, P. (2022). Compute trends across three eras of machine learning. Preprint at arXiv. https://doi.org/10.48550/arXiv.2107.01294.

71. Heikkilä, M. (2022). Inside a radical new project to democratize AI. MIT Technology Review. https://www.technologyreview.com/2022/07/12/1055817/inside-a-radical-new-project-to-democratize-ai/.

72. LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521, 436–444. https://doi.org/10.1038/nature14539.

73. Sabour, S., Frosst, N., and Hinton, G.E. (2017). Dynamic routing between capsules (Curran Associates, Inc., Long Beach), pp. 3859–3869.

74. Pearl, J., and Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect (Basic Books, Inc.).

75. Marcus, G. (2018). Deep learning: a critical appraisal. Preprint at arXiv. https://doi.org/10.48550/ARXIV.1801.00631.

76. Kahneman, D. (2011). Thinking, Fast and Slow (Farrar, Straus and Giroux).

77. Piloto, L.S., Weinstein, A., Battaglia, P., and Botvinick, M. (2022). Intuitive physics learning in a deep-learning model inspired by developmental psychology. Nat. Hum. Behav. 6, 1257–1267. https://doi.org/10.1038/s41562-022-01394-8.

78. Hayes, P.J. (1979). The naive physics manifesto. In Expert Systems in the Microelectronic Age (Edinburgh University Press).

79. Hayes, P.J. (1985). The second naive physics manifesto. In Formal Theories of the Common-Sense World, J.R. Hobbs and R.C. Moore, eds. (Norwood).

80. Bobrow, D.G. (1984). Qualitative reasoning about physical systems: an introduction. Artif. Intell. 24, 1–5.

81. Marcus, G.F. (2003). The Algebraic Mind: Integrating Connectionism and Cognitive Science (MIT Press).

82. Dash, T., Chitlangia, S., Ahuja, A., and Srinivasan, A. (2022). A review of some techniques for inclusion of domain-knowledge into deep neural networks. Sci. Rep. 12, 1040. https://doi.org/10.1038/s41598-021-04590-0.

83. Li, Y., Sperrin, M., Ashcroft, D.M., and van Staa, T.P. (2020). Consistency of variety of machine learning and statistical models in predicting clinical risks of individual patients: longitudinal cohort study using cardiovascular disease as exemplar. BMJ 371, m3919. https://doi.org/10.1136/bmj.m3919.

84. Chen, T., and Guestrin, C. (2016). XGBoost: a scalable tree boosting system. In KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.

85. Yusuf, M., Atal, I., Li, J., Smith, P., Ravaud, P., Fergie, M., Callaghan, M., and Selfe, J. (2020). Reporting quality of studies using machine learning models for medical diagnosis: a systematic review. BMJ Open 10, e034568. https://doi.org/10.1136/bmjopen-2019-034568.

86. Laranjo, L., Dunn, A.G., Tong, H.L., Kocaballi, A.B., Chen, J., Bashir, R., Surian, D., Gallego, B., Magrabi, F., Lau, A.Y.S., and Coiera, E. (2018). Conversational agents in healthcare: a systematic review. J. Am. Med. Inform. Assoc. 25, 1248–1258.

87. Coiera, E. (2019). The last mile: where artificial intelligence meets reality. J. Med. Internet Res. 21, e16323.

88. Topol, E.J. (2019). High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25, 44–56.

89. Coiera, E., Ammenwerth, E., Georgiou, A., and Magrabi, F. (2018). Does health informatics have a replication crisis? J. Am. Med. Inform. Assoc. 25, 963–968. https://doi.org/10.1093/jamia/ocy028.

90. Akbar, S., Coiera, E., and Magrabi, F. (2020). Safety concerns with consumer-facing mobile health applications and their consequences: a scoping review. J. Am. Med. Inform. Assoc. 27, 330–340. https://doi.org/10.1093/jamia/ocz175.

91. Cabitza, F., Rasoini, R., and Gensini, G.F. (2017). Unintended consequences of machine learning in medicine. JAMA 318, 517–518.

92. Gordon, M., Viganola, D., Bishop, M., Chen, Y., Dreber, A., Goldfedder, B., Holzmeister, F., Johannesson, M., Liu, Y., Twardy, C., et al. (2020). Are replication rates the same across academic fields? Community forecasts from the DARPA SCORE programme. R. Soc. Open Sci. 7, 200566. https://doi.org/10.1098/rsos.200566.

93. Mathieu, S., Boutron, I., Moher, D., Altman, D.G., and Ravaud, P. (2009). Comparison of registered and published primary outcomes in randomized controlled trials. JAMA 302, 977–984.

94. Simonsohn, U., Simmons, J.P., and Nelson, L.D. (2015). Better P-curves: making P-curve analysis more robust to errors, fraud, and ambitious P-hacking, a reply to Ulrich and Miller (2015). J. Exp. Psychol. Gen. 144, 1146–1152. https://doi.org/10.1037/xge0000104.

95. Chalmers, I. (1990). Underreporting research is scientific misconduct. JAMA 263, 1405–1408. https://doi.org/10.1001/jama.1990.03440100121018.

96. Macleod, M.R., Lawson McLean, A., Kyriakopoulou, A., Serghiou, S., de Wilde, A., Sherratt, N., Hirst, T., Hemblade, R., Bahor, Z., Nunes-Fonseca, C., et al. (2015). Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biol. 13, e1002273.

97. Curtis, M.J., and Abernethy, D.R. (2015). Replication – why we need to publish our findings. Pharmacol. Res. Perspect. 3, e00164. https://doi.org/10.1002/prp2.164.

98. Liu, J.L.Y., and Wyatt, J.C. (2011). The case for randomized controlled trials to assess the impact of clinical information systems. J. Am. Med. Inform. Assoc. 18, 173–180. https://doi.org/10.1136/jamia.2010.010306.

99. Yan, L., Zhang, H.-T., Goncalves, J., Xiao, Y., Wang, M., Guo, Y., Sun, C., Tang, X., Jing, L., Zhang, M., et al. (2020). An interpretable mortality prediction model for COVID-19 patients. Nat. Mach. Intell. 2, 283–288.

100. Barish, M., Bolourani, S., Lau, L.F., Shah, S., and Zanos, T.P. (2020). External validation demonstrates limited clinical utility of the interpretable mortality prediction model for patients with COVID-19. Nat. Mach. Intell. 3, 25–27. https://doi.org/10.1038/s42256-020-00254-2.

101. Quanjel, M.J.R., van Holten, T.C., Gunst-van der Vliet, P.C., Wielaard, J., Karakaya, B., Söhne, M., Moeniralam, H.S., and Grutters, J.C. (2020). Replication of a mortality prediction model in Dutch patients with COVID-19. Nat. Mach. Intell. 3, 23–24.

102. Goncalves, J., Yan, L., Zhang, H.-T., Xiao, Y., Wang, M., Guo, Y., Sun, C., Tang, X., Cao, Z., Li, S., et al. (2020). Li Yan et al. reply. Nat. Mach. Intell. https://doi.org/10.1038/s42256-020-00251-5.

103. Dupuis, C., De Montmollin, E., Neuville, M., Mourvillier, B., Ruckly, S., and Timsit, J.F. (2020). Limited applicability of a COVID-19 specific mortality prediction rule to the intensive care setting. Nat. Mach. Intell. 3, 20–22.

104. Coiera, E., and Tong, H.L. (2021). Replication studies in the clinical decision support literature – frequency, fidelity and impact. J. Am. Med. Inform. Assoc. 28, 1815–1825.

10 Cell Reports Medicine 3, 100860, December 20, 2022



105. Han, Y.Y., Carcillo, J.A., Venkataraman, S.T., Clark, R.S.B., Watson, R.S., Nguyen, T.C., Bayir, H., and Orr, R.A. (2005). Unexpected increased mortality after implementation of a commercially sold computerized physician order entry system. Pediatrics 116, 1506–1512.
106. Haibe-Kains, B., Adam, G.A., Hosny, A., Khodakarami, F., Massive Analysis Quality Control MAQC Society Board of Directors; Waldron, L., Wang, B., McIntosh, C., Goldenberg, A., Kundaje, A., et al. (2020). Transparency and reproducibility in artificial intelligence. Nature 586, E14–E16. https://doi.org/10.1038/s41586-020-2766-y.
107. Rivera, S.C., Liu, X., Chan, A.-W., Denniston, A.K., and Calvert, M.J. (2020). Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. BMJ 370, m3210.
108. Liu, X., Cruz Rivera, S., Moher, D., Calvert, M.J., and Denniston, A.K.; SPIRIT-AI and CONSORT-AI Working Group (2020). Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Lancet Digit. Health 2, e537–e548. https://doi.org/10.1016/s2589-7500(20)30218-1.
109. Bengtsson, E., and Malm, P. (2014). Screening for cervical cancer using automated analysis of PAP-smears. Comput. Math. Methods Med. 2014, 842037. https://doi.org/10.1155/2014/842037.
110. Magrabi, F., Ong, M.-S., Runciman, W., and Coiera, E. (2010). An analysis of computer-related patient safety incidents to inform the development of a classification. J. Am. Med. Inform. Assoc. 17, 663–670. https://doi.org/10.1136/jamia.2009.002444.
111. Magrabi, F., Ong, M.-S., Runciman, W., and Coiera, E. (2011). Patient safety problems associated with healthcare information technology: an analysis of adverse events reported to the US Food and Drug Administration. In AMIA Annual Symposium Proceedings (American Medical Informatics Association).
112. Magrabi, F., Ong, M.S., Runciman, W., and Coiera, E. (2012). Using FDA reports to inform a classification for health information technology safety problems. J. Am. Med. Inform. Assoc. 19, 45–53.
113. Chinese State Council (2017). New Generation of Artificial Intelligence Development Plan. https://flia.org/noticestate-council-issuing-new-generation-artificial-intelligence-developmentplan/.
114. European Commission (2021). Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts {SEC(2021) 167 final} - {SWD(2021) 84 final} - {SWD(2021) 85 final}. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A52021PC0206.
115. The U.S. Food and Drug Administration (FDA). Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-aiml-enabled-medical-devices.
116. Challen, R., Denny, J., Pitt, M., Gompels, L., Edwards, T., and Tsaneva-Atanasova, K. (2019). Artificial intelligence, bias and clinical safety. BMJ Qual. Saf. 28, 231–237.
117. Lyell, D., and Coiera, E. (2017). Automation bias and verification complexity: a systematic review. J. Am. Med. Inform. Assoc. 24, 423–431. https://doi.org/10.1093/jamia/ocw105.
118. Sujan, M., Furniss, D., Grundy, K., Grundy, H., Nelson, D., Elliott, M., White, S., Habli, I., and Reynolds, N. (2019). Human factors challenges for the safe use of artificial intelligence in patient care. BMJ Health Care Inform. 26, e100081.
119. Scott, I.A., Cook, D., Coiera, E.W., and Richards, B. (2019). Machine learning in clinical practice: prospects and pitfalls. Med. J. Aust. 211, 203–205.e1.
120. Tangari, G., Ikram, M., Ijaz, K., Kaafar, M.A., and Berkovsky, S. (2021). Mobile health and privacy: cross sectional study. BMJ 373, n1248. https://doi.org/10.1136/bmj.n1248.
121. Magrabi, F., Habli, I., Sujan, M., Wong, D., Thimbleby, H., Baker, M., and Coiera, E. (2019). Why is it so difficult to govern mobile apps in healthcare? BMJ Health Care Inform. 26, e100006. https://doi.org/10.1136/bmjhci-2019-100006.
122. Kocaballi, A.B., Quiroz, J.C., Rezazadegan, D., Berkovsky, S., Magrabi, F., Coiera, E., and Laranjo, L. (2020). Responses of conversational agents to health and lifestyle prompts: investigation of appropriateness and presentation structures. J. Med. Internet Res. 22, e15823. https://doi.org/10.2196/15823.
123. Fraser, H., Coiera, E., and Wong, D. (2018). Safety of patient-facing digital symptom checkers. Lancet 392, 2263–2264. https://doi.org/10.1016/S0140-6736(18)32819-8.
124. Panch, T., Mattie, H., and Celi, L.A. (2019). The "inconvenient truth" about AI in healthcare. NPJ Digit. Med. 2, 77–83.
125. Chen, J.H., and Asch, S.M. (2017). Machine learning and prediction in medicine - beyond the peak of inflated expectations. N. Engl. J. Med. 376, 2507–2509. https://doi.org/10.1056/NEJMp1702071.
126. Finlayson, S.G., Subbaswamy, A., Singh, K., Bowers, J., Kupke, A., Zittrain, J., Kohane, I.S., and Saria, S. (2021). The clinician and dataset shift in artificial intelligence. N. Engl. J. Med. 385, 283–286. https://doi.org/10.1056/NEJMc2104626.
127. Coiera, E. (2016). Chapter 12: implementation. In Guide to Health Informatics (CRC Press), pp. 173–194.
128. Hasson, H. (2010). Systematic evaluation of implementation fidelity of complex interventions in health and social care. Implement. Sci. 5, 67.
129. Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D., Liu, L., Ghavamzadeh, M., Fieguth, P., Cao, X., Khosravi, A., Acharya, U.R., et al. (2021). A review of uncertainty quantification in deep learning: techniques, applications and challenges. Inf. Fusion 76, 243–297. https://doi.org/10.1016/j.inffus.2021.05.008.
130. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K.Q. (2017). On Calibration of Modern Neural Networks (PMLR), pp. 1321–1330.
131. Wang, L., Ju, L., Zhang, D., Wang, X., He, W., Huang, Y., Yang, Z., Yao, X., Zhao, X., Ye, X., and Ge, Z. (2021). Medical Matting: A New Perspective on Medical Segmentation with Uncertainty (Springer International Publishing), pp. 573–583.
132. Raghuram, J., Chandrasekaran, V., Jha, S., and Banerjee, S. (2021). A General Framework for Detecting Anomalous Inputs to DNN Classifiers (PMLR), pp. 8764–8775.
133. Liu, S., Shah, Z., Sav, A., Russo, C., Berkovsky, S., Qian, Y., Coiera, E., and Di Ieva, A. (2020). Isocitrate dehydrogenase (IDH) status prediction in histopathology images of gliomas using deep learning. Sci. Rep. 10, 7733. https://doi.org/10.1038/s41598-020-64588-y.

