
GRADE Handbook


Introduction to GRADE Handbook

Handbook for grading the quality of evidence and the strength of recommendations using the GRADE approach. Updated October 2013.

Editors: Holger Schünemann (schuneh@mcmaster.ca), Jan Brożek (brozekj@mcmaster.ca), Gordon Guyatt (guyatt@mcmaster.ca), and Andrew Oxman (oxman@online.no)

About the Handbook

The GRADE handbook describes the process of rating the quality of the best available evidence and developing health care recommendations following the approach proposed by the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) Working Group (www.gradeworkinggroup.org). The Working Group is a collaboration of health care methodologists, guideline developers, clinicians, health services researchers, health economists, public health officers and other interested members. Beginning in the year 2000, the working group developed, evaluated and implemented a common, transparent and sensible approach to grading the quality of evidence and the strength of recommendations in health care. The group interacts through meetings and by producing methodological guidance and developing evidence syntheses and guidelines. Members collaborate on research projects, such as the DECIDE project (www.decide-collaboration.eu), with other members and other scientists or organizations (e.g. www.rarebestpractices.eu). Membership is open and free. See www.gradeworkinggroup.org and the Chapter The GRADE working group in this handbook for more information about the Working Group and a list of the organizations that have endorsed and adopted the GRADE approach.

The handbook is intended as a guide for those responsible for using the GRADE approach to produce GRADE's output, which includes evidence summaries and graded recommendations. Target users of the handbook are systematic review and health technology assessment (HTA) authors, guideline panelists and methodologists who provide support for guideline panels. While many of the examples offered in the handbook are clinical, we also aimed to include a broader range of examples from public health and health policy. Finally, specific sections address interpreting recommendations for users of recommendations.

Using the Handbook

The handbook is divided into chapters that correspond to the steps of applying the GRADE approach. The Chapter Overview of the GRADE approach provides a brief overview of guideline development processes and where the GRADE approach fits in. The Chapters Framing the health care question and Selecting and rating the importance of outcomes provide guidance on formulating health care questions for guidelines and systematic reviews and on rating the importance of outcomes in guidelines. The Chapter Summarizing the evidence covers evidence summaries produced using the GRADE software. GRADE acknowledges that alternative terms or expressions for what GRADE has called quality of evidence are often appropriate; therefore, we use the phrases quality of evidence, strength of evidence, certainty in evidence and confidence in estimates interchangeably. When GRADE uses the phrase “confidence in estimates” it does not refer to statistical confidence intervals, although the width of such an interval is part of the considerations for judging the GRADE criterion imprecision. When GRADE refers to confidence in the estimates it refers to how certain one can be that the effect estimates are adequate to support a recommendation (in the context of guideline development) or that the effect estimate is close to the true effect (in the context of evidence synthesis). The Chapter Quality of evidence provides instructions for rating the evidence and addresses the five factors outlined in the GRADE approach that may result in rating down the quality of evidence and the three factors that may increase it. The Chapter Going from evidence to recommendations deals with moving from evidence to recommendations in guidelines and with classifying recommendations as strong or weak according to the criteria outlined in the GRADE evidence to recommendation frameworks. The Chapter The GRADE approach for diagnostic tests and strategies addresses how to use the GRADE approach specifically for questions about diagnostic tests and strategies. Finally, the Chapter Criteria for determining whether the GRADE approach was used provides the suggested criteria that should be met in order to state that the GRADE approach was used.

Throughout the handbook certain terms and concepts are hyperlinked to definitions and to the specific sections elaborating on those concepts. The glossary of terms and concepts is provided in the Chapter Glossary of terms and concepts. Where applicable, the handbook highlights guidance that is specific to guideline developers or to systematic review authors, as well as important notes pertaining to specific topics. HTA practitioners, depending on their mandate, can decide which approach is more suitable for their goals. Furthermore, examples demonstrating the application of the concepts are provided for each topic. The examples are cited so that readers can learn more about them from the source documents.

Updating the Handbook

The handbook is updated to reflect advances in the GRADE approach and based on feedback from handbook users. It includes information from the published documents about the GRADE approach, which are listed in the Chapter Articles about GRADE, and links to resources in the Chapter Additional resources.

We encourage users of the handbook to provide feedback and corrections to the handbook editors via email.

Accompanying software: GRADEpro and the Guideline Development Tool (GDT)

This handbook is intended to accompany the GRADE profiler (GRADEpro) – software to facilitate development of evidence summaries and health care recommendations using the GRADE approach – integrated in the Guideline Development Tool (GDT) “Das tool”. Please refer to www.guidelinedevelopment.org for more information.

Reproduction and translation

Permission to reproduce or translate the GRADE handbook for grading the quality of evidence and the strength of recommendations should be sought from the editors.

Acknowledgements

We would particularly like to acknowledge the contributions of Roman Jaeschke, Robin Harbour and Elie Akl to earlier versions of the handbook.
Major Contributors
Handbook Editors

Holger Schünemann, McMaster University, Hamilton, Canada

Jan Brożek, McMaster University, Hamilton, Canada

Gordon Guyatt, McMaster University, Hamilton, Canada

Andrew Oxman, Norwegian Knowledge Centre for the Health Services, Oslo, Norway

Chapter Authors and Editors

The following authors have made major contributions to the current version of the handbook: Elie Akl, Reem Mustafa, Nancy Santesso, and Wojtek Wiercioch. Many other members of the GRADE Working Group have also contributed to this handbook by providing feedback and through discussion.
1. Overview of the GRADE Approach
The GRADE approach is a system for rating the quality of a body of evidence in systematic reviews and other evidence syntheses, such as health technology assessments and guidelines, and for grading recommendations in health care. GRADE offers a transparent and structured process for developing and presenting evidence summaries and for carrying out the steps involved in developing recommendations. It can be used to develop clinical practice guidelines (CPG) and other health care recommendations (e.g. in public health, health policy and systems, and coverage decisions).

Figure 1 shows the steps and involvement in a guideline development process (Schünemann H et al., CMAJ, 2013).
Steps and processes are interrelated and not necessarily sequential. The guideline panel and supporting groups (e.g. methodologist, health economist, systematic review team, secretariat for administrative support) work collaboratively, informed through consumer and stakeholder involvement. They typically report to an oversight committee or board overseeing the process. For example, while deciding how to involve stakeholders early for priority setting and topic selection, the guideline group must also consider how developing formal relationships with the stakeholders will enable effective dissemination and implementation to support uptake of the guideline. Furthermore, considerations for organization, planning and training encompass the entire guideline development project, and steps such as documenting the methodology used and decisions made, as well as considering conflict of interest, occur throughout the entire process.

The system is designed for reviews and guidelines that examine alternative management strategies or interventions, which may include no intervention or current best management as well as multiple comparisons. GRADE has considered a wide range of clinical questions, including diagnosis, screening, prevention, and therapy. Guidance specific to applying the GRADE approach to questions about diagnosis is offered in the Chapter The GRADE approach for diagnostic tests and strategies.

GRADE provides a framework for specifying health care questions, choosing outcomes of interest and rating their importance, evaluating the available evidence, and bringing together the evidence with considerations of the values and preferences of patients and society to arrive at recommendations. Furthermore, the system provides clinicians and patients with a guide to using those recommendations in clinical practice and policy makers with a guide to their use in health policy.

Application of the GRADE approach begins by defining the health care question in terms of the population of interest, the alternative management strategies (intervention and comparator), and all patient-important outcomes. As a specific step for guideline developers, the outcomes are rated according to their importance, as either critical or important but not critical. A systematic search is performed to identify all relevant studies, and data from the individual included studies are used to generate an estimate of the effect for each patient-important outcome as well as a measure of the uncertainty associated with that estimate (typically a confidence interval). The quality of evidence for each outcome across all the studies (i.e. the body of evidence for an outcome) is rated according to the factors outlined in the GRADE approach, including five factors that may lead to rating down the quality of evidence and three factors that may lead to rating up. Authors of systematic reviews complete the process up to this step, while guideline developers continue with the subsequent steps. Health care related tests and strategies are considered interventions (or comparators), as utilizing a test inevitably has consequences that can be considered outcomes (see Chapter The GRADE approach for diagnostic tests and strategies).
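To make the per-outcome rating concrete, the following minimal sketch (not taken from the handbook or from the GRADE software) tallies a quality rating from the factors just described. It assumes the usual GRADE conventions that a body of randomized trials starts at high and a body of observational studies starts at low, that each level of serious concern among the five downgrading factors subtracts one level, and that each of the three upgrading factors adds one level; the function name and inputs are illustrative only.

    LEVELS = ["very low", "low", "moderate", "high"]

    def rate_outcome(study_design, down_levels, up_levels):
        # Starting level depends on study design; concerns rate down, the three
        # upgrading factors rate up; the result is clamped to the defined levels.
        start = LEVELS.index("high") if study_design == "randomized trials" else LEVELS.index("low")
        final = max(0, min(len(LEVELS) - 1, start - down_levels + up_levels))
        return LEVELS[final]

    # Randomized trials with serious risk of bias and serious imprecision: down two levels.
    print(rate_outcome("randomized trials", down_levels=2, up_levels=0))       # -> low
    # Observational studies with a large effect and no other concerns: up one level.
    print(rate_outcome("observational studies", down_levels=0, up_levels=1))   # -> moderate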
Next, guideline developers review all the information from the systematic search and, if needed, reassess and make a final decision about which outcomes are critical and which are important given the recommendations that they aim to formulate. The overall quality of evidence across all outcomes is assigned based on this assessment. Guideline developers then formulate the recommendation(s) and consider the direction (for or against) and grade the strength (strong or weak) of the recommendation(s) based on the criteria outlined in the GRADE approach. Figure 2 provides a schematic view of the GRADE approach.
Figure 2: A schematic view of the GRADE approach for synthesizing evidence and developing recommendations. The upper half describes steps in the process common to systematic reviews and making health care recommendations, and the lower half describes steps that are specific to making recommendations (based on the GRADE meeting, Edinburgh 2009).
For authors of systematic reviews:

Systematic reviews should provide a comprehensive summary of the evidence, but they should typically not include health care recommendations. Therefore, use of the GRADE approach by systematic review authors terminates after rating the quality of evidence for outcomes and clearly presenting the results in an evidence table, i.e. a GRADE Evidence Profile or a Summary of Findings table. Those developing health care recommendations, e.g. a guideline panel, will have to complete the subsequent steps.

The following chapters will provide detailed guidance about the factors that influence the quality of evidence and strength of recommendations, as well as instructions and examples for each step in the application of the GRADE approach. A detailed description of the GRADE approach for authors of systematic reviews and those making recommendations in health care is also available in a series of articles published in the Journal of Clinical Epidemiology. An additional overview of the GRADE approach as well as of quality of evidence and strength of recommendations in guidelines is available in a previously published six-part series in the British Medical Journal. Briefer overviews have appeared in other journals, primarily with examples for relevant specialties. The articles are listed in Chapter 10. This handbook, however, as a resource that exists primarily in electronic format, will include GRADE’s innovations and be kept up to date as journal publications become outdated.
1.1 Purpose and advantages of the GRADE approach

Clinical practice guidelines offer recommendations for the management of typical patients. These management decisions involve balancing the desirable and undesirable consequences of a given course of action. In order to help clinicians make evidence-based medical decisions, guideline developers often grade the strength of their recommendations and rate the quality of the evidence informing those recommendations.

Prior grading systems had many disadvantages, including the lack of separation between the quality of evidence and the strength of recommendation, the lack of transparency about judgments, and the lack of explicit acknowledgment of values and preferences underlying the recommendations. In addition, the existence of many, often scientifically outdated, grading systems has created confusion among guideline developers and end users.

The GRADE approach was developed to overcome these shortcomings of previous grading systems. Advantages of GRADE over other grading systems include:

● Developed by a widely representative group of international guideline developers
● Clear separation between judging confidence in the effect estimates and strength of recommendations
● Explicit evaluation of the importance of outcomes of alternative management strategies
● Explicit, comprehensive criteria for downgrading and upgrading quality of evidence ratings
● Transparent process of moving from evidence to recommendations
● Explicit acknowledgment of values and preferences
● Clear, pragmatic interpretation of strong versus weak recommendations for clinicians, patients, and policy makers
● Useful for systematic reviews and health technology assessments, as well as guidelines

Note:

Although the GRADE approach makes judgments about quality of evidence (that is, confidence in the effect estimates) and strength of recommendations in a systematic and transparent manner, it does not eliminate the need for judgments. Thus, applying the GRADE approach should not be interpreted as minimizing the importance of judgment or as suggesting that quality can always be objectively determined. Although evidence suggests that these judgments, after appropriate methodological training, lead to reliable assessment of the quality of evidence (Mustafa R et al., Journal of Clinical Epidemiology, 2013), there will be cases in which those making judgments have legitimate disagreement about the interpretation of evidence. GRADE provides a framework that guides users through the critical components of the assessment in a structured way. By making the judgments explicit rather than implicit, it ensures transparency and a clear basis for discussion.
1.2 Separation of confidence in effect estimates from strength of recommendations

A number of criteria should be used when moving from evidence to recommendations (see Chapter Going from evidence to recommendations). During that process, separate judgements are required for each of these criteria. In particular, separating judgements about the confidence in estimates or quality of evidence from judgements about the strength of recommendations is important, as high confidence in effect estimates does not necessarily imply strong recommendations, and strong recommendations can result from low or even very low confidence in effect estimates (insert link to paradigmatic situations for when strong recommendations are justified in the context of low or very low confidence in effect estimates). Grading systems that fail to separate these judgements create confusion; this separation is a defining feature of GRADE.

The GRADE approach stresses the necessity of considering the balance between desirable and undesirable consequences and of acknowledging other factors, for example the values and preferences underlying the recommendations. As patients with varying values and preferences for outcomes and interventions will make different choices, guideline panels facing important variability in patient values and preferences are likely to offer a weak recommendation despite high quality evidence. Considering the importance of outcomes and interventions, values, preferences and utilities includes integrating into the process of developing a recommendation how those affected by its recommendations assess the possible consequences. These include patient and carer knowledge, attitudes, expectations, moral and ethical values, and beliefs; patient goals for life and health; prior experience with the intervention and the condition; symptom experience (for example breathlessness, pain, dyspnoea, weight loss); preferences for and importance of desirable and undesirable health outcomes; perceived impact of the condition or interventions on quality of life, well-being or satisfaction; interactions between the work of implementing the intervention, the intervention itself, and other contexts the patient may be experiencing; preferences for alternative courses of action; and preferences relating to communication content and styles, information and involvement in decision-making and care. This can be related to what the economic literature considers utilities. An intervention itself can be considered a consequence of a recommendation (e.g. the burden of taking a medication or undergoing surgery), and a level of importance or value is associated with that. Both the direction and the strength of a recommendation may be modified after taking into account the implications for resource utilization, equity, acceptability and feasibility of alternative management strategies.

Therefore, unlike many other grading systems, the GRADE approach emphasizes that weak (also known as conditional) recommendations in the face of high confidence in effect estimates of an intervention are common, because these factors other than the quality of evidence influence the strength of a recommendation. For the same reason it allows for strong recommendations on the basis of low or very low confidence in effect estimates.
Example 1: Weak recommendation based on high quality evidence

Several RCTs compared the use of combination chemotherapy and radiotherapy versus radiotherapy alone in unresectable, locally advanced non-small cell lung cancer (Stage IIIA). The overall quality of the body of evidence was rated high. Compared with radiotherapy alone, the combination of chemotherapy and radiotherapy reduces the risk of death, corresponding to a mean gain in life expectancy of a few months, but increases harm and burden related to chemotherapy. Thus, considering the values and preferences patients would place on the small survival benefit in view of the harms and burdens, guideline panels may offer a weak recommendation despite the high quality of the available evidence (Schünemann et al. AJRCCM 2006).

Example 2: Weak recommendation based on high quality evidence

Patients who experience a first deep venous thrombosis with no obvious provoking factor must, after the first months of anticoagulation, decide whether to continue taking the anticoagulant warfarin long term. High quality randomized controlled trials show that continuing warfarin will decrease the risk of recurrent thrombosis but at the cost of increased risk of bleeding and inconvenience. Because patients with varying values and preferences will make different choices, guideline panels addressing whether patients should continue or terminate warfarin should, despite the high quality evidence, offer a weak recommendation.

Example 3: Strong recommendation based on low or very low quality evidence

The principle of administering appropriate antibiotics rapidly in the setting of severe infection or sepsis has not been tested against its alternative of delivering antibiotics without urgency in randomized controlled trials. Yet, guideline panels would be very likely to make a strong recommendation for the rapid use of antibiotics in this setting on the basis of available observational studies rated as low quality evidence, because the benefits of antibiotic therapy clearly outweigh the downsides in most patients independent of the quality assessment (Schünemann et al. AJRCCM 2006).
1.3 Special challenges in applying the GRADE approach

Those applying GRADE to questions about diagnostic tests, public health or health systems will face some special challenges. This handbook will address these challenges and will undergo revisions when new developments prompt the GRADE working group to agree on changes to the approach. Moreover, there will be methodological advances and refinements in the future, not only of innovations but also of the established concepts.
1.4 Modifications to the GRADE approach
GRADE recommends against making modifications to the approach because the elements of the GRADE process are interlinked, because modifications may confuse some users of evidence summaries and guidelines, and because such changes compromise the goal of a single system with which clinicians, policy makers, and patients can become familiar. However, the literature on different approaches to applying GRADE is growing and is useful for determining when pragmatism is appropriate.
2. Framing the health care question
A guideline panel should define the scope of the guideline and the planned recommendations. Each recommendation should answer a focused and sensible health care question that leads to an action. Similarly, authors of systematic reviews should formulate focused health care question(s) that the review will answer. A systematic review may answer one or more health care questions, depending on the scope of the review.

The PICO framework presents a well accepted methodology for framing health care questions. It mandates carefully specifying four components:

● Patient: the patients or population to whom the recommendations are meant to apply
● Intervention: the therapeutic, diagnostic, or other intervention under investigation (e.g. the experimental intervention, or in observational studies the exposure factor)
● Comparison: the alternative intervention; the intervention in the control group
● Outcome: the outcome(s) of interest

A number of derivatives of this approach exist, for example adding a T for time or an S for study design. These modifications are neither helpful nor necessary. The issue of time (e.g. duration of treatment, when an outcome should be assessed, etc.) is covered by specifying the intervention(s) and outcome(s) appropriately (e.g. mortality at one year). In addition, the studies, and therefore the study design, that inform an answer are often not known when the question is asked. That is, observational studies may inform a question when randomized trials are not available or not associated with high confidence in the estimates. Thus, it is usually not sensible to define a study design beforehand. A guideline question often involves another specification: the setting in which the guideline will be implemented. For instance, guidelines intended for resource-rich environments will often be inapplicable to resource-poor environments. Even the setting, however, can be defined as part of the definition of the population (e.g. women in low income countries or men with myocardial infarction in a primary or rural health care setting).

Errors that are frequently made in formulating the health care question include failure to include all patient-important outcomes (e.g. adverse effects or toxicity), as well as failure to fully consider all relevant alternatives (this may be particularly problematic when guidelines target a global audience).
2.1 Defining the patient population and intervention
The most challenging decision in framing the question is how broadly the patients and intervention should be defined (see Example 1). For the patients and interventions defined, the underlying biology should suggest that, across the range of patients and interventions, it is plausible that the magnitude of effect on the key outcomes is more or less the same. If that is not the case, the review or guideline will generate misleading estimates for at least some subpopulations of patients and interventions. For instance, based on the information presented in Example 1, if antiplatelet agents differ in effectiveness in those with peripheral vascular disease vs. those with myocardial infarction, a single estimate across the range of patients and interventions will not well serve the decision-making needs of patients and clinicians. These subpopulations should, therefore, be defined separately.

Often, systematic reviews deal with the question of what breadth of population or intervention to choose by starting with a broad question but including a priori specification of subgroup effects that may explain any heterogeneity they find. The a priori hypotheses may relate to differences in patients, interventions, the choice of comparator, the outcome(s), or factors related to bias (e.g. high risk of bias studies yield different effects than low risk of bias studies).
Example 1: Deciding how broadly to define the patients and intervention

Addressing the effects of antiplatelet agents on vascular disease, one might include only patients with transient ischemic attacks, those with ischemic attacks and strokes, or those with any vascular disease (cerebro-, cardio-, or peripheral vascular disease). The intervention might be a relatively narrow range of doses of aspirin, all doses of aspirin, or all antiplatelet agents.

Because the relative risk associated with an intervention vs. a specific comparator is usually similar across a wide variety of baseline risks, it is usually appropriate for systematic reviews to generate single pooled estimates (i.e. meta-analysis) of relative effects across a wide range of patient subgroups. Recommendations, however, may differ across subgroups of patients at different baseline risk of an outcome, despite there being a single relative risk that applies to all of them. For instance, the case for warfarin therapy, associated with both inconvenience and a higher risk of serious bleeding, is much stronger in atrial fibrillation patients at substantial vs. minimal risk of stroke. Thus, guideline panels must often define separate questions (and produce separate evidence summaries) for high- and low-risk patients, and for patients in whom the quality of evidence differs.
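The arithmetic behind this point can be made explicit with a small sketch using hypothetical numbers (the relative risk and baseline risks below are illustrative, not estimates from the handbook): applying the same relative risk to different baseline risks yields very different absolute effects, which is why recommendations may differ for high- and low-risk patients.

    def absolute_risk_reduction(baseline_risk, relative_risk):
        # Risk difference implied by applying a constant relative risk to a baseline risk.
        return baseline_risk * (1.0 - relative_risk)

    relative_risk = 0.33  # hypothetical relative risk of stroke with warfarin vs. no warfarin
    for baseline in (0.06, 0.01):  # hypothetical annual stroke risks: higher- and lower-risk atrial fibrillation
        arr = absolute_risk_reduction(baseline, relative_risk)
        print(f"baseline {baseline:.0%}: absolute reduction {arr:.1%}, number needed to treat about {round(1 / arr)}")

With these illustrative figures, the higher-risk group gains an absolute reduction of about 4 percentage points (roughly 25 patients treated to prevent one stroke), while the lower-risk group gains less than 1 percentage point (roughly 150 treated per stroke prevented), even though the relative risk is identical.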
2.2 Dealing with multiple comparators

Another important challenge arises when there are multiple comparators to an intervention. Clarity in the choice of comparator makes for interpretable guidelines, and lack of clarity can cause confusion. Sometimes the comparator is obvious, but when it is not, guideline panels should specify the comparator explicitly. In particular, when multiple agents are involved, they should specify whether the recommendation suggests that all agents are equally recommended or that some agents are recommended over others (see Example 1).

Example 1: Clarity with multiple comparators

When making recommendations for the use of anticoagulants in patients with non-ST elevation acute coronary syndromes receiving conservative (non-invasive) management, fondaparinux, heparin, and enoxaparin may be the agents being considered. Moreover, the estimate of effect for each agent may come from evidence of varying quality (e.g. high quality evidence for heparin, low quality evidence for fondaparinux). Therefore, it must be made clear whether the recommendations formulated by the guideline panel will be for use of these agents vs. not using any anticoagulants, or whether they will also indicate a preference for one agent over the others or a gradient of preference.
2.3 Other considerations

GRADE has begun to tackle the question of determining the confidence in estimates for prognosis. Prognostic estimates are often important for guideline development. For example, addressing interventions that may influence the outcome of influenza or multiple sclerosis will require establishing the natural history of the conditions. This will involve specifying the population (influenza or new-onset multiple sclerosis) and the outcome (mortality or relapse rate and progression). Such questions of prognosis may be refined to include multiple predictors, such as age, gender, or severity. The answers to these questions will be an important background for formulating recommendations and interpreting the evidence about the effects of treatments. In particular, guideline developers need to decide whether the prognosis of patients in the community is similar to that of those studied in the trials and whether there are important prognostic subgroups that they should consider in making recommendations. Judgments about whether the evidence is direct enough in terms of baseline risk affect the rating of indirectness of evidence.
2.4 Format of health care questions using the GRADE approach

Defining a health care question includes specifying all outcomes of interest. Those developing recommendations about whether or not to use a given intervention (therapeutic or diagnostic) have to consider all relevant outcomes simultaneously. The Guideline Development Tool allows the selection of two different formats for questions about management (a small sketch assembling such questions from their PICO elements follows the example questions below):

● Should [intervention] vs. [comparison] be used for [health problem]?
● Should [intervention] vs. [comparison] be used in [population]?

as well as one format for questions about diagnosis:

● Should [intervention] vs. [comparison] be used to diagnose [target condition] in [health problem and/or population]?

Example Questions

1. Should manual toothbrushes vs. powered toothbrushes be used for dental health?
2. Should topical nasal steroids be used in children with persistent allergic rhinitis?
3. Should oseltamivir versus no antiviral treatment be used to treat influenza?
4. Should troponin I followed by appropriate management strategies or troponin T followed by appropriate management strategies be used to manage acute myocardial infarction?
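As a small illustration of the formats above (not part of the Guideline Development Tool), the question strings can be assembled directly from PICO elements; the comparator and population values used here are hypothetical examples rather than questions taken from the handbook.

    def management_question(intervention, comparison, population):
        # Second management format listed above.
        return f"Should {intervention} vs. {comparison} be used in {population}?"

    def diagnosis_question(test, comparison, target_condition, population):
        # Diagnosis format listed above.
        return f"Should {test} vs. {comparison} be used to diagnose {target_condition} in {population}?"

    print(management_question("topical nasal steroids", "placebo",
                              "children with persistent allergic rhinitis"))
    print(diagnosis_question("troponin I", "troponin T",
                             "myocardial infarction", "patients with acute chest pain"))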
3. Selecting and rating the importance of outcomes

Training modules and courses: http://cebgrade.mcmaster.ca/QuestionsAndOutcomes/index.html

Given that recommendations cannot be made on the basis of information about single outcomes, and that decision-making always involves a balance between health benefits and harms, authors of systematic reviews will make their reviews more useful by looking at a comprehensive range of outcomes that allows decision making in health care. Many, if not most, systematic reviews fail to address some key outcomes, particularly harms, associated with an intervention.

In contrast, to make sensible recommendations guideline panels must consider all outcomes that are important or critical to patients for decision making. In addition, they may require consideration of outcomes that are important to others, including the use of resources paid for by third parties, equity considerations, impacts on those who care for patients, and public health impacts (e.g. the spread of infections or antibiotic resistance).

Guideline developers must base the choice of outcomes on what is important, not on what outcomes are measured and for which evidence is available. If evidence is lacking for an important outcome, this should be acknowledged rather than ignoring the outcome. Because most systematic reviews do not summarize the evidence for all important outcomes, guideline panels must often either use multiple systematic reviews from different sources, conduct their own systematic reviews or update existing reviews.
3.1 Steps for considering the relative importance of outcomes

Guideline developers must, and authors of systematic reviews are strongly encouraged to, specify all potential patient-important outcomes as the first step in their endeavour. Guideline developers will also make a preliminary classification of the importance of the outcomes. GRADE specifies three categories of outcomes according to their importance for decision-making:

● critical
● important but not critical
● of limited importance

Critical and important outcomes will bear on guideline recommendations; outcomes of limited importance will, in most situations, not. Ranking outcomes by their relative importance can help to focus attention on those outcomes that are considered most important, and help to resolve or clarify disagreements. Table 3.1 provides an overview of the steps for considering the relative importance of outcomes.

Guideline developers should first consider whether particular health benefits and harms of a therapy are important to the decision regarding the optimal management strategy, or whether they are of limited importance. If the guideline panel thinks that a particular outcome is important, it should then consider whether the outcome is critical to the decision or only important, but not critical.

To facilitate ranking of outcomes according to their importance, guideline developers may choose to rate outcomes numerically on a 1 to 9 scale (7 to 9 – critical; 4 to 6 – important; 1 to 3 – of limited importance) to distinguish between importance categories.
Practically, to generate a list of relevant outcomes, one can use the following type of scale.

Rating scale (1 = of least importance, 9 = of most importance):
1 to 3 – of limited importance for making a decision (not included in the evidence profile)
4 to 6 – important, but not critical for making a decision (included in the evidence profile)
7 to 9 – critical for making a decision (included in the evidence profile)
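The scale maps directly to the three importance categories; the following minimal sketch (an illustration only, not part of the GRADE software) restates that mapping and whether an outcome would appear in the evidence profile.

    def classify_importance(rating):
        # Map a 1-9 rating to the three GRADE importance categories and to whether
        # the outcome is included in the evidence profile.
        if not 1 <= rating <= 9:
            raise ValueError("rating must be between 1 and 9")
        if rating >= 7:
            return "critical for making a decision", True
        if rating >= 4:
            return "important, but not critical for making a decision", True
        return "of limited importance for making a decision", False

    print(classify_importance(8))  # ('critical for making a decision', True)
    print(classify_importance(2))  # ('of limited importance for making a decision', False)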
The first classification of the importance of outcomes should occur when the protocol of a systematic review is developed or when the panel agrees on the health care questions that should be addressed in a guideline; thus, it should be done before the evidence is reviewed. When evidence becomes available, a reassessment of importance may be necessary to ensure that important outcomes identified by reviews of the evidence that were not initially considered are included, and to reconsider the relative importance of outcomes in light of the available evidence. It is possible that there is no association between the outcome and the intervention of interest, which supports not considering that outcome further.

Guideline panels should be aware of the possibility that in some instances the importance of an outcome (e.g. a serious adverse effect) may only become known after the protocol is written, the evidence is reviewed or the analyses are carried out, and should take appropriate actions to include these outcomes in the evidence tables.

Example 1: Hierarchy of outcomes according to their importance to assess the effect of oseltamivir in patients with H5N1 influenza. Mortality in patients affected with H5N1 is as high as 50%. Patients are usually affected by severe respiratory compromise and require ventilatory support. Complications of a potentially useful medication, oseltamivir, are suspected to be of a temporary neurological nature; other adverse effects, such as nausea, also occur during treatment.
Example 2: Hierarchy of outcomes according to their importance to assess the effect of phosphate-lowering drugs in patients with renal failure and hyperphosphatemia
Example 3: Reassessment of the relative importance of outcomes

Consider, for instance, a screening intervention, such as screening for abdominal aortic aneurysm. Initially, a guideline panel is likely to consider the intervention's impact on all-cause mortality as critical. Let us say, however, that the evidence summary establishes an important reduction in cause-specific mortality from abdominal aortic aneurysm but fails to definitively establish a reduction in all-cause mortality. The reduction in cause-specific mortality may be judged sufficiently compelling that, even in the absence of a demonstrated reduction in all-cause mortality (which may be undetected because of random error from other causes of death), the screening intervention is clearly worthwhile. All-cause mortality then becomes less relevant and ceases to be a critical outcome.

The relative importance of outcomes should be considered when determining the overall quality of evidence, which may depend on which outcomes are ranked as critical or important (see Chapter Quality of evidence), and when judging the balance between the health benefits and harms of an intervention while formulating the recommendations (see Chapter Going from evidence to recommendations).

Only outcomes considered critical (rated 7-9) are the primary factors influencing a recommendation and will be used to determine the overall quality of evidence supporting a recommendation.
Table 3.1: Steps for considering the relative importance of outcomes

Step 1
What: Preliminary classification of outcomes as critical, important but not critical, or of low importance, before reviewing the evidence.
Why: To focus attention on those outcomes that are considered most important when searching for and summarizing the evidence, and to resolve or clarify disagreements.
How: Conducting a systematic review of the relevant literature; asking panel members and possibly patients or members of the public to identify important outcomes, judging the relative importance of the outcomes, and discussing disagreements.
Evidence: These judgments are ideally informed by a systematic review of the literature focusing on what the target population considers as critical or important outcomes for decision making. Literature about values, preferences or utilities is often used in these reviews, which should be systematic in nature. Alternatively, the collective experience of the panel members, patients, and members of the public can be used, with transparent methods for documenting and considering it (see Santesso N et al, IJOBGYN 2012). Prior knowledge of the research evidence or, ideally, a systematic review of that evidence is likely to be helpful.

Step 2
What: Reassessment of the relative importance of outcomes after reviewing the evidence.
Why: To ensure that important outcomes identified by reviews of the evidence that were not initially considered are included, and to reconsider the relative importance of outcomes in light of the available evidence.
How: By asking the panel members (and, if relevant, patients and members of the public) to reconsider the relative importance of the outcomes included in the first step and any additional outcomes identified by reviews of the evidence.
Evidence: Experience of the panel members and other informants, and systematic reviews of the effects of the intervention.

Step 3
What: Judging the balance between the desirable and undesirable health outcomes of an intervention.
Why: To support making a recommendation and to determine the strength of the recommendation.
How: By asking the panel members to balance the desirable and undesirable health outcomes using an evidence-to-recommendation framework that includes a summary of findings table or evidence profile and, if relevant, a decision analysis.
Evidence: Experience of the panel members and other informants, systematic reviews of the effects of the intervention, evidence of the value that the target population attaches to key outcomes (if relevant and available), and decision analyses or economic analyses (if relevant and available).

3.2 Influence of perspective

The importance of outcomes is likely to vary within and across cultures or when considered from the perspective of the target population (e.g. patients or the public), clinicians or policy-makers. Cultural diversity will often influence the relative importance of outcomes, particularly when developing recommendations for an international audience.

Guideline panels must decide what perspective they are taking. Although different panels may elect to take different perspectives (e.g. that of individual patients or a health systems perspective), the relative importance given to health outcomes should reflect the perspective of those who are affected. When the target audiences for a guideline are clinicians and the patients they treat, the perspective would generally be that of the patient (see Chapter Going from evidence to recommendations, which addresses the issue of perspective from the point of view of resource use).
3.3 Using evidence in rating the importance of outcomes

Guideline developers will ideally review evidence, or conduct a systematic review of the evidence, relating to patients' values and preferences about the intervention in question in order to inform the rating of the importance of outcomes. Reviewing the evidence may provide the panel with insight about the variability in patients' values, the patient experience of burden or side effects, and the weighing of desirable versus undesirable outcomes.

In the absence of such evidence, panel members should use their prior experience with the target population to assume the relevant values and preferences.

3.4 Surrogate (substitute) outcomes

Not infrequently, outcomes of most importance to patients remain unexplored. When important outcomes are relatively infrequent, or occur over long periods of time, investigators often choose to measure substitutes, or surrogates, for those outcomes.

Guideline developers should consider surrogate outcomes only when evidence about population-important outcomes is lacking. When this is the case, they should specify the population-important outcomes and, if necessary, the surrogates they are using to substitute for those important outcomes. Guideline developers should not list the surrogates themselves as their measures of outcome. The necessity to substitute the surrogate may ultimately lead to rating down the quality of the evidence because of indirectness (see Chapter Quality of evidence).

Outcomes selected by the guideline panel should be included in an evidence profile whether or not information about them is available (see Chapter Summarizing the evidence); that is, an empty row in an evidence profile can be informative in that it identifies research gaps.
4. Summarizing the evidence

A guideline panel should base its recommendation on the best available body of evidence related to the health care question. A guideline panel can use already existing high quality systematic reviews or conduct its own systematic review, depending on the specific circumstances such as the availability of high quality systematic reviews and resources, but GRADE recommends that systematic reviews should form the basis for making health care recommendations. One should seek evidence relating to all patient-important outcomes and for the values patients place on these outcomes, as well as for related management options.

The endpoint for systematic reviews and for HTAs restricted to evidence reports is a summary of the evidence, the quality rating for each outcome and the estimate of effect. For guideline developers and HTAs that provide advice to policymakers, a summary of the evidence represents a key milestone on the path to a recommendation. The evidence collected from systematic reviews is used to produce a GRADE evidence profile and summary of findings table.
4.1 Evidence Tables

An evidence table is a key tool in the presentation of evidence and the corresponding results. Evidence tables are a method for presenting the quality of the available evidence, the judgments that bear on the quality rating, and the effects of alternative management strategies on the outcomes of interest.

Clinicians, patients, the public, guideline developers, and policy-makers require succinct and transparent evidence summaries to support their decisions. While an unambiguous health care question is key to evidence summaries, the requirements of specific users may differ in content and detail. Therefore, the format of each table may be different depending on user needs.

Two approaches (with iterations) for evidence tables are available, which serve different purposes and are intended for different audiences:
● (GRADE) evidence profile
● Summary of Findings (SoF) table

The Guideline Development Tool facilitates the production of both evidence profiles and SoF tables. Once the information needed to populate the tables has been entered, it is stored and can be updated accordingly. Different formats for each approach, chosen according to what the target audience may prefer, are available.

Outcomes considered important (rated 4-6) or critical (rated 7-9) for decision-making should be included in the evidence profile and SoF table.
4.2 GRADE Evidence Profile

See online tutorials at: cebgrade.mcmaster.ca

The GRADE evidence profile contains detailed information about the quality of evidence assessment and the summary of findings for each of the included outcomes. It is intended for review authors, those preparing SoF tables, and anyone who questions a quality assessment. It helps those preparing SoF tables to ensure that the judgments they make are systematic and transparent, and it allows others to inspect those judgments. Guideline panels should use evidence profiles to ensure that they agree about the judgments underlying the quality assessments.

A GRADE evidence profile allows presentation of key information about all relevant outcomes for a given health care question. It presents information about the body of evidence (e.g. number of studies), the judgments about the underlying quality of evidence, key statistical results, and the quality of evidence rating for each outcome.

A GRADE evidence profile is particularly useful for presentation of evidence supporting a recommendation in clinical practice guidelines, but also as a summary of evidence for other purposes where users need or want to understand the judgments about the quality of evidence in more detail.

The standard format for the evidence profile includes:
● A list of the outcomes
● The number of studies and study design(s)
● Judgements about each of the quality of evidence factors assessed: risk of bias, inconsistency, indirectness, imprecision, other considerations (including publication bias and factors that increase the quality of evidence)
● The assumed risk: a measure of the typical burden of the outcomes, i.e. the illustrative risk, also called baseline risk, baseline score, or control group risk
● The corresponding risk: a measure of the burden of the outcomes after the intervention is applied, i.e. the risk of an outcome in treated/exposed people, based on the relative magnitude of the effect and the assumed (baseline) risk
● The relative effect: for dichotomous outcomes the table will usually provide a risk ratio, odds ratio, or hazard ratio
● The absolute effect: for dichotomous outcomes, the number of fewer or more events in the treated/exposed group compared with the control group (see the worked sketch after this list)
● Rating of the overall quality of evidence for each outcome (which may vary by outcome)
● Classification of the importance of each outcome
● Footnotes, if needed, to provide explanations about information in the table, such as elaboration on judgements about the quality of evidence
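The arithmetic linking the assumed (baseline) risk, the relative effect, and the corresponding risk and absolute effect entries can be illustrated with a brief sketch. This sketch is not part of the GRADE approach or of any GRADE software; the function names and numbers are hypothetical, and the simple multiplication shown applies to risk ratios (corresponding risks for odds ratios or hazard ratios require a different conversion).

# Illustrative only; not part of the GRADE approach. Shows how a corresponding
# risk and an absolute effect per 1000 people follow from an assumed (baseline)
# risk and a risk ratio. Names and numbers are hypothetical.

def corresponding_risk(assumed_risk, risk_ratio):
    """Risk in the treated/exposed group: assumed (baseline) risk times the risk ratio."""
    return assumed_risk * risk_ratio

def absolute_effect_per_1000(assumed_risk, risk_ratio):
    """Fewer (negative) or more (positive) events per 1000 people treated."""
    return (corresponding_risk(assumed_risk, risk_ratio) - assumed_risk) * 1000

baseline = 0.10   # assumed (control group) risk of 10%
rr = 0.75         # relative effect: risk ratio of 0.75
print(corresponding_risk(baseline, rr))        # 0.075, i.e. 7.5%
print(absolute_effect_per_1000(baseline, rr))  # -25.0, i.e. 25 fewer per 1000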
Example 1: GRADE Evidence Profile

[INSERT IMAGE]
4.3 Summary of Findings table

Summary of Findings tables provide a summary of the findings for each of the included outcomes and the quality of evidence rating for each outcome in a quick and accessible format, without the details of the judgements about the quality of evidence. They are intended for a broader audience, including end users of systematic reviews and guidelines. They provide a concise summary of the key information that is needed by someone making a decision and, in the context of a guideline, provide a summary of the key information underlying a recommendation.

The format of SoF tables produced using the Guideline Development Tool has been refined over the past several years through wide consultation, user testing, and evaluation. It is designed to support the optimal presentation of the key findings of systematic reviews. The SoF table format has been developed with the aim of ensuring consistency and ease of use across reviews, inclusion of the most important information needed by decision makers, and optimal presentation of this information. However, there may be good reasons for modifying the format of a SoF table for some reviews.

The standard format for the SoF table includes:
● A list of the outcomes
● The assumed risk: a measure of the typical burden of the outcomes, i.e. the illustrative risk, also called baseline risk, baseline score, or control group risk
● The corresponding risk: a measure of the burden of the outcomes after the intervention is applied, i.e. the risk of an outcome in treated/exposed people, based on the relative magnitude of the effect and the assumed (baseline) risk
● The relative effect: for dichotomous outcomes the table will usually provide a risk ratio, odds ratio, or hazard ratio
● The number of participants and the number of studies and their designs
● Rating of the overall quality of evidence for each outcome (which may vary by outcome)
● Footnotes or explanations, if needed, to provide explanations about information in the table
● Comments (if needed)

Systematic reviews that address more than one main comparison (e.g. examining the effects of a number of interventions) will require separate SoF tables for each comparison. Moreover, for each comparison of alternative management strategies, all outcomes should be presented together in one evidence profile or SoF table. It is likely that not all studies relevant to a health care question will provide evidence regarding every outcome. Indeed, there may be no overlap between the studies providing evidence for one outcome and those providing evidence for another. Because most existing systematic reviews do not adequately address all relevant outcomes, the GRADE process may require relying on more than one systematic review.

Example 2: GRADE Summary of Findings Table

[INSERT IMAGE]
5. Quality of evidence
GRADE provides a specific definition of the quality of evidence that is different in the context of making recommendations and in the context of summarizing the findings of a systematic review. As GRADE suggests somewhat different approaches for rating the quality of evidence for systematic reviews and for guidelines, the handbook highlights guidance that is specific to each group. HTA practitioners, depending on their mandate, can decide which approach is more suitable for their goals.

For guideline panels:

The quality of evidence reflects the extent to which our confidence in an estimate of the effect is adequate to support a particular recommendation.

Guideline panels must make judgments about the quality of evidence relative to the specific context for which they are using the evidence.

The GRADE approach involves separate grading of the quality of evidence for each patient-important outcome, followed by determining an overall quality of evidence across outcomes.

For authors of systematic reviews:

The quality of evidence reflects the extent to which we are confident that an estimate of the effect is correct.

Because systematic reviews do not, or at least should not, make recommendations, they require a different definition. Authors of systematic reviews grade the quality of a body of evidence separately for each patient-important outcome.

The quality of evidence is rated for each outcome across studies (i.e. for a body of evidence). This does not mean rating each study as a single unit. Rather, GRADE is "outcome centric": rating is done for each outcome, and quality may differ, indeed is likely to differ, from one outcome to another within a single study and across a body of evidence.

Example 1: Quality of evidence may differ from one outcome to another within a single study

In a series of unblinded RCTs measuring both the occurrence of stroke and all-cause mortality, it is possible that stroke, an outcome much more vulnerable to biased judgments, will be rated down for risk of bias, whereas all-cause mortality will not. Similarly, a series of studies in which very few patients are lost to follow-up for the outcome of death, and very many for the outcome of quality of life, is likely to result in judgments of lower quality for the latter outcome. Problems with indirectness may lead to rating down quality for one outcome and not another within a study or studies if, for example, fracture rates are measured using a surrogate (e.g. bone mineral density) but side effects are measured directly.

Although the quality of evidence represents a continuum, the GRADE approach results in an assessment of the quality of a body of evidence in one of four grades:

Table 5.1: Quality of Evidence Grades

High: We are very confident that the true effect lies close to that of the estimate of the effect.
Moderate: We are moderately confident in the effect estimate: the true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different.
Low: Our confidence in the effect estimate is limited: the true effect may be substantially different from the estimate of the effect.
Very low: We have very little confidence in the effect estimate: the true effect is likely to be substantially different from the estimate of the effect.

Quality of evidence is a continuum; any discrete categorisation involves some degree of arbitrariness. Nevertheless, the advantages of simplicity, transparency, and vividness outweigh these limitations.
5.1 Factors determining the quality of evidence

The GRADE approach to rating the quality of evidence begins with the study design (trials or observational studies) and then addresses five reasons to possibly rate down the quality of evidence and three reasons to possibly rate it up. The subsequent sections of the handbook address each of these factors in detail.

Table 5.2: Factors that can reduce the quality of the evidence

Limitations in study design or execution (risk of bias): ↓ 1 or 2 levels
Inconsistency of results: ↓ 1 or 2 levels
Indirectness of evidence: ↓ 1 or 2 levels
Imprecision: ↓ 1 or 2 levels
Publication bias: ↓ 1 or 2 levels

Table 5.3: Factors that can increase the quality of the evidence

Large magnitude of effect: ↑ 1 or 2 levels
All plausible confounding would reduce the demonstrated effect, or would increase the effect if no effect was observed: ↑ 1 level
Dose-response gradient: ↑ 1 level

While the factors influencing the quality of evidence are additive (the reduction or increase from each individual factor is added together with the other factors to reduce or increase the quality of evidence for an outcome), grading the quality of evidence involves judgements that are not exclusive. Therefore, GRADE is not a quantitative system for grading the quality of evidence. Each factor for downgrading or upgrading reflects not discrete categories but a continuum within each category and among the categories. When the body of evidence is intermediate with respect to a particular factor, the decision about whether a study falls above or below the threshold for up- or downgrading the quality (by one or more factors) depends on judgment.

For example, if there were some uncertainty about three factors (study limitations, inconsistency, and imprecision), but none serious enough on its own to warrant downgrading, one could reasonably make the case for downgrading, or for not doing so. One reviewer might in each category give the studies the benefit of the doubt and interpret the evidence as high quality. Another reviewer, deciding to rate down the evidence by one level, would judge the evidence as moderate quality. Reviewers should grade the quality of the evidence by considering each individual factor in the context of the other judgments they made about the quality of evidence for the same outcome.

In such a case, you should pick the one or two categories of limitations that you would offer as reasons for downgrading and explain your choice in a footnote. You should also provide a footnote next to the other factor, the one you decided not to downgrade for, explaining that there was some uncertainty but that you had already downgraded for the other factor, and that further lowering the quality of evidence for this outcome would seem inappropriate. GRADE strongly encourages review and guideline authors to be explicit and transparent when they find themselves in these situations by acknowledging borderline decisions.

Despite the limitations of breaking continua into categories, treating each criterion for rating quality up or down as a discrete category enhances transparency. Indeed, the great merit of GRADE is not that it ensures reproducible judgments but that it requires explicit judgment that is made transparent to users.

5.1.1 Study design

Study design is critical to judgments about the quality of evidence.

For recommendations regarding management strategies (as opposed to establishing prognosis or the accuracy of diagnostic tests), randomized trials provide, in general, far stronger evidence than observational studies, and rigorous observational studies provide stronger evidence than uncontrolled case series.

In the GRADE approach to quality of evidence:
● randomized trials without important limitations provide high quality evidence
● observational studies without special strengths or important limitations provide low quality evidence

Limitations or special strengths can, however, modify the quality of the evidence of both randomized trials and observational studies.

Note:

Non-randomised experimental trials (quasi-RCTs) without important limitations also start as high quality evidence, but they will automatically be downgraded for limitations in design (risk of bias), such as lack of allocation concealment when allocation is tied to a predictable feature such as the provider or the chart number.

Case series and case reports are observational studies that investigate only patients exposed to the intervention. Because the source of the control group results is implicit or unclear, they will usually warrant downgrading from low to very low quality evidence.

Expert opinion is not a category of quality of evidence. Expert opinion represents an interpretation of evidence in the context of experts' experiences and knowledge. Experts may hold opinions about evidence that are based on interpretation of studies ranging from uncontrolled case series (e.g. observations in the expert's own practice) to randomized trials and systematic reviews known to the expert. It is important to describe what type of evidence (whether published or unpublished) is being used as the basis for interpretation.
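To show how the study-design starting point described above interacts with the rating down and rating up factors in Tables 5.2 and 5.3, the following sketch walks through the level bookkeeping. It is a hypothetical illustration only: GRADE is not a quantitative system, and each rating remains a judgment rather than arithmetic.

# Illustrative only: GRADE ratings are judgments, not arithmetic.
GRADES = ["very low", "low", "moderate", "high"]

def overall_quality(study_design, rated_down=0, rated_up=0):
    """study_design: 'randomized trials' or 'observational studies';
    rated_down / rated_up: total levels from the factors in Tables 5.2 and 5.3."""
    start = 3 if study_design == "randomized trials" else 1   # high vs low starting point
    level = max(0, min(3, start - rated_down + rated_up))     # never outside the four grades
    return GRADES[level]

print(overall_quality("randomized trials", rated_down=1))     # "moderate"
print(overall_quality("observational studies", rated_up=2))   # "high"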
5.2 Factors that can reduce the quality of the evidence

The following sections discuss in detail the five factors that can result in rating down the quality of evidence for specific outcomes and, thereby, reduce confidence in the estimate of the effect.

5.2.1 Study limitations (Risk of Bias)

Limitations in study design and execution may bias the estimates of the treatment effect. Our confidence in the estimate of the effect, and in the recommendation that follows from it, decreases if studies suffer from major limitations. The more serious the limitations are, the more likely it is that the quality of evidence will be downgraded. Numerous tools exist to evaluate the risk of bias in randomized trials and observational studies. This handbook describes the key criteria used in the GRADE approach.

Our confidence in an estimate of effect decreases if studies suffer from major limitations that are likely to result in a biased assessment of the intervention effect. For randomized trials, the limitations outlined in Table 5.4 are likely to result in biased results.

Table 5.4: Study limitations in randomized controlled trials

Lack of allocation concealment: Those enrolling patients are aware of the group (or period in a crossover trial) to which the next enrolled patient will be allocated (a major problem in "pseudo" or "quasi" randomized trials with allocation by day of week, birth date, chart number, etc.).

Lack of blinding: Patients, caregivers, those recording outcomes, those adjudicating outcomes, or data analysts are aware of the arm to which patients are allocated (or the medication currently being received in a crossover trial).

Incomplete accounting of patients and outcome events: Loss to follow-up and failure to adhere to the intention-to-treat principle in superiority trials; or, in noninferiority trials, loss to follow-up and failure to conduct both analyses, one considering only those who adhered to treatment and one including all patients for whom outcome data are available. The significance of particular rates of loss to follow-up, however, varies widely and depends on the relation between loss to follow-up and the number of events: the higher the proportion lost to follow-up in relation to the intervention and control group event rates and the differences between the intervention and control groups, the greater the threat of bias (see the worst-case sketch after this table).

Selective outcome reporting: Incomplete or absent reporting of some outcomes and not others on the basis of the results.

Other limitations:
● Stopping a trial early for benefit. Substantial overestimates are likely in trials with fewer than 500 events and large overestimates are likely in trials with fewer than 200 events. Empirical evidence suggests that formal stopping rules do not reduce this bias.
● Use of unvalidated outcome measures (e.g. patient-reported outcomes)
● Carryover effects in crossover trials
● Recruitment bias in cluster-randomized trials
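One way to judge whether a given amount of loss to follow-up threatens the results is a worst-case check, in which every patient lost from the intervention group is assumed to have had the event and every patient lost from the control group is assumed not to have had it. The sketch below is an illustration only; it is not a required GRADE calculation, and the numbers are hypothetical.

# Illustrative worst-case check; not a required GRADE calculation.
def risk_ratio(events_int, total_int, events_ctl, total_ctl):
    return (events_int / total_int) / (events_ctl / total_ctl)

# Observed results among patients with complete follow-up
observed = risk_ratio(30, 470, 60, 480)                    # about 0.51

# Worst case: the 30 patients lost from the intervention arm all had events,
# and the 20 lost from the control arm had none.
worst_case = risk_ratio(30 + 30, 470 + 30, 60, 480 + 20)   # about 1.00

print(round(observed, 2), round(worst_case, 2))
# If the worst-case estimate crosses a decision threshold, loss to follow-up is
# large relative to the number of events and the threat of bias is serious.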
Systematic reviews of tools to assess the methodological quality of non-randomized studies have identified over 200 checklists and instruments. We summarize in Table 5.5 the key criteria for observational studies that reflect the contents of these checklists.

Table 5.5: Study limitations in observational studies

Failure to develop and apply appropriate eligibility criteria (inclusion of control population):
● Under- or over-matching in case-control studies
● Selection of exposed and unexposed in cohort studies from different populations

Flawed measurement of both exposure and outcome:
● Differences in measurement of exposure (e.g. recall bias in case-control studies)
● Differential surveillance for outcome in exposed and unexposed in cohort studies

Failure to adequately control confounding:
● Failure to accurately measure all known prognostic factors
● Failure to match for prognostic factors and/or to adjust for them in the statistical analysis

Incomplete or inadequately short follow-up: Especially within prospective cohort studies, both groups should be followed for the same amount of time.

Depending on the context and study type, there can be additional limitations beyond those listed above. Guideline panels and authors of systematic reviews should consider all possible limitations.

Guideline panels or authors of systematic reviews should consider the extent to which study limitations may bias the results (see Examples 1 to 7). If the limitations are serious, they may downgrade the quality rating by one or even two levels. Moving from risk of bias criteria for each individual study to a judgment about rating down the quality of evidence for risk of bias across a group of studies addressing a particular outcome presents challenges. We suggest the following principles:

1. In deciding on the overall quality of evidence, one does not average across studies (for instance, if some studies have no serious limitations, some serious limitations, and some very serious limitations, one does not automatically rate quality down by one level because of an average rating of serious limitations). Rather, judicious consideration of the contribution of each study, with a general guide to focus on the high-quality studies, is warranted.

2. This judicious consideration requires evaluating the extent to which each trial contributes toward the estimate of the magnitude of effect. This contribution will usually reflect study sample size and the number of outcome events: larger trials with many events will contribute more, and much larger trials with many more events will contribute much more (see the sketch after this list).

3. One should be conservative in the judgment of rating down. That is, one should be confident that there is substantial risk of bias across most of the body of available evidence before one rates down for risk of bias.

4. The risk of bias should be considered in the context of other limitations. If, for instance, reviewers find themselves in a close-call situation with respect to two quality issues (risk of bias and, say, precision), we suggest rating down for at least one of the two.

5. Reviewers will face close-call situations. They should acknowledge that they are in such a situation, make it explicit why they think this is the case, and make the reasons for their ultimate judgment apparent.
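The second principle, that larger trials with more events contribute more to the estimate, reflects how studies are weighted in a standard inverse-variance meta-analysis. The sketch below is an illustration only; the formula is the usual approximate variance of a log risk ratio, and the numbers are hypothetical.

# Illustrative only. In an inverse-variance meta-analysis a study's weight is
# 1 / variance of its effect estimate; for a log risk ratio the variance shrinks
# as the number of events grows, so larger trials with more events dominate.
def log_rr_variance(e1, n1, e2, n2):
    """Approximate variance of ln(risk ratio): 1/e1 - 1/n1 + 1/e2 - 1/n2."""
    return 1 / e1 - 1 / n1 + 1 / e2 - 1 / n2

small_trial = log_rr_variance(e1=8, n1=50, e2=12, n2=50)      # few events
large_trial = log_rr_variance(e1=80, n1=500, e2=120, n2=500)  # ten times the events

weights = [1 / small_trial, 1 / large_trial]
shares = [round(w / sum(weights), 2) for w in weights]
print(shares)  # roughly [0.09, 0.91]: the larger trial drives the pooled estimate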
For authors of systematic reviews:

Systematic reviewers working within the context of Cochrane systematic reviews can use the following guidance to assess study limitations (risk of bias) in Cochrane reviews. Chapter 8 of the Cochrane Handbook provides a detailed discussion of study-level assessments of risk of bias in the context of a Cochrane review, and proposes an approach to assessing the risk of bias for an outcome across studies as 'low risk of bias', 'unclear risk of bias' and 'high risk of bias' (Cochrane Handbook Chapter 8, Section 8.7). These assessments should feed directly into the assessment of study limitations. In particular, 'low risk of bias' would indicate 'no limitation'; 'unclear risk of bias' would indicate either 'no limitation' or 'serious limitation'; and 'high risk of bias' would indicate either 'serious limitation' or 'very serious limitation' in the GRADE approach. Cochrane systematic review authors must use their judgment to decide between the alternative categories, depending on the likely magnitude of the potential biases.

Every study addressing a particular outcome will differ, to some degree, in its risk of bias. Review authors must make an overall judgment on whether the quality of evidence for an outcome warrants downgrading on the basis of study limitations. The assessment of study limitations should apply to the studies contributing to the results in the Summary of Findings table, rather than to all studies that could potentially be included in the analysis.

Table 5.6: Guidance to assess study limitations (risk of bias) in Cochrane Reviews and the corresponding GRADE assessment of quality of evidence

Low risk of bias
Across studies: Most information is from studies at low risk of bias.
Interpretation: Plausible bias unlikely to seriously alter the results.
Considerations: No apparent limitations.
GRADE assessment of study limitations: No serious limitations, do not downgrade.

Unclear risk of bias
Across studies: Most information is from studies at low or unclear risk of bias.
Interpretation: Plausible bias that raises some doubt about the results.
Considerations: Potential limitations are unlikely to lower confidence in the estimate of effect.
GRADE assessment of study limitations: No serious limitations, do not downgrade.
Considerations: Potential limitations are likely to lower confidence in the estimate of effect.
GRADE assessment of study limitations: Serious limitations, downgrade one level.

High risk of bias
Across studies: The proportion of information from studies at high risk of bias is sufficient to affect the interpretation of results.
Interpretation: Plausible bias that seriously weakens confidence in the results.
Considerations: Crucial limitation for one criterion, or some limitations for multiple criteria, sufficient to lower confidence in the estimate of effect.
GRADE assessment of study limitations: Serious limitations, downgrade one level.
Considerations: Crucial limitation for one or more criteria sufficient to substantially lower confidence in the estimate of effect.
GRADE assessment of study limitations: Very serious limitations, downgrade two levels.
Example 1: Unclear Risk of Bias (Not Downgraded)

A systematic review investigated whether fewer people with cancer died when given anticoagulants compared to a placebo. There were 5 RCTs. Three studies had unclear sequence generation, as it was not reported by the authors, and one study (contributing few patients to the meta-analysis) had unclear allocation concealment and incomplete outcome data. In this case, the overall limitations were not serious and the evidence was not downgraded for risk of bias.

Example 2: Unclear Risk of Bias (Downgraded by One Level)

A systematic review of the effects of testosterone on erection satisfaction in men with low testosterone identified four RCTs. The largest trial's results were reported only as "not significant" and could not, therefore, contribute to the meta-analysis. Data from the three smaller trials suggested a large treatment effect (1.3 standard deviations, 95% confidence interval 0.2 to 2.3). The authors could not obtain the missing data and could not be confident in the large treatment effect; therefore, they rated down the body of evidence for selective reporting bias in the largest study.

In another scenario, the review authors did obtain the complete data from the larger trial. After including the less impressive results of the large trial, the magnitude of the effect was smaller and no longer statistically significant (0.8 standard deviations, 95% confidence interval 0.05 to 1.63). In that case, the evidence would not be downgraded.

Example 3: High Risk of Bias due to lack of blinding (Downgraded by One Level)

RCTs of the effects of Intervention A on acute spinal injury measured both all-cause mortality and, based on a detailed physical examination, motor function. The outcome assessors were not blinded for any outcomes. Blinding of outcome assessors is less important for the assessment of all-cause mortality, but crucial for motor function. The quality of the evidence for the mortality outcome may not be downgraded. However, the quality may be downgraded for the motor function outcome.

Example 4: High Risk of Bias due to lack of allocation concealment (Downgraded by One Level)

A systematic review of 2 RCTs showed that family therapy for children with asthma improved daytime wheeze. However, allocation was clearly not concealed in the two included trials. This limitation might warrant downgrading the quality of evidence by one level.

Example 5: High Risk of Bias (Downgraded by One Level)

A review was conducted to assess the effects of early versus late treatment of influenza with oseltamivir in observational studies. Researchers found 8 observational studies that assessed the risk of mortality. The statistical analysis in all 8 studies did not adjust for potential confounding risk factors such as age, chronic lung conditions, vaccination or immune status. The quality of the evidence was therefore downgraded from low to very low for serious limitations in study design.

Example 6: High Risk of Bias (Downgraded by Two Levels)

Three RCTs of the effects of surgery on patients with lumbar disc prolapse measured symptoms after 1 year or longer. The RCTs suffered from inadequate concealment of allocation and unblinded assessment of outcome by potentially biased raters (surgeons) using a non-validated rating instrument. The benefit of surgery is uncertain. The quality of the evidence was downgraded by two levels because of these study limitations.

Example 7: High Risk of Bias (Downgraded by Two Levels)

The evidence for the effect of sublingual immunotherapy in children with allergic rhinitis on the development of asthma comes from a single randomized trial with no description of randomization, concealment of allocation, or type of analysis; there was no blinding, and 21% of children were lost to follow-up. These very serious limitations would warrant downgrading the quality of evidence by two levels, from high to low.
5.4 Overall quality of evidence
6. Going from evidence to recommendations
5.2.2 Inconsistency of results

Inconsistency refers to an unexplained heterogeneity of results.

True differences in the underlying treatment effect may be likely when there are widely differing estimates of the treatment effect (i.e. heterogeneity or variability in results) across studies. Investigators should explore explanations for heterogeneity, and if they cannot identify a plausible explanation, the quality of evidence should be downgraded. Whether it is downgraded by one or two levels will depend on the magnitude of the inconsistency in the results.

Patients vary widely in their pre-intervention or baseline risk of the adverse outcomes that health care interventions are designed to prevent (e.g. death, stroke, myocardial infarction). As a result, risk differences (absolute risk reductions) in subpopulations tend to vary widely. Relative risk (RR) reductions, on the other hand, tend to be similar across subgroups, even if subgroups have substantial differences in baseline risk. Therefore, when we refer to inconsistencies in effect size, we are referring to relative measures (risk ratios and hazard ratios, which are preferred, or odds ratios).

When easily identifiable patient characteristics confidently permit classifying patients into subpopulations at appreciably different risk, absolute differences in outcome between intervention and control groups will differ substantially between these subpopulations. This may well warrant differences in recommendations across subpopulations, rather than downgrading the quality of evidence for inconsistency in effect size.

Although there are statistical methods to measure heterogeneity, there are a variety of other criteria to assess heterogeneity, which can also be used when results cannot be pooled statistically. Criteria to determine whether to downgrade for inconsistency can be applied when results are from more than one study and include:

1. Wide variance of point estimates across studies (note: direction of effect is not a criterion for inconsistency)

2. Minimal or no overlap of confidence intervals (CI), which suggests that the variation is more than one would expect by chance alone

3. Statistical criteria, including tests of heterogeneity, which test the null hypothesis that all studies have the same underlying magnitude of effect: a low p-value (p < 0.05) indicates that the null hypothesis should be rejected

4. A large I2 statistic, which quantifies the proportion of the variation in point estimates that is due to among-study differences (see the note below for decisions based on the I2 statistic)
Note:
While determining what constitutes a large I2 value is subjective, the following rule of thumb can be used:

● < 40% may be low
● 30-60% may be moderate
● 50-90% may be substantial
● 75-100% may be considerable

Overlaps in these ranges, and the use of "may be" as terminology, illustrate the uncertainty involved in making such judgments. It is also important to note the implicit limitations of this statistic. When individual study sample sizes are small, point estimates may vary substantially, but because the variation can be explained by chance, I2 may be low. Conversely, when study sample sizes are large, a relatively small difference in point estimates can yield a large I2. Another statistic, τ2 (tau squared), is a measure of the between-study variability that has an advantage over other measures in that it does not depend on sample size.

All statistical approaches have limitations, and their results should be seen in the context of a subjective examination of the variability in point estimates and the overlap in CIs.
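For readers who want to see where these statistics come from, the following minimal sketch (in Python, using made-up log risk ratios and standard errors for four hypothetical studies, not data from any review discussed here) computes Cochran's Q, I2 and the DerSimonian-Laird τ2 from study-level estimates.

    import math

    # Hypothetical study results: log risk ratios and their standard errors.
    log_rr = [-0.36, -0.11, -0.51, -0.22]
    se = [0.12, 0.15, 0.20, 0.10]

    weights = [1 / s ** 2 for s in se]                 # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, log_rr)) / sum(weights)

    # Cochran's Q: weighted squared deviations of each study from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_rr))
    df = len(log_rr) - 1

    # I2: proportion of the variation beyond what chance alone would produce.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # tau2: between-study variance (DerSimonian-Laird moment estimator),
    # which, unlike I2, is not driven by the sizes of the individual studies.
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c)

    print(f"Q = {q:.2f} on {df} df, I2 = {i2:.0f}%, tau2 = {tau2:.4f}")

Whatever such a calculation reports, the decision to rate down for inconsistency should still rest on inspection of the point estimates and confidence intervals themselves, as emphasized above.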
Example 1: Differences in direction, but minimal heterogeneity
Consider the figure below: a forest plot with four studies, two on either side of the line of no effect. We would have no inclination to rate down for inconsistency. Differences in direction, in and of themselves, do not constitute a criterion for variability in effect if the magnitude of the differences in point estimates is small.
[INSERT IMAGE]

Example 2: When inconsistency is large, but differences are between small and large beneficial effects
As we define quality of evidence for a guideline, inconsistency is important only when it reduces confidence in results in relation to a particular decision. Even when inconsistency is large, it may not reduce confidence in results regarding a particular decision. Consider the figure below, in which variability is substantial, but the differences are between small and large treatment effects.
Guideline developers may or may not consider this degree of variability important. Systematic review authors, who are much less in a position to judge whether the apparent high heterogeneity can be dismissed on the grounds that it is unimportant, are more likely to rate down for inconsistency.
[INSERT IMAGE]

Example 3: Substantial heterogeneity, of unequivocal importance
Consider the figure below. The magnitude of the variability in results is identical to that of the figure presented in Example 2. However, because two studies suggest benefit and two suggest harm, we would unquestionably choose to rate down the quality of evidence as a result of inconsistency.
[INSERT IMAGE]

Example 4: Test a priori hypotheses about inconsistency even when inconsistency appears to be small
A meta-analysis of randomized trials of rofecoxib looking at the outcome of myocardial infarction found apparently consistent results (heterogeneity p=0.82, I2=0%). Yet, when the investigators examined the effect in trials that used an external endpoint committee (RR 3.88, 95% CI: 1.88, 8.02) vs. trials that did not (RR 0.79, 95% CI: 0.29, 2.13), they found differences that were large and unlikely to be explained by chance (p=0.01).
Although the issue is controversial, we recommend that meta-analyses include formal tests of whether a priori hypotheses explain inconsistency between important subgroups, even if the variability that exists appears to be explained by chance (e.g. high p-values in tests of heterogeneity, and low I2 values).

If the effect size differs across studies, explanations for the inconsistency may lie in differences in:
● populations (e.g. drugs may have larger relative effects in sicker populations)
● interventions (e.g. larger effects with higher drug doses)
● outcomes (e.g. duration of follow-up)
● study methods (e.g. RCTs with higher and lower risk of bias).

If inconsistency can be explained by differences in populations, interventions or outcomes, review authors should offer different estimates across patient groups, interventions, or outcomes. Guideline panelists are then likely to offer different recommendations for different patient groups and interventions. If study methods provide a compelling explanation for differences in results between studies, then authors should consider focusing on effect estimates from studies with a lower risk of bias.

If large variability in the magnitude of effect remains unexplained and authors fail to attribute it to differences in one of these four variables, then the quality of evidence decreases. Review authors and guideline panels should also consider the extent to which they are uncertain about the underlying effect because of the inconsistency. This uncertainty relates to how important the inconsistency is to confidence in the result, and is used to decide whether to downgrade the quality rating by one or even two levels.

Example 5: Making separate recommendations for subpopulations
When the analysis of the benefits of endarterectomy was pooled across patients with stenosis of the carotid artery, there was high heterogeneity. The heterogeneity was explored and was explained by separating out patients who were symptomatic with high-degree stenosis (in whom endarterectomy was beneficial) and patients who were asymptomatic with moderate-degree stenosis (in whom surgery was not beneficial). The authors presented and graded the evidence by patient group and did not downgrade the quality of the evidence for inconsistency. Two different recommendations were also made according to patient group by the guideline panel.
5.2.2.1 Deciding whether to use estimates from a subgroup analysis

Finding an explanation for inconsistency is preferable. An explanation can be based on differences in population, intervention, or outcomes that mandate two or more estimates of effect, possibly with separate recommendations. However, subgroup effects may prove spurious and may not explain all the variability in the results. Indeed, most putative subgroup effects ultimately prove spurious. A cautionary note about subgroup analyses and their presentation is therefore warranted; refer to Sun et al. 2010 and Guyatt et al. 2011 for further reading.

Review authors and guideline developers must exercise a high degree of skepticism regarding potential subgroup effect explanations, paying particular attention to the following 7 criteria:

1. Is the subgroup variable a characteristic specified at baseline or after randomization? (subgroup hypotheses should be developed a priori)

2. Is the subgroup difference suggested by comparisons within rather than between studies?

3. Does statistical analysis suggest that chance is an unlikely explanation for the subgroup difference?

4. Did the hypothesis precede rather than follow the analysis, and did it include a hypothesized direction that was subsequently confirmed?

5. Was the subgroup hypothesis one of a small number tested?

6. Is the subgroup difference consistent across studies and across important outcomes?

7. Does external evidence (biological or sociological rationale) support the hypothesized subgroup difference?

The credibility of subgroup effects is not a matter of yes or no, but a continuum. Judgement is required to determine how convincing a subgroup analysis is based on the above criteria.

Example 6: Subgroup analysis explaining inconsistency in results
A systematic review and individual patient data meta-analysis (IPDMA) addressed the impact of high vs. low positive end-expiratory pressures (PEEPs) in three randomized trials that enrolled 2,299 adult patients with severe acute lung injury requiring mechanical ventilation.
The results of this IPDMA suggested a possible reduction in deaths in hospital with the higher PEEP strategy, but the difference was not statistically significant (RR 0.94; 95% CI: 0.86, 1.04). In patients with severe disease (labeled acute respiratory distress syndrome), the effect more clearly favored the high PEEP strategy (RR 0.90; 95% CI: 0.81, 1.00; P=0.049). In patients with mild disease, results suggested that the high PEEP strategy may be inferior (RR 1.37; 95% CI: 0.98, 1.92).
Applying the seven criteria (see the table below), we find that six are met fully, and the seventh, consistency across trials and outcomes, partially: the results of the subgroup analysis were consistent across the three studies, but other ways of measuring severity of lung injury (for instance, treating severity as a continuous variable) failed to show a statistically significant interaction between the severity and the magnitude of effect. In this case, the subgroup analysis is relatively convincing.
[INSERT IMAGE]

Example 7: Subgroup analysis not very likely to explain inconsistency in results
Three randomized trials have tested the effects of vasopressin vs. epinephrine on survival in patients with cardiac arrest. The results show appreciable differences in point estimates, widely overlapping CIs, a p-value for the test of heterogeneity of 0.21 and an I2 of 35%.
Two of the trials included both patients in whom asystole was responsible for the cardiac arrest and patients in whom ventricular fibrillation was the offending rhythm. One of these two trials reported a borderline statistically significant benefit (our own analysis was borderline nonsignificant) of vasopressin over epinephrine restricted to patients with asystole (in contrast to patients whose cardiac arrest was induced by ventricular fibrillation).
It is not very likely that the subgroup analysis can explain the moderate inconsistency in the results. Chance can explain the putative subgroup effect, and the hypothesis fails other criteria (including being one of a small number of a priori hypotheses and consistency of effect). Here, guideline developers should make recommendations on the basis of the pooled estimate of data from both groups. Whether the quality of evidence should be rated down for inconsistency is another judgment call; we would argue for not rating down for inconsistency.
[INSERT IMAGE]
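Criterion 3 above asks whether chance is an unlikely explanation for an apparent subgroup difference. Where a review reports subgroup estimates with confidence intervals, a simple test of interaction can inform that judgment. The sketch below (Python, using invented subgroup risk ratios purely for illustration, not the vasopressin data above) compares two subgroup estimates on the log scale.

    import math

    def log_and_se(rr, lo, hi):
        """Log risk ratio and its standard error, recovered from a 95% CI."""
        return math.log(rr), (math.log(hi) - math.log(lo)) / (2 * 1.96)

    # Invented subgroup results, for illustration only.
    y1, se1 = log_and_se(0.70, 0.50, 0.98)   # subgroup A
    y2, se2 = log_and_se(1.05, 0.80, 1.38)   # subgroup B

    z = (y1 - y2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided p-value for the interaction

    print(f"z = {z:.2f}, interaction p = {p:.3f}")

A modest interaction p-value of this kind would not, on its own, make chance an unlikely explanation; the remaining six criteria still have to be weighed.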
5.2.3 Indirectness of evidence

We are more confident in the results when we have direct evidence. Direct evidence comes from research that directly compares the interventions in which we are interested, delivered to the populations in which we are interested, and measures the outcomes important to patients.

Authors of systematic reviews and guideline panels making recommendations should consider the extent to which they are uncertain about the applicability of the evidence to their relevant question and, where appropriate, downgrade the quality rating by one or even two levels.

For authors of systematic reviews:
Directness is judged by the users of evidence tables, depending on the target population, intervention, and outcomes of interest. Authors of systematic reviews should answer the health care question they asked and, thus, they will rate the directness of the evidence they found. The considerations made by the authors of systematic reviews may be different from those of the guideline panels that use the systematic reviews. The more clearly and explicitly the health care question was formulated, the easier it will be for users to understand the systematic review authors' judgments.

There are four sources of indirectness:

1. Differences in population (applicability)

Differences between study populations within a systematic review are a common problem for systematic review authors and guideline panels. When this occurs, the evidence is indirect. The effect on the overall quality of evidence will vary depending on how different the study populations are: quality may not decrease, may decrease by one level, or may decrease by two levels in extreme cases.

The above discussion refers to different human populations, but sometimes the only evidence will be from animal studies, such as studies in rats or primates. In general, we would rate such evidence down two levels for indirectness. Animal studies may, however, provide an important indication of drug toxicity. Although toxicity data from animals do not reliably predict toxicity in humans, evidence of animal toxicity should engender caution in recommendations. Other types of nonhuman studies (e.g. laboratory evidence) may generate high quality evidence.

Example 1: Indirectness in Populations (Downgraded by Two Levels)
High-quality randomized trials have demonstrated the effectiveness of antiviral treatment for seasonal influenza. The panel judged that the biology of seasonal influenza was sufficiently different from that of avian influenza (the avian influenza organism may be far less responsive to antiviral agents than seasonal influenza) that the evidence required rating down quality by two levels, from high to low, due to indirectness.

Example 2: Non-human studies providing high quality evidence (Not Downgraded)
Consider laboratory evidence of changes in the resistance patterns of bacteria to antimicrobial agents (e.g. the emergence of methicillin-resistant Staphylococcus aureus, MRSA). These laboratory findings may constitute high quality evidence for the superiority of antibiotics to which MRSA is sensitive vs. methicillin as the initial treatment of suspected staphylococcal sepsis in settings in which MRSA is highly prevalent.

2. Differences in interventions (applicability)

Systematic reviewers will make a concerted effort to ensure that only studies with directly relevant interventions are included in their review. However, exceptions may still occur. Generally, when interventions that are only indirectly related to the question of interest are included in a systematic review, the quality of evidence will be decreased. In some instances the intervention used will be the same, but it may be delivered differently depending on the setting.

Example 3: Interventions delivered differently in different settings (Downgraded by One Level)
A systematic review of music therapies for autism found that trials tested structured approaches that are used more commonly in North America than in Europe. Because the interventions differ, the results from structured approaches are more applicable to North America and the results of less structured approaches are more applicable in Europe.
Guideline panelists should consider rating down the quality of the evidence if the intervention cannot be implemented with the same rigor or technical sophistication in their setting as in the RCTs from which the data come.

Example 4: Trials of related interventions (Downgraded by One or Two Levels)
Guideline developers may often find the best evidence addressing their question in trials of related, but different, interventions. A guideline addressing the value of colonoscopic screening for colon cancer will find the randomized controlled trials (RCTs) of fecal occult blood screening that showed a decrease in colon cancer mortality. Whether to rate down quality by one or two levels due to indirectness in this context is a matter of judgment.

Example 5: Indirectness in Interventions (Not Downgraded)
Older trials show a high efficacy of intramuscular penicillin for gonococcal infection, but guidelines might reasonably recommend alternative antibiotic regimens based on current local in vitro resistance patterns, which would not warrant downgrading the quality of evidence for indirectness.

Example 6: Interventions not sufficiently different (Not Downgraded)
Trials of simvastatin show a reduction in cardiovascular mortality. Suggesting night rather than morning dosing (because of greater cholesterol reduction) would not warrant rating down quality for differences in the intervention.

3. Differences in outcome measures (surrogate outcomes)

GRADE specifies that both those conducting systematic reviews and those developing practice guidelines should begin by specifying every important outcome of interest. The available studies may have measured the impact of the intervention of interest on outcomes related to, but different from, those of primary importance to patients.
The difference between desired and measured outcomes may relate to the time frame (e.g. an outcome measured at 3 months vs. at 12 months). Another source of indirectness related to the measurement of outcomes is the use of substitute or surrogate endpoints in place of the patient-important outcome of interest.

Table 5.7: Common surrogate measures and corresponding patient-important outcomes

Condition | Patient-important outcome(s) | Surrogate outcome(s)
Diabetes mellitus | Diabetic symptoms, hospital admission, complications (cardiovascular, eye, renal, neuropathic) | Blood glucose, A1C
Hypertension | Cardiovascular death, myocardial infarction, stroke | Blood pressure
Dementia | Patient function, behavior, caregiver burden | Cognitive function
Osteoporosis | Fractures | Bone density
Adult Respiratory Distress Syndrome | Mortality | Oxygenation
End-stage renal disease | Quality of life, morbidity (such as shunt thrombosis or heart failure), mortality | Hemoglobin
Venous thrombosis | Symptomatic venous thrombosis | Asymptomatic venous thrombosis
Chronic respiratory disease | Quality of life, exacerbations, mortality | Pulmonary function, exercise capacity
Cardiovascular disease | Myocardial infarction, vascular events, mortality | Serum lipids, coronary calcification, calcium/phosphate metabolism

In general, the use of a surrogate outcome requires rating down the quality of evidence by one, or even two, levels. Consideration of the biology, mechanism, and natural history of the disease can be helpful in making a decision about indirectness. For surrogates that are far removed in the putative causal pathway from the patient-important endpoints, we would rate down the quality of evidence for this outcome by two levels. Surrogates that are closer in the putative causal pathway to the outcomes warrant rating down by only one level for indirectness.

Example 7: Time differences in outcomes (Downgraded by One Level)
A systematic review of behavioral and cognitive-behavioral interventions for outwardly directed aggressive behavior in people with learning disabilities showed that a program of 3-week relaxation training significantly reduced disruptive behaviors at 3 months. Unfortunately, no eligible trial assessed the review authors' predefined outcome of interest, the long-term impact defined as the effect at 9 months or later. The argument for rating down quality for indirectness becomes stronger when one considers that other types of behavioral interventions have shown an early beneficial effect that was not sustained at 6 months of follow-up.

Example 8: Surrogate outcomes (Downgraded by One or Two Levels)
Calcium and phosphate metabolism are far removed in the causal pathway from patient-important outcomes such as myocardial infarction, and warrant rating down the quality of evidence by two levels. Surrogate outcomes that are closer in the causal pathway to the patient-important outcomes, such as coronary calcification for myocardial infarction, bone density for fractures, and soft-tissue calcification for pain, warrant rating down quality by one level for indirectness.

Example 9: Uncertainty in the relationship between a surrogate and a patient-important outcome (Downgraded by One Level)
Investigators examined the "validity" of progression-free survival as a surrogate for overall survival for anthracycline- and taxane-based chemotherapy in advanced breast cancer. They found a statistically significant association between progression-free and overall survival in the randomized trials they analyzed, but predicting overall survival using progression-free survival remained uncertain. Rating down quality by one level for indirectness would be appropriate in this situation.

4. Indirect comparisons

Indirect comparison occurs when a comparison of intervention A versus B is not available, but A was compared with C and B was compared with C. Such studies allow an indirect comparison of the magnitude of effect of A versus B. As a result of the indirect comparison, this evidence is of lower quality than head-to-head comparisons of A and B would provide.

The validity of an indirect comparison rests on the assumption that factors in the design of the trials (the patients, co-interventions, measurement of outcomes) and their methodological quality are not sufficiently different to result in different effects (in other words, that true differences in effect explain all apparent differences). Some authors refer to this as the "similarity assumption". Because this assumption is always in some doubt, indirect comparisons always warrant rating down the quality of evidence by one level. Whether to rate down two levels depends on the plausibility that alternative factors (population, interventions, co-interventions, outcomes, and study methods) explain or obscure differences in effect.
Example 10: Indirect comparison of low- vs. medium-dose aspirin (Downgraded by One Level)
A systematic review considered the relative merits of low-dose vs. medium-dose aspirin to prevent graft occlusion after coronary artery bypass surgery. The authors found five relevant trials that compared aspirin with placebo, of which two tested medium-dose and three low-dose aspirin. The pooled relative risk of graft occlusion was 0.74 (95% CI: 0.60, 0.91) in the low-dose trials and 0.55 (95% CI: 0.28, 0.82) in the medium-dose trials. The indirect RR of medium vs. low dose was 0.74 (95% CI: 0.52, 1.06; P = 0.10), suggesting the possibility of a larger effect with the medium-dose regimens. This comparison is weaker than if the randomized trials had compared the two aspirin dose regimens directly, because other study characteristics might be responsible for any differences found.
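The adjusted indirect comparison in Example 10 can be approximated with the Bucher method: take the difference of the log relative risks from the two placebo-controlled comparisons and combine their uncertainties. The sketch below (Python) back-calculates standard errors from the rounded confidence intervals quoted above, so its interval is only an approximation and will not exactly reproduce the published 0.52 to 1.06.

    import math

    def log_and_se(rr, lo, hi):
        """Log risk ratio and an approximate SE recovered from a rounded 95% CI."""
        return math.log(rr), (math.log(hi) - math.log(lo)) / (2 * 1.96)

    y_low, se_low = log_and_se(0.74, 0.60, 0.91)   # low-dose aspirin vs. placebo
    y_med, se_med = log_and_se(0.55, 0.28, 0.82)   # medium-dose aspirin vs. placebo

    # Bucher adjusted indirect comparison: medium vs. low dose.
    y_ind = y_med - y_low
    se_ind = math.sqrt(se_low ** 2 + se_med ** 2)
    lo = math.exp(y_ind - 1.96 * se_ind)
    hi = math.exp(y_ind + 1.96 * se_ind)

    print(f"indirect RR (medium vs. low) = {math.exp(y_ind):.2f}, 95% CI {lo:.2f} to {hi:.2f}")

The point estimate matches the 0.74 reported above; the wider interval produced by this back-calculation is a reminder that indirect comparisons inherit the uncertainty of both component comparisons, which is one reason they start one level lower in quality.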
Example 11: Network meta-analysis (Downgraded by Two Levels)
Investigators conducted a simultaneous comparison of 12 new-generation antidepressants. The authors evaluated 117 randomized trials involving over 25,000 patients; their article provides no information about the similarity of the patients, or about co-interventions. In correspondence, however, the authors indicated that they excluded trials with treatment-resistant depression, argued that different types of depression have similar treatment responses, and judged it very likely that patients did not receive important co-interventions. With respect to risk of bias, the authors tell us, using the Cochrane Collaboration approach to assessing risk of bias, that the risk of bias in most studies was "unclear" and that 12 were at low risk of bias; presumably a small number was at high risk of bias. This is helpful, although "unclear" represents a wide range of risk of bias. All studies involved head-to-head comparisons between at least two of the 12 drugs; the 117 trials involved 70 individual comparisons (e.g., two comparisons between fluoxetine and fluvoxamine). The authors reported statistically significant differences between direct and indirect comparisons in only three of the 70 comparisons of drug response. The power of such tests was, however, not likely to be high. Overall, we would be inclined to take a cautious approach to this network meta-analysis and rate down two levels for indirectness.
5.2.4 Imprecision

In general, results are imprecise when studies include relatively few patients and few events and thus have a wide confidence interval (CI) around the estimate of the effect. In this case, one may judge the quality of the evidence to be lower than it otherwise would be because of the resulting uncertainty about the results.

In addition to describing how the 95% confidence interval should be used as the primary criterion for making judgements about imprecision, we introduce the optimal information size (OIS) as a second, necessary criterion for determining adequate precision.

Because GRADE defines the quality of evidence differently for systematic reviews and for guidelines, the criteria for downgrading for imprecision differ: guideline panels need to consider the context of a recommendation and the other outcomes, whereas judgments about specific outcomes in a systematic review are free of that context. The GRADE approach therefore suggests separate guidance for determining imprecision, as described in the following sections.
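The dependence of precision on the number of events can be made concrete with a small calculation. The sketch below (Python, with invented counts; both hypothetical trials have the same 25% relative risk reduction) shows how few events produce a wide confidence interval around a risk ratio, while many events narrow it.

    import math

    def rr_ci(events_tx, n_tx, events_ctrl, n_ctrl):
        """Risk ratio with a 95% CI from the standard large-sample log-scale formula."""
        rr = (events_tx / n_tx) / (events_ctrl / n_ctrl)
        se = math.sqrt(1 / events_tx - 1 / n_tx + 1 / events_ctrl - 1 / n_ctrl)
        lo = math.exp(math.log(rr) - 1.96 * se)
        hi = math.exp(math.log(rr) + 1.96 * se)
        return round(rr, 2), round(lo, 2), round(hi, 2)

    print(rr_ci(6, 100, 8, 100))          # few events: RR 0.75, roughly 0.27 to 2.08
    print(rr_ci(600, 10000, 800, 10000))  # many events: RR 0.75, roughly 0.68 to 0.83

The first interval is compatible with both a large benefit and an appreciable harm; the second is not. This is the situation the 95% CI criterion is intended to capture, and the optimal information size discussed below guards against being misled when an interval happens to look narrow despite few events.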
5.2.4.1 Imprecision in guidelines

For guideline panels:
Quality of evidence refers to the extent to which our confidence in the estimate of an effect is adequate to support a particular decision. In guidelines, all outcomes are considered together, with attention to whether they are critical, or important but not critical.

For guideline panels, the decision to rate down the quality of evidence for imprecision depends on the threshold that represents the basis for a management decision and on consideration of the trade-off between desirable and undesirable consequences. Determining the acceptable threshold inevitably involves judgement that must be made explicit.

For dichotomous outcomes

Guideline developers must consider the context of the particular recommendation to determine whether the results of a dichotomous (binary) outcome are sufficiently precise to support that recommendation. Setting a specific threshold for an acceptable estimate of treatment effect will involve judgement in the context of factors such as side effects, drug toxicity, and cost (see Example 1). Examining the lower and upper boundaries of the CI in relation to the threshold set by the guideline panel, and then determining whether the criteria for the optimal information size are met, will help in deciding whether to rate down for imprecision.

We suggest that guideline developers consider the following steps in deciding whether to rate down the quality of evidence for imprecision in guidelines:

1. First consider whether the boundaries of the CI are on the same side of the decision-making threshold. Does the CI cross the clinical decision threshold between recommending and not recommending treatment? If the answer is yes (i.e. the CI crosses the threshold), rate down for imprecision irrespective of where the point estimate and CI lie. (see Example 1)

2. If the threshold is not crossed, are the criteria for an optimal information size met (see the note on the OIS and Example 3)? Or, is the event rate very low and the sample size very large (at least 2000, and perhaps 4000, patients)? (see the Exception note)

3. If neither criterion is met, rate down for imprecision.

While confidence intervals mostly capture the extent of imprecision, they can be misleading in certain circumstances because of fragility. Specifically, CIs may appear robust, but small numbers of events may render the results fragile. Confidence intervals assume all patients are at the same risk (i.e. that there is prognostic balance), an assumption that is false. Randomization will ameliorate the problem by balancing prognostic factors between intervention and control groups, but we can be confident that prognostic balance has been achieved only if sample sizes are large. Large treatment effects in the presence of small sample sizes, even in RCTs, may be due to prognostic imbalance and warrant caution.

Early trials addressing a particular question will, particularly if small, substantially overestimate the treatment effect. A systematic review of these trials will then also generate an overestimated treatment effect. Examples of meta-analyses generating apparent beneficial or harmful effects refuted by subsequent larger trials include magnesium for mortality reduction after myocardial infarction, angiotensin-converting enzyme inhibitors for reducing the incidence of diabetes, nitrates for mortality reduction in myocardial infarction, and aspirin for reduction of pregnancy-induced hypertension. A similar circumstance occurs when trials are stopped early for benefit (i.e. prior to reaching the total number of events, or the sample size, calculated for an adequately powered trial). Simulation studies and empirical evidence suggest that trials stopped early overestimate treatment effects (see Example 4). When a treatment effect is overestimated, the CI around the effect may falsely appear to meet the clinical decision threshold criterion by indicating adequate precision. Therefore, the clinical decision threshold criterion is not, on its own, sufficient to deal with issues of precision, and the second, OIS criterion is required.
Note: The Optimal Information Size (OIS)
In order to address the vulnerability of confidence intervals as a criterion for adequate precision, we suggest the "optimal information size" as a second, necessary criterion to consider. The OIS is applied as a rule according to the following:

● If the total number of patients included in a systematic review is less than the number of patients generated by a conventional sample size calculation for a single adequately powered trial, consider rating down for imprecision.

Many online calculators for sample size calculation are available. A simple one can be found at http://www.stat.ubc.ca/rollin/stats/ssize/b2.html. As an alternative to calculating the OIS, guideline developers can also consult figures that show the relationship between the sample size required, or the number of events needed, and the effect size. See Example 2 for a demonstration of how these figures can be used.
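As a rough substitute for the online calculator mentioned above, the OIS can be computed directly with the conventional two-proportion sample size formula. The sketch below (Python; one common formula, so results may differ slightly from published figures) assumes a two-sided α of 0.05 and 80% power.

    import math
    from statistics import NormalDist

    def ois(control_risk, rrr, alpha=0.05, power=0.80):
        """Total patients for a two-arm trial, conventional two-proportion formula."""
        p1 = control_risk
        p2 = control_risk * (1 - rrr)
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
        z_beta = NormalDist().inv_cdf(power)
        n_per_group = ((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2
        return 2 * math.ceil(n_per_group)

    # Control group risk 0.20 with an RRR of 25%: about 1,800 patients with this formula,
    # of the same order as the approximately 2,000 read off the figure in Example 2.
    print(ois(0.20, 0.25))

    # Control group risk 0.01 with an RRR of 25%: tens of thousands of patients,
    # close to the 43,586 quoted in Example 6.
    print(ois(0.01, 0.25))

If the total number of patients in the meta-analysis falls short of the number such a calculation returns, rating down for imprecision should be considered, as stated in the rule above.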
Exception: Low event rates with a large sample size, an exception to the need for the OIS
When event rates are low, CIs around relative effects may be wide, but if sample sizes are sufficiently large, it is likely that prognostic balance has indeed been achieved and CIs around absolute effects may be narrow. Under such circumstances, the judgment about precision may be based on the CI around the absolute effect, and one may not downgrade the quality of evidence for imprecision. (see Examples 5 and 6)
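Before the worked examples that follow, the arithmetic that converts an absolute risk reduction (ARR) into a number needed to treat (NNT) and checks it against a decision threshold is worth setting out. The minimal sketch below (Python) uses the same illustrative numbers that appear in Example 1.

    # Absolute risk reduction and its 95% CI, expressed as proportions (Example 1 below).
    arr, arr_lo, arr_hi = 0.013, 0.006, 0.020

    # The NNT is the reciprocal of the ARR; the CI limits swap when inverted.
    nnt, nnt_upper, nnt_lower = 1 / arr, 1 / arr_lo, 1 / arr_hi
    print(f"NNT = {nnt:.0f} (95% CI {nnt_lower:.0f} to {nnt_upper:.0f})")

    def crosses(threshold):
        """True if the CI includes absolute effects smaller than the threshold."""
        return arr_lo < threshold

    print("crosses a 0.5% threshold:", crosses(0.005))   # False: precision adequate
    print("crosses a 1.0% threshold:", crosses(0.010))   # True: rate down for imprecision

The same reciprocal relationship underlies the NNT of 77 (95% CI 50 to 167) discussed in Example 1.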
5.3. Factors that can increase the quality of the Example 1: Setting clinical decision thresholds to determine imprecision in guidelines
evidence Refer to the figure below. A hypothetical systematic review of randomized control trials of an
5.3.1 Large magnitude of an effect intervention to prevent major strokes yields a point estimate of the absolute reduction in strokes of 1.3%,
5.3.2. Dose­response gradient with a 95% CI of 0.6% to 2.0%. This translates to a number needed to treat (NNT) of 77 (100÷1.3)
5.3.3. Effect of plausible residual confounding patients for a year to prevent a single stroke. The 95% CI around the NNT is 50 to 167. Therefore, while
5.4 Overall quality of evidence 77 is our best estimate, we may need to treat as few as 50 or as many as 167 people to prevent a single
6. Going from evidence to recommendations stroke.
6.1 Recommendations and their strength [INSERT IMAGE]
6.1.1 Strong recommendation
6.1.2 Weak recommendation If we consider that the intervention is a drug with no serious adverse effects, minimal inconvenience,
6.1.3 Recommendations to use interventions and modest cost, we may set a threshold for an absolute reduction in strokes of 0.5%, or NNT=200
(green line in the figure above), as even this small effect would warrant a recommendation. The entire
only in research
CI (0.6% to 2.0%) lies to the left of the 0.5% threshold and, therefore, excludes any benefit smaller than
6.1.4 No recommendation the threshold. We can conclude that the precision of the evidence is sufficient to support a
6.2 Factors determining direction and strength of recommendation and do not rate down the quality of evidence for imprecision.
recommendations
6.2.1 Balance of desirable and undesirable On the other hand, if the drug is associated with serious toxicity, we may be reluctant to make a
consequences recommendation unless the absolute stroke reduction is at least 1%, or NNT=100 (red line in the figure
6.2.1.1 Estimates of the magnitude of the above). Under these circumstances, the precision is insufficient as the CI encompasses treatment effects
desirable and undesirable effects smaller than this threshold (i.e. as small as 0.6%). A recommendation in favour of the intervention would
still be appropriate as the point estimate of 1.3% meets the threshold, but we would rate down the quality
6.2.1.2 Best estimates of values and
of evidence supporting the recommendation by one level for imprecision (e.g. from high to moderate).
preferences
6.3.2 Confidence in best estimates of Example 2: Using figures to determine Optimal Information Size
magnitude of effects (quality of evidence) As an alternative to calculating the OIS, review and guideline authors can also consult a figure to
6.3.3 Confidence in values and preferences determine the OIS. The figure below presents the required sample size (assuming α of 0.05, and β of
6.3.4 Resource use (cost) 0.2) for RRR of 20%, 25%, and 30% across varying control group risks. For example, if the best estimate
6.3.4.1 Differences between costs and of control group risk was 0.2 and one specifies an RRR of 25%, the OIS is approximately 2000 patients.
other outcomes
[INSERT IMAGE]
6.3.4.2 Perspective
6.3.4.3 Resource implications considered Power is, however, more closely related to number of events than to sample size. The figure below
6.3.4.4 Confidence in the estimates of presents the same relationships using total number of events across all studies in both treatment and
6.3.4.4 Confidence in the estimates of presents the same relationships using total number of events across all studies in both treatment and
resource use (quality of the evidence about control groups instead of total number of patients. Using the same choices of a control group risk of 0.2
cost) and RRR 25%, one requires approximately 325 events to meet OIS criteria.
6.3.4.5 Presentation of resource use [INSERT IMAGE]
6.3.4.6 Economic model
6.3.4.7 Consideration of resource use in Note: Choice of Relative Risk Reduction
recommendations We have suggested using RRRs of 20% to 30% for calculating OIS. The choice of RRR is a matter of
6.4 Presentation of recommendations judgment, and there may be instances in which compelling prior information would suggest choosing a
6.4.1 Wording of recommandations smaller or larger value for the RRR for the OIS calculation.
6.3.2 Symbolic representation Example 3: Applying the OIS Criterion
6.4.3 Providing transparent statements about
assumed values and preferences A systematic review of flavonoids for treatment of hemorrhoids examined the outcome of failure to
6.5 The Evidence­to­Decision framework achieve an important symptom reduction. In calculating the OIS, the authors chose a conservative α of
7. The GRADE approach for diagnostic tests and 0.01 and RRR of 20%, a β of 0.2, and a control group risk of 50%. The calculated OIS was marginally
larger than the total sample size included (1194 vs. 1102 patients).
strategies
7.1. Questions about diagnostic tests A more dramatic example comes from a systematic review and meta­analysis of fluoroquinolone
7.1.1. Establishing the purpose of a test prophylaxis for patients with neutropenia. Only one of eight studies that contributed to the meta­analysis
7.1.2. Establishing the role of a test met conventional criteria for statistical significance, but the pooled estimate suggested an impressive and
7.1.3. Clear clinical questions robust reduction in infection­related mortality with prophylaxis (RR: 0.38; 95% CI: 0.21, 0.69). The total
7.2. Gold standard and reference test number of events was only 69 and the total number of patients 1022. Considering the control group risk
of 6.9% and setting α of 0.05, β of 0.02, and RRR of 25% results in an OIS of 6400 patients. This meta­
1. Overview of the GRADE Approach analysis fails to meet OIS criteria, and rating down for imprecision may be warranted.
Example 4: Stopping trials early may result in overestimated treatment effects and incorrect judgements about precision

Consider a randomized trial of β blockers in 112 patients undergoing surgery for peripheral vascular disease that fulfilled preplanned O'Brien-Fleming criteria for early stopping. Of 59 patients given bisoprolol, 2 suffered a death or nonfatal myocardial infarction, as did 18 of 53 control patients. Despite a total of only 20 events, the 95% CI around the RR (0.02 to 0.41) excludes all but a large treatment effect. The CI suggests that the smallest plausible effect is a 59% RRR. A recommendation to administer treatment based on this result would be deemed to have adequate precision.

However, there are reasons to doubt the estimate of the magnitude of effect from this trial. First, it is much larger than what we might expect on the basis of β-blocker effects in a wide variety of other situations. Second, the study was terminated early on the basis of the large effect. Third, we have a sense of the fragility of these results: concluding that an RRR of less than 59% is implausible on the basis of only 20 events violates common sense. If one moves just five events from the control to the intervention group, the results lose their statistical significance, and the new point estimate (an RRR of 52%) is outside of the original CI.
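The fragility argument can be checked directly from the raw counts. The following minimal sketch, using nothing more than two-by-two arithmetic on the numbers reported above, recomputes the risk ratio after five events are moved from the control group to the bisoprolol group.

    def risk_ratio(events_int, n_int, events_ctl, n_ctl):
        """Risk ratio of the intervention group relative to the control group."""
        return (events_int / n_int) / (events_ctl / n_ctl)

    # As reported: 2/59 events with bisoprolol vs. 18/53 with control
    print(round(risk_ratio(2, 59, 18, 53), 2))    # about 0.10, an RRR of roughly 90%

    # After moving five events from control to intervention: 7/59 vs. 13/53
    print(round(risk_ratio(7, 59, 13, 53), 2))    # about 0.48, an RRR of roughly 52%

The recomputed point estimate of 0.48 lies above the upper limit (0.41) of the original CI, which is what makes the original interval look implausibly narrow.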
Example 5: Focusing on absolute effects when event rates are low and sample size is large

A systematic review of seven randomized trials of angioplasty versus carotid endarterectomy for cerebrovascular disease found that a total of 16 of 1482 (1.1%) patients receiving angioplasty died, as did 19 of 1465 (1.3%) undergoing endarterectomy. Looking at the 95% CI (0.43, 1.66) around the point estimate of the RR (0.85), the results are consistent with substantial benefit and substantial harm, suggesting the need to rate down for imprecision.

The absolute difference, however, suggests a different conclusion. The absolute difference in death rates between the two procedures is very small (0.2%, with a 95% CI ranging from -0.5% to 1.0%). Setting a clinical decision threshold boundary of 1% absolute difference (the smallest difference important to patients), the results of the systematic review exclude a difference favoring either procedure. If one accepted this clinical decision threshold as appropriate, one would not rate down for imprecision. One could argue that a difference of less than 1% could be important to patients: if so, one would rate down for imprecision, even after considering the CI around the absolute difference, as the CI would cross that threshold.
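The absolute difference and its CI can be recovered from the raw counts. The sketch below assumes a simple normal approximation for the difference between two proportions; it reproduces the 0.2% difference, with confidence limits of about -0.6% and 1.0% (the handbook quotes -0.5% to 1.0%; the small discrepancy reflects the approximation used).

    from math import sqrt

    def risk_difference_ci(e1, n1, e2, n2, z=1.96):
        """Risk difference (group 1 minus group 2) with a normal-approximation 95% CI."""
        p1, p2 = e1 / n1, e2 / n2
        diff = p1 - p2
        se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        return diff, diff - z * se, diff + z * se

    # Deaths: 19/1465 with endarterectomy vs. 16/1482 with angioplasty
    diff, lo, hi = risk_difference_ci(19, 1465, 16, 1482)
    print(f"{diff:.1%} (95% CI {lo:.1%} to {hi:.1%})")   # 0.2% (95% CI -0.6% to 1.0%)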
Example 6: No need to rate down for imprecision when sample sizes are very large

A meta-analysis of randomized trials of β blockade for preventing cardiovascular events in patients undergoing non-cardiac surgery suggested a doubling of the risk of strokes with β blockers (RR: 2.22; 95% CI: 1.39, 3.56). Most trials in this meta-analysis do not suffer from important limitations, the evidence is direct and consistent, and publication bias is undetected. Given the lower boundary of the CI (an increase in RR of 39%), the threshold for adequate precision would not be crossed if one believed that most patients would be reluctant to use β blockers with an increase in RR of stroke of 39%.

The total number of events (75), however, appears insufficient, an inference that is confirmed with an OIS calculation (α 0.05, β 0.20, using the β-blocker group's 1% event rate as the control, and Δ 0.25, total sample size 43586 in comparison to the 10889 patients actually enrolled). The guidelines for calculating precision we have suggested would, therefore, mandate rating down quality for imprecision. With a sample size of over 5000 patients per group, however, it is very likely that randomization has succeeded in creating prognostic balance. If that is true, β blockers really do increase the risk of stroke. Not rating down for imprecision in this situation is therefore appropriate. Preliminary information suggests that for a low baseline risk (<5%) one will be safe with regard to prognostic balance with a total of 4000 patients (2000 patients per group). Availability of this number of patients would mandate not rating down for imprecision despite not meeting the OIS criterion.

For continuous outcomes

Considerations of rating down quality because of imprecision for continuous variables follow the same logic as for binary variables. The process begins by rating down the quality for imprecision if a recommendation would be altered if the lower versus the upper boundary of the CI represented the true underlying effect. If the CI does not cross this threshold, but the evidence fails to meet the OIS criterion, guideline authors should consider rating down the quality of evidence for imprecision. In this instance, judging the OIS criterion will require a sample size calculation for the continuous variable.
In the context of a guideline, the decision-making threshold for an acceptable estimate of treatment effect requires consideration of the full context of the recommendation, including other outcomes such as all the potential benefits and important adverse effects (see Example 7).

Example 7: Considering the full context of a recommendation

A systematic review suggests that corticosteroid administration decreases the length of hospital stay in patients with exacerbations of chronic obstructive pulmonary disease (COPD) by 1.42 days (95% CI: 0.65, 2.2). The lower boundary of the CI is 0.65 days, a rather small effect size that may not be considered important to patients.

As it turns out, steroids also reduce the likelihood of treatment failure (variably defined) during inpatient or outpatient follow-up (RR: 0.54; 95% CI: 0.41, 0.71). The best estimate of the likelihood of symptomatic deterioration in those not treated with steroids is approximately 30%. By administering steroids to these patients, the risk is reduced from 30% to 16% (30 - [0.54 x 30]), a difference of 14%, and the effect is unlikely to be less than 9% (30 - [0.71 x 30]).
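These absolute figures follow from applying the relative risk, and its upper confidence limit, to the assumed 30% baseline risk; a minimal worked sketch:

    baseline_risk = 0.30          # assumed risk of symptomatic deterioration without steroids

    for rr in (0.54, 0.71):       # point estimate and upper confidence limit of the RR
        treated_risk = baseline_risk * rr
        reduction = baseline_risk - treated_risk
        print(f"RR {rr}: risk on steroids {treated_risk:.0%}, absolute reduction {reduction:.0%}")
    # RR 0.54: risk on steroids 16%, absolute reduction 14%
    # RR 0.71: risk on steroids 21%, absolute reduction 9%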
Adverse effects were poorly reported in the studies. The only consistently reported problem was hyperglycemia, which was increased almost sixfold, representing an absolute increase of 15% to 20%. The extent to which this hyperglycemia had consequences important to patients is uncertain. One possible conclusion from this information is that, given the magnitude of reduction in deterioration and the lack of evidence suggesting important adverse effects, a benefit of even 0.65 days of reduced average hospitalization would warrant steroid administration. If this were the conclusion, the CI (0.65, 2.2) would not cross the decision-making threshold and the guideline panel would proceed to consider whether the evidence meets the OIS criterion.


5.2.4.2 Imprecision in systematic reviews

For authors of systematic reviews:

Quality of evidence refers to one's confidence in the estimates of effect. In systematic reviews each outcome is considered separately.

Authors of systematic reviews should not rate down quality due to imprecision on the basis of the trade-off between desirable and undesirable consequences; it is not their job to make value and preference judgments. Therefore, in judging precision, they should not focus on the threshold that represents the basis for a management decision. Rather, they should consider the optimal information size to make judgements.

For dichotomous outcomes

We suggest that authors of systematic reviews consider the following steps in deciding whether to rate down the quality of evidence for imprecision (a simple decision sketch follows the note below):

1. If the optimal information size criterion is not met, rate down for imprecision, unless the sample size is very large (at least 2000, and perhaps 4000 patients).

2. If the OIS criterion is met and the 95% CI excludes no effect (i.e. the CI around the RR excludes 1.0), do not rate down for imprecision.

3. If the OIS criterion is met, and the 95% CI overlaps no effect (i.e. the CI includes an RR of 1.0), rate down for imprecision if the CI fails to exclude important benefit or important harm (see Example 8).

Note:

To be of optimal use to guideline developers, a systematic review may still point out what thresholds of benefit would mandate rating down for imprecision.
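The three steps can be read as a simple decision rule. The sketch below is one way of making that logic explicit; the flag names are illustrative, and the function is an aid to thinking rather than an official GRADE tool.

    def rate_down_for_imprecision(ois_met, very_large_sample, ci_excludes_null,
                                  ci_excludes_appreciable_benefit_and_harm):
        """Return True when the steps above suggest rating down for imprecision."""
        if not ois_met:
            # Step 1: OIS not met - rate down unless the sample size is very large
            # (at least 2000, and perhaps 4000, patients).
            return not very_large_sample
        if ci_excludes_null:
            # Step 2: OIS met and the 95% CI excludes no effect - do not rate down.
            return False
        # Step 3: OIS met but the CI overlaps no effect - rate down if the CI fails
        # to exclude appreciable benefit or appreciable harm.
        return not ci_excludes_appreciable_benefit_and_harm

For the total-mortality data in Example 8 below (OIS met, a CI that includes an RR of 1.0 and does not exclude a 25% increase in RR), the function would return True, consistent with the reluctance to call that estimate precise.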
Example 8: Meeting the OIS threshold may not ensure precision

Although satisfying the OIS threshold in the presence of a CI excluding no effect indicates adequate precision, the same is not true when the CI fails to exclude no effect.

Consider the systematic review of β blockers in non-cardiac surgery previously introduced in Example 6 above. For total mortality, with 295 deaths and a total sample size of over 10000, the point estimate and 95% CI for the RR with β blockers were 1.24 (95% CI: 0.99, 1.56). Despite the large sample size and number of events, one might be reluctant to conclude precision is adequate when a small reduction in mortality with β blockers, as well as an increase of 56%, remain plausible. This suggests that when the OIS criteria are met and the CI includes the null effect, systematic review authors should consider whether the CI includes appreciable benefit or harm.

Authors should use their judgment in deciding what constitutes appreciable benefit and harm and provide a rationale for their choice. If reviewers fail to find a compelling rationale for a threshold, our suggested default threshold for appreciable benefit and harm that warrants rating down is an RRR or RR increase of 25% or more.
For continuous outcomes

Review authors can calculate the OIS for continuous variables in exactly the same way they can for binary variables, by specifying the α and β error thresholds (we have suggested 0.05 and 0.2) and the Δ, and choosing an appropriate population standard deviation based on one of the relevant studies.

Whether you will rate down for imprecision depends on the choice of the difference (Δ) you wish to detect and the resulting sample size required. Again, the merit of the GRADE approach is not that it ensures agreement between reasonable individuals, but that the judgements being made are explicit.

Example 9: Judgements about imprecision depend on the choice of difference to detect

Consider the systematic review previously introduced in Example 7 above, which suggests that corticosteroid administration decreases the length of hospital stay in patients with exacerbations of chronic obstructive pulmonary disease (COPD) by 1.42 days (95% CI: 0.65, 2.2).

Choosing a Δ of 1.0 (implying a judgment that reductions in stay of more than a day are important) and using the standard deviations associated with hospital stay in the relevant studies (3.4, 4.5, and 4.9 days) yields corresponding required total sample sizes of 364, 636, and 754. The 602 patients available for this analysis do not therefore meet the OIS criterion, and one would consider rating down for imprecision.

Had we chosen a smaller difference (e.g. 0.5 days) that we wished to detect, the sample size of the studies would have been unequivocally insufficient. Had we chosen a larger value (e.g. 1.5 days) the sample size of 602 would have met the OIS criterion.
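The sample sizes quoted in Example 9 can be reproduced with the standard normal-approximation formula for comparing two means. The sketch below assumes the usual α of 0.05 and β of 0.20 and rounds up within each group; with Δ = 1.0 day and standard deviations of 3.4, 4.5, and 4.9 days it returns total sample sizes of 364, 636, and 754.

    from math import ceil
    from scipy.stats import norm

    def ois_continuous(delta, sd, alpha=0.05, power=0.80):
        """Total sample size for detecting a difference in means of `delta`,
        assuming a common standard deviation `sd`."""
        z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
        n_per_group = 2 * (z * sd / delta) ** 2
        return 2 * ceil(n_per_group)

    for sd in (3.4, 4.5, 4.9):
        print(sd, ois_continuous(delta=1.0, sd=sd))   # 364, 636, 754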
Note: Outcomes reported as a standardized mean difference

A particular challenge in calculating the OIS for continuous variables arises when studies have used different instruments to measure a construct, and the pooled estimate is calculated using a standardized mean difference. Systematic review and guideline authors will most often face this situation when dealing with patient-reported outcomes, such as quality of life. In this context, we suggest authors choose one of the available instruments (ideally, one for which an estimate of the minimally important difference is available) and calculate an OIS using that instrument.

Because it may give false reassurance, we hesitate to offer a rule-of-thumb threshold for the absolute number of patients required for adequate precision for continuous variables. For example, using the usual standards of α (0.05) and β (0.20), an effect size of 0.2 standard deviations, representing a small effect, requires a total sample size of approximately 400 (200 per group), a sample size that may not be sufficient to ensure prognostic balance.

Nonetheless, whenever sample sizes are less than 400, review authors and guideline developers should certainly consider rating down for imprecision. In the future, statistical simulations may provide the basis for a robust rule of thumb for continuous outcomes. The limitations of an arbitrary threshold sample size suggest the advisability of addressing precision by calculating the relevant OIS for each continuous variable.
5.2.4.3 Rating down two levels for imprecision

When there are very few events and the CIs around both the relative and absolute estimates of effect include both appreciable benefit and appreciable harm, systematic reviewers and guideline developers should consider rating down the quality of evidence by two levels.

Example 10: Rating down for imprecision by two levels

A systematic review of the use of probiotics for induction of remission in Crohn's disease found a single randomized trial that included 11 patients. Four of five patients in the treatment group achieved remission, as did five of six patients in the control group. The point estimate of the risk ratio (0.96) suggests no difference, but the CI includes a reduction in the likelihood of remission of almost half, or an increase in the likelihood of over 50% (95% CI: 0.56, 1.69). As there are few events and the CI includes appreciable benefit and harm, one would rate down the quality of evidence by two levels for imprecision.
5.2.5 Publication bias

Publication bias is a systematic under-estimation or over-estimation of the underlying beneficial or harmful effect due to the selective publication of studies. Confidence in the combined estimates of effects from a systematic review can be reduced when publication bias is suspected, even when the included studies themselves have a low risk of bias.

Note:

Some systems for assessing the quality of a body of evidence use the term "reporting bias" with 2 subcategories: selective outcome reporting and publication bias. However, GRADE considers selective outcome reporting under risk of bias (study limitations), since it can be addressed in single studies. In contrast, when an entire study remains unpublished (unreported), one can assess the likelihood of publication bias only by looking at a group of studies. Currently, GRADE follows the Cochrane Collaboration's approach and considers selective outcome reporting as an issue in risk of bias in individual studies (Cochrane Handbook. Chapter 8.5 The Cochrane Collaboration's tool for assessing risk of bias).
Empirical evidence suggests that studies reporting statistically significant findings are more likely to be accepted for publication than those reporting statistically non-significant findings ("negative studies"). Publication bias arises when entire studies go unreported. Failure to identify studies is typically a result of studies either remaining unpublished or being published obscurely (e.g. in journals with limited circulation not indexed by major databases, or as conference abstracts or theses); thus, methodologists have labeled the phenomenon "publication bias." Authors of systematic reviews may fail to identify studies that are unpublished or that have been published in a non-indexed, limited-circulation journal or in the grey literature, even if they employ the most rigorous search techniques. If rigorous search techniques are not implemented, it is difficult to make a judgement about publication bias, since studies might remain unidentified either because of publication bias or because of insufficient effort to identify them.

The risk of publication bias may be higher for systematic reviews of observational studies than for reviews of RCTs, especially when observational studies are conducted using routinely collected data from patient registries or medical records. In these instances, it is difficult for the reviewer to know whether the observational studies that appear in the literature represent all, or only a fraction (usually those that showed "interesting" results), of the studies conducted.
Table 5.8: Possible sources of publication bias throughout the publication process (phase of research publication, and actions contributing to or resulting in bias)

Preliminary and pilot studies: Small studies, more likely to be "negative" (e.g. those with discarded or failed hypotheses), remain unpublished; companies classify some as proprietary information.

Report completion: Authors decide that reporting a "negative" study is uninteresting and do not invest the time and effort required for submission.

Journal selection: Authors decide to submit the "negative" report to a non-indexed, non-English, or limited-circulation journal.

Editorial consideration: The editor decides that the "negative" study does not warrant peer review and rejects the manuscript.

Peer review: Peer reviewers conclude that the "negative" study does not contribute to the field and recommend rejecting the manuscript. The author gives up or moves to a lower-impact journal. Publication is delayed.

Author revision and resubmission: The author of the rejected manuscript decides to forgo submission of the "negative" study or to submit it again at a later time to another journal (see "journal selection" above).

Report publication: The journal delays the publication of the "negative" study. Proprietary interests lead to the report being submitted to, and accepted by, different journals.
Studies with small sample sizes are more likely to remain unpublished or ignored. Discrepancies between the results of meta-analyses of small studies and subsequent large trials may occur as often as 20% of the time, and publication bias may be a major contributor to such discrepancies. Therefore, one should suspect publication bias when published evidence is limited to a small number of small trials. This is especially true if many of these small studies show benefits of the intervention.

Methods to detect the possibility of publication bias in systematic reviews include visual inspection of funnel plots and statistical tests for their asymmetry (Cochrane Handbook. Chapter 10.4 Detecting reporting biases). Empirical examination of patterns of results may suggest publication bias if results are asymmetrical about the summary estimate of effect. This can be determined either through visual inspection of a funnel plot (shown below) or from a positive result of a statistical test for asymmetry. As a rule of thumb, funnel plots and statistical tests for asymmetry should be used to detect publication bias if there are at least 10 studies included in the meta-analysis (some say at least 5 studies).
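As an illustration of the visual check described above, the following sketch draws a basic funnel plot with matplotlib from a list of study effect estimates and their standard errors. The study values are hypothetical placeholders, not data from any review discussed in this handbook; in an unbiased body of evidence the points should scatter roughly symmetrically around the pooled estimate, widening toward the bottom where standard errors are larger.

    import matplotlib.pyplot as plt
    import numpy as np

    # Hypothetical log risk ratios and standard errors for ten studies (illustration only)
    log_rr = np.array([-0.35, -0.20, -0.45, -0.10, -0.55, -0.30, -0.60, -0.05, -0.40, -0.25])
    se = np.array([0.10, 0.15, 0.20, 0.25, 0.30, 0.12, 0.35, 0.28, 0.18, 0.22])

    pooled = np.average(log_rr, weights=1 / se**2)    # inverse-variance pooled estimate

    plt.scatter(log_rr, se)
    plt.axvline(pooled, linestyle="--")
    plt.gca().invert_yaxis()                          # most precise studies at the top
    plt.xlabel("Log risk ratio")
    plt.ylabel("Standard error")
    plt.title("Funnel plot (hypothetical data)")
    plt.show()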
Another approach used to detect publication bias, referred to as the "trim and fill" method, is an extension of the funnel plot. The trim and fill technique begins by removing small "positive" studies that do not have a negative counterpart, leaving a symmetrical funnel plot. A new putative true effect is then calculated from the studies included in the new funnel plot. The next step is to add hypothetical studies that mirror the results of the positive studies in a way that retains the new pooled effect estimate. It is important to note that even if asymmetry is detected, it may not be the result of publication bias. For example, in smaller studies, over-estimates of effect may yield an asymmetric funnel plot that could be explained by limitations other than publication bias, such as a restrictive study population. To strengthen conclusions regarding publication bias, it is recommended that multiple tests be used.

Recursive cumulative meta-analysis, used to detect lag time bias, performs a meta-analysis at the end of each year, noting changes in effect estimates for each progressing year. If the apparent effects of an intervention continuously decrease, there is a strong indication of lag time bias.

Regardless of the test used, review authors and guideline developers should be aware that such tests can be prone to error and their results should be interpreted with caution. It is extremely difficult to be confident that publication bias is absent, and almost as difficult to place a threshold on when to rate down the quality of evidence due to a strong suspicion of publication bias. For this reason GRADE suggests rating down the quality of evidence for publication bias by a maximum of one level.
Example 1: Trials with positive findings (i.e. statistically significant differences) are more likely to be published than trials with negative or null findings

A systematic review assessed the extent to which publication of a cohort of clinical trials is influenced by the statistical significance, perceived importance, or direction of their results. It found five studies that investigated these associations in a cohort of registered clinical trials. Trials with positive findings were more likely to be published than trials with negative or null findings (odds ratio: 3.9; 95% CI: 2.7 to 5.7). This corresponds to a risk ratio of 1.8 (95% CI: 1.6 to 2.0), assuming that 41% of negative trials are published (the median among the included studies, range = 11% to 85%). In absolute terms, this means that if 41% of negative trials are published, we would expect that 73% of positive trials would be published. Two studies assessed time to publication and showed that trials with positive findings tended to be published after 4 to 5 years, compared with those with negative findings, which were published after 6 to 8 years. Three studies found no statistically significant association between sample size and publication. One study found no statistically significant association between funding mechanism, investigator rank, or sex and publication.
Systematic reviews performed early in the development of a body of research may be biased due to the tendency for positive results to be published sooner and for negative results to be published later or withheld. This is referred to as "lag bias" and is especially true of industry-funded studies.

Example 3: Reduced effect estimate in a systematic review as a result of negative studies not being published

An investigation examined 74 antidepressant trials, with a mean sample size of fewer than 200 patients, that had been submitted to the FDA. Of the 38 studies viewed as positive by the FDA, 37 were published. Of the 36 studies viewed as negative by the FDA, only 14 were published. Publication bias of this magnitude can seriously bias effect estimates.
Example 5: Funnel plots to detect publication bias

In A, the circles represent the point estimates of the trials. The pattern of distribution resembles an inverted funnel. Larger studies tend to be closer to the pooled estimate (the dashed line). In this case, the effect sizes of the smaller studies are more or less symmetrically distributed around the pooled estimate.

In B, publication bias is detected. This funnel plot shows that the smaller studies are not symmetrically distributed around either the point estimate (dominated by the larger trials) or the results of the larger trials themselves. The trials expected in the bottom right quadrant are missing. One possible explanation for this set of results is publication bias, resulting in an overestimate of the treatment effect relative to the underlying truth.
Example 6: Publication bias detected

A number of small trials from a systematic review of oxygen therapy in patients with chronic obstructive pulmonary disease showed that the intervention improved exercise capacity, but evaluation of the data suggested publication bias.

The funnel plot of exercise distance shows distance on the x-axis and variance on the y-axis. The red dots represent the mean differences of individual trial estimates and the dotted line the point estimate of the mean effect, indicating benefit from oxygen treatment. The distribution of these dots to the right of the dotted line suggests that there may be an equivalent number of 'negative' trials that have not been included in this analysis. Thus, one may downgrade the quality of evidence in this case due to uncertainty resulting from asymmetry in the pattern of results.
Example 8: Publication bias undetected
A systematic review of parenteral anticoagulation for prolonged survival in patients with cancer who had no other indication for anticoagulation shows five RCTs which are symmetrically distributed around the best estimate of effect. Publication bias is undetected in this scenario and thus the evidence should not be downgraded.
When to downgrade the quality of evidence because of suspicion of publication bias

Guideline panels and authors of systematic reviews should consider the extent to which they are uncertain about the magnitude of the effect due to selective publication of studies and they may downgrade the quality of evidence by one level. Consider:

● study design (experimental vs. observational)
● study size (small studies vs. large studies)
● lag bias (early publication of positive results)
● search strategy (was it comprehensive?)
● asymmetry in funnel plot.
5.3. Factors that can increase the quality of the evidence

Note: Consideration of factors reducing the quality of evidence must precede consideration of reasons for rating it up. Thus, the 5 factors for rating down the quality of evidence (risk of bias, imprecision, inconsistency, indirectness, and publication bias) must be rated prior to the 3 factors for rating it up (large effect, dose-response, and effects of residual confounding). The decision to rate up the quality of evidence should only be made when serious limitations in any of the 5 areas reducing the quality of evidence are absent.

The following sections discuss in detail the 3 factors that permit rating up the quality of evidence, i.e. increasing confidence in an estimate of an effect. Using the GRADE framework, a body of evidence from observational studies is initially classified as low quality evidence (i.e. permitting low confidence in the estimated effect). There are times, however, when we have high confidence in the estimates of effect from observational studies (including cohort, case-control, before-after, and time series studies) and from non-randomized experimental studies (e.g. quasi-randomized and non-randomized controlled trials). The circumstances under which a body of evidence from observational studies may provide higher than low confidence in the estimated effects will likely occur infrequently.

Note: Although it is theoretically possible to rate up results from randomized controlled trials, we have yet to find a compelling example of such an instance.
5.3.1 Large magnitude of an effect

When a body of evidence from observational studies that has not been downgraded for any of the 5 factors yields large or very large estimates of the magnitude of an intervention effect, we may be more confident about the results. In those situations, even though observational studies are likely to provide an overestimate of the true effect, the study design that is more prone to bias is unlikely to explain all of the apparent benefit (or harm). Decisions to rate up the quality of evidence because of large or very large effects (Table 5.9) should consider not only the point estimate but also the precision (width of the CI) around that effect: one should rarely and very cautiously rate up the quality of evidence because of apparent large effects if the CI overlaps substantially with effects smaller than the chosen threshold of clinical importance.
Table 5.9. Definitions of large and very large effect

Large effect: RR* >2 or <0.5 (based on direct evidence, with no plausible confounders); may increase the quality of evidence by 1 level.

Very large effect: RR* >5 or <0.2 (based on direct evidence with no serious problems with risk of bias or precision, i.e. with sufficiently narrow confidence intervals); may increase the quality of evidence by 2 levels.

* Note: these rules apply when the effect measure is expressed as a relative risk (RR) or hazard ratio (HR). They cannot always be applied when the effect measure is expressed as an odds ratio (OR). We suggest converting the OR to an RR and only then assessing the magnitude of an effect.
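A widely used approximation for this conversion (sometimes referred to as the Zhang and Yu method) derives the RR from the OR and the assumed risk in the reference (control) group: RR = OR / (1 - p0 + p0 x OR). The sketch below is a minimal illustration; applied to the publication-bias example earlier in this chapter (an OR of 3.9 with 41% of negative trials published), it reproduces the risk ratio of about 1.8 and the expected 73% publication rate for positive trials.

    def or_to_rr(odds_ratio, baseline_risk):
        """Approximate risk ratio from an odds ratio and the reference-group risk:
        RR = OR / (1 - p0 + p0 * OR)."""
        return odds_ratio / (1 - baseline_risk + baseline_risk * odds_ratio)

    # Publication-bias example above: OR 3.9, with 41% of "negative" trials published
    rr = or_to_rr(3.9, 0.41)
    print(round(rr, 2), round(rr * 0.41, 2))   # about 1.78 and 0.73 (i.e. 73% of positive trials published)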
One may be more likely to rate up the quality of evidence because of a large or very large magnitude of an effect when:

● the effect is rapid
● the effect is consistent across subjects
● the previous trajectory of the disease is reversed
● the large magnitude of effect is supported by indirect evidence

Note: When outcomes are subjective, it is important to be cautious when considering upgrading because of observed large effects. This is especially true when outcome assessors were aware of which group study subjects belonged to (i.e. were not blinded).

Examples

A systematic review of observational studies examining the relationship between infant sleeping position and sudden infant death syndrome (SIDS) found an odds ratio of 4.1 (95% CI: 3.1, 5.5) for SIDS occurring with front vs. back sleeping positions. Furthermore, "back to sleep" campaigns, started in the 1980s to encourage the back sleeping position, were associated with a relative decline in the incidence of SIDS of 50-70% in numerous countries.
5.3.2. Dose-response gradient

The presence of a dose-response gradient has long been recognized as an important criterion for believing a putative cause-effect relationship. The presence of a dose-response gradient may increase our confidence in the findings of observational studies and thereby increase the quality of evidence.

Example 1: Dose-response gradient (Upgraded by One Level)

The observation that, in patients receiving anticoagulation with warfarin, there is a dose-response gradient between higher levels of the international normalized ratio (INR), an indicator of the degree of anticoagulation, and an increased risk of bleeding increases our confidence that supratherapeutic anticoagulation levels increase bleeding risk.

Example 2: Dose-response gradient (Upgraded by One Level)

The dose-response gradient associated with the rapidity of antibiotic administration in patients presenting with sepsis and hypotension may also be a reason to upgrade the quality of evidence for such a study. There is a large absolute increase in mortality with each hour's delay of antibiotic administration. This dose-response relationship increases our confidence that the effect on mortality is real and substantial, leading to upgrading of the quality of the evidence.
5.3.3. Effect of plausible residual confounding

On occasion, all plausible residual confounding from observational studies may be working to reduce a demonstrated effect, or to increase an effect when no effect was observed.

Rigorous observational studies will accurately measure prognostic factors associated with the outcome of interest and will conduct an adjusted analysis that accounts for differences in the distribution of these factors between intervention and control groups. The reason that in most instances we consider observational studies as providing only low-quality evidence is that unmeasured or unknown determinants of outcome unaccounted for in the adjusted analysis are likely to be distributed unequally between intervention and control groups, a problem referred to as "residual confounding" or "residual bias."

On occasion, all plausible confounders (biases) unaccounted for in the adjusted analysis of a rigorous observational study (i.e. residual confounders) would result in an underestimate of an apparent treatment effect. If, for instance, only sicker patients receive an experimental intervention or exposure, yet they still fare better, it is likely that the actual intervention or exposure effect is even larger than the data suggest. A parallel situation exists when observational studies have failed to demonstrate an association.
Example 1: When confounding is expected to reduce a demonstrated effect (Upgraded by One Level)

A rigorous systematic review of observational studies including a total of 38 million patients demonstrated higher death rates in private for-profit versus private not-for-profit hospitals. It is likely, however, that patients in the not-for-profit hospitals were sicker than those in the for-profit hospitals. This would bias results against the not-for-profit hospitals. The second likely bias was the possibility that higher numbers of patients with excellent private insurance coverage could lead to a hospital having more resources and a spill-over effect that would benefit those without such coverage. Since for-profit hospitals are likely to admit a larger proportion of such well-insured patients than not-for-profit hospitals, the bias is once again against the not-for-profit hospitals. Because the plausible biases would all diminish the demonstrated intervention effect, one might consider the evidence from these observational studies as moderate rather than low quality.
Example 2: When confounding is expected to reduce a demonstrated effect (Upgraded by One Level)

In a systematic review investigating the use of condoms in homosexual male relationships as a way of preventing the spread of HIV, five observational studies were identified. The pooled estimate was a relative risk of 0.34 (95% CI: 0.21, 0.54) in favour of condom use. The authors failed to adjust in the analysis for the fact that condom users are more likely to have more partners than non-condom users. One would expect that more partners would have increased the risk of acquiring HIV and therefore reduced the resulting relative risk of HIV infection. Therefore, the confidence in this effect, which is still large, would lead to upgrading by one level.

Example 3: When confounding is expected to increase the effect but no effect was observed (Upgraded by One Level)

The hypoglycaemic drug phenformin causes lactic acidosis, and the related agent metformin is under suspicion for the same toxicity. Very large observational studies have failed to demonstrate an association between metformin and lactic acidosis. Given the likelihood that clinicians would have been more alert to lactic acidosis with metformin and would have therefore over-reported its occurrence, and that no association was found, one could upgrade this evidence.

Example 4: When confounding is expected to increase the effect but no effect was observed (Upgraded by One Level)

Consider the early reports associating MMR vaccination with autism. One would think that there would be over-reporting of autism in children given MMR vaccines. However, systematic reviews failed to demonstrate any association between the two. Because no association was found despite the potential presence of confounders that would increase the likelihood of reporting of autism, we may upgrade the quality of evidence by one level.
5.4 Overall quality of evidence

The overall quality of evidence is a combined rating of the quality of evidence across all outcomes considered critical for answering a health care question (i.e. making a decision or a recommendation). We caution against a mechanistic approach toward the application of the criteria for rating the quality of the evidence up or down. Although GRADE suggests the initial separate consideration of five categories of reasons for rating down the quality of evidence, and three categories for rating it up, with a yes/no decision regarding rating up or down in each case, the final rating of overall evidence quality occurs in a continuum of confidence in the estimates of effects.

For authors of systematic reviews:

Authors of systematic reviews do not grade the overall quality of evidence across outcomes. Because systematic reviews do not – or at least should not – make recommendations, authors of systematic reviews rate the quality of evidence for each outcome separately.

For guideline panels and others making recommendations:

Guideline panels have to determine the overall quality of evidence across all the critical outcomes essential to a recommendation they make. Guideline panels provide a single grade of quality of evidence for every recommendation, but the strength of a recommendation usually depends on evidence regarding not just one, but a number of patient-important outcomes, and on the quality of evidence for each of these outcomes.

Because the GRADE approach rates quality of evidence separately for each outcome, it is frequently the case that quality differs across outcomes. When determining the overall quality of evidence across outcomes (a minimal sketch of this rule follows the list):

1. Consider only those outcomes that have been deemed critical.

2. If the quality of evidence is the same for all critical outcomes, then this becomes the overall quality of the evidence supporting the answer to the question.

3. If the quality of evidence differs across critical outcomes, it is logical that the overall confidence in effect estimates cannot be higher than the lowest confidence in effect estimates for any outcome that is critical for a decision. Therefore, the lowest quality of evidence for any of the critical outcomes determines the overall quality of evidence.
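Expressed programmatically, the rule in point 3 is simply the minimum across the critical outcomes. The following is a minimal sketch; the numeric encoding of the rating scale and the outcome names are illustrative only.

    RATING = {"high": 4, "moderate": 3, "low": 2, "very low": 1}

    def overall_quality(quality_by_critical_outcome):
        """Overall quality of evidence = lowest quality among the critical outcomes."""
        return min(quality_by_critical_outcome.values(), key=lambda q: RATING[q])

    print(overall_quality({"mortality": "moderate", "stroke": "high", "adverse effects": "high"}))
    # -> moderate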
Example 1: Rating overall quality of evidence based on the importance of outcomes

Several systematic reviews of high-quality randomised trials suggest a decrease in the incidence of infections and, likely, the mortality of ventilated patients in intensive care units receiving selective digestive decontamination (SDD). The quality of evidence on the effect of SDD on the emergence of bacterial antibiotic resistance and its clinical relevance is much less clear. One might reasonably grade the evidence about this feared potential adverse effect as low quality. If those making a recommendation felt that these downsides of therapy were critical, the overall grade of the quality of evidence for SDD would be low. If the guideline panel felt that the emergence of bacterial antibiotic resistance was important but not critical, the overall quality of evidence would be graded as high.

However, which outcomes are critical may depend on the evidence. On occasion, the overall confidence in effect estimates may not come from the outcomes judged critical at the beginning of the guideline development process – judgments about which outcomes are critical to the decision (recommendation) may change when considering the results. Note that such judgments require careful consideration and are probably rare.
There are 2 prototypical situations in which an outcome initially considered critical may cease to be critical once the evidence is summarized:

1. An outcome turns out to be not relevant (e.g. a particular adverse event may be considered critical at the outset of the guideline process but, if it turns out that the event occurs very infrequently, the final decision may be that this adverse effect is important but not critical to the recommendation).

2. An outcome turns out to be not necessary if, across the range of possible effects of the intervention on that outcome, the recommendation and its strength would remain unchanged. If there is higher quality of evidence for some critical outcomes to support a decision, then one need not rate down the quality of evidence because of lower confidence in estimates of effects on other critical outcomes that support the same recommendation.

For instance, consider the following question: should statins vs. no statins be used in individuals without documented coronary heart disease but at high risk of cardiovascular events? Guideline developers are likely to start the process by considering the outcomes death from cardiovascular causes, myocardial infarction, stroke, and adverse effects as critical to the decision.

A systematic review of randomized trials demonstrated consistent reductions in myocardial infarctions and stroke but nonsignificant reductions in coronary deaths. Serious adverse effects were unusual and readily reversible with drug discontinuation. The guideline authors found that for three of the four outcomes (myocardial infarction, stroke, and adverse effects) there was high quality evidence. For coronary deaths the evidence was of moderate quality because of imprecision.

Should the overall quality of evidence across outcomes be high or moderate? The judgments made at the beginning of the process suggest that the answer is "moderate". However, once it is established that the risk of myocardial infarction and stroke decreases with statins, most people would find compelling reason to use statins. Knowing whether coronary mortality also decreases is no longer necessary for the decision (as long as it is very unlikely that it increases). Considering this, the overall rating of quality of evidence is most appropriately designated as "high".
6. Going from evidence to recommendations

6.1 Recommendations and their strength
The strength of a recommendation reflects the extent to which a guideline panel is confident that the desirable effects of an intervention outweigh undesirable effects, or vice versa, across the range of patients for whom the recommendation is intended.

GRADE specifies two categories of the strength of a recommendation. While GRADE suggests using the terms strong and weak recommendations, those making recommendations may choose different wording to characterize the two categories of strength.

In special cases, guideline panels may recommend that an intervention be used only in research until more data are generated that would allow a more comprehensive recommendation, or they may not make a recommendation at all.

There are limitations to formal grading of recommendations. Like the quality of evidence, the balance between desirable and undesirable effects reflects a continuum. Some arbitrariness will therefore be associated with placing particular recommendations in categories such as “strong” and “weak.” Most organisations producing guidelines have decided that the merits of an explicit grade of recommendation outweigh the disadvantages.

For a guideline panel or others making recommendations to offer a strong recommendation, they have to be certain about the various factors that influence the strength of a recommendation. The panel also should have the relevant information at hand that supports a clear balance towards either the desirable effects of an intervention (to recommend an action) or the undesirable effects (to recommend against an action).

When a guideline panel is uncertain whether the balance is clear, or when the relevant information about the various factors that influence the strength of a recommendation is not available, the panel should be more cautious and in most instances would opt to make a weak recommendation.

Figure 3: Balance scales to depict strong vs. weak recommendations.
To aid interpretation, GRADE suggests implications of strong and weak recommendations that follow from the recommendations. The advantage of two categories of strength of recommendations is that they provide clear direction to patients, clinicians, and policy-makers.

Table 6.1. Implications of strong and weak recommendations for different users of guidelines

For patients
Strong recommendation: Most individuals in this situation would want the recommended course of action and only a small proportion would not.
Weak recommendation: The majority of individuals in this situation would want the suggested course of action, but many would not.

For clinicians
Strong recommendation: Most individuals should receive the recommended course of action. Adherence to this recommendation according to the guideline could be used as a quality criterion or performance indicator. Formal decision aids are not likely to be needed to help individuals make decisions consistent with their values and preferences.
Weak recommendation: Recognize that different choices will be appropriate for different patients, and that you must help each patient arrive at a management decision consistent with her or his values and preferences. Decision aids may well be useful in helping individuals make decisions consistent with their values and preferences. Clinicians should expect to spend more time with patients when working towards a decision.

For policy makers
Strong recommendation: The recommendation can be adapted as policy in most situations, including for use as a performance indicator.
Weak recommendation: Policy making will require substantial debate and involvement of many stakeholders. Policies are also more likely to vary between regions. Performance indicators would have to focus on the fact that adequate deliberation about the management options has taken place.
Individualization of clinical decision-making in weak recommendations remains a challenge. Although clinicians should always consider patients’ preferences and values, when they face weak recommendations they may need to have more detailed conversations with patients than for strong recommendations to ensure that the ultimate decision is consistent with the patient’s preferences and values.

Important Note:
Clinicians, patients, third-party payers, institutional review committees, other stakeholders, or the courts should never view recommendations as dictates. Even strong recommendations based on high-quality evidence will not apply to all circumstances and all patients. Users of guidelines may reasonably conclude that following some strong recommendations based on high quality evidence will be a mistake for some patients. No clinical practice guideline or recommendation can take into account all of the often compelling unique features of individual patients and clinical circumstances. Thus, nobody charged with evaluating clinicians’ actions should attempt to apply recommendations by rote or in a blanket fashion.
6.1.1 Strong recommendation

A strong recommendation is one for which the guideline panel is confident that the desirable effects of an intervention outweigh its undesirable effects (strong recommendation for an intervention) or that the undesirable effects of an intervention outweigh its desirable effects (strong recommendation against an intervention).

Note: Strong recommendations are not necessarily high priority recommendations.

A strong recommendation implies that most or all individuals will be best served by the recommended course of action.

Example 1: Sample strong recommendations
● Early anticoagulation in patients with deep venous thrombosis for the prevention of pulmonary embolism;
● Antibiotics for the treatment of community acquired pneumonia;
● Quitting smoking to prevent adverse consequences of tobacco smoke exposure;
● Use of bronchodilators in patients with known COPD.
6.1.2 Weak recommendation

A weak recommendation is one for which the desirable effects probably outweigh the undesirable effects (weak recommendation for an intervention) or the undesirable effects probably outweigh the desirable effects (weak recommendation against an intervention), but appreciable uncertainty exists.

A weak recommendation implies that not all individuals will be best served by the recommended course of action. There is a need to consider more carefully than usual the individual patient’s circumstances, preferences, and values. When there are weak recommendations, caregivers need to allocate more time to shared decision making, making sure that they clearly and comprehensively explain the potential benefits and harms to a patient.

Alternative names for weak recommendations
Some have been concerned with the term “weak recommendation”, experiencing an unintended negative connotation with the word “weak” and often also confusing it with “weak” evidence. To avoid confusion, weak recommendations can instead be described using the terms:
● conditional (depending on patient values, resources available or setting)
● discretionary (based on opinion of patient or practitioner)
● qualified (by an explanation regarding the issues which would lead to different decisions).

If any of these variations are used, it is essential that authors exercise consistency across all recommendations in a guideline and across all guidelines they produce.
6.1.3 Recommendations to use interventions only in research

Promising interventions (usually new ones) with thus far insufficient evidence of benefit to support their use may be associated with appreciable harms or costs. Decision makers may worry about providing premature favorable recommendations for their use, encouraging the rapid diffusion of potentially ineffective or harmful interventions, and preventing recruitment to research already under way. They may be equally reluctant to recommend against such interventions out of fear that they will inhibit further investigation. By making recommendations for use of an intervention only in the context of research, they may provide an important stimulus to efforts to answer important research questions, thus resolving uncertainty about optimal management.

Recommendations for using interventions only in research are appropriate when three conditions are met:

1. There is thus far insufficient evidence to support a decision for or against an intervention.
2. Further research has large potential for reducing uncertainty about the effects of the intervention.
3. Further research is thought to be of good value for the anticipated costs.

Recommendations for using interventions only in research should be accompanied by detailed suggestions about the specific research questions that should be addressed, particularly which patient-important outcomes they should measure. The recommendation for research may be accompanied by an explicit strong recommendation not to use the experimental intervention outside of the research context.
6.1.4 No recommendation

There are three reasons for which those making recommendations may be reluctant to make a recommendation for or against a particular management strategy and may also conclude that a recommendation to use the intervention only in research is not appropriate:

1. The confidence in effect estimates is so low that the panel feels a recommendation is too speculative (see the US Preventive Services Task Force discussion on the topic [Petitti 2009; PMID: 19189910]).
2. Irrespective of the confidence in effect estimates, the trade-offs are so closely balanced, and the values and preferences and resource implications not known or too variable, that the panel has great difficulty deciding on the direction of a recommendation.
3. Two management options have very different undesirable consequences, and individual patients’ reactions to these consequences are likely to be so different that it makes little sense to think about typical values and preferences.

The third reason requires an explanation. Consider adult patients with thalassemia major considering hematopoietic cell transplantation (possibility of cure but an early mortality risk of 33%) vs. continued medical treatment with transfusion and iron chelation (continued morbidity and an uncertain prognosis). A guideline panel may consider that in such situations the only sensible recommendation is a discussion between patient and physician to ascertain the patient’s preferences.

Users of guidelines, however, may be frustrated with the lack of guidance when the guideline panel fails to make a recommendation. The USPSTF states: "Decision makers do not have the luxury of waiting for certain evidence. Even though evidence is insufficient, the clinician must still provide advice, patients must make choices, and policy makers must establish policies" [Petitti 2009; PMID: 19189910]. Clinicians themselves will rarely explore the evidence as thoroughly as a guideline panel, nor will they devote as much thought to the trade-offs, or the possible underlying values and preferences in the population. GRADE encourages panels to deal with their discomfort and to make recommendations even when confidence in effect estimates is low and/or desirable and undesirable consequences are closely balanced. Such recommendations will inevitably be weak, and may be accompanied by qualifications.

In the unusual circumstances in which panels choose not to make a recommendation, they should specify the reason for this decision (see above).
6.2 Factors determining direction and strength of recommendations

Four key factors influence the direction and the strength of a recommendation (Table 6.2).

Table 6.2. Domains that contribute to the strength of a recommendation

Domain: Balance between desirable and undesirable outcomes (trade-offs), taking into account: best estimates of the magnitude of effects on desirable and undesirable outcomes; importance of outcomes (estimated typical values and preferences).
Comment: The larger the differences between the desirable and undesirable consequences, the more likely a strong recommendation is warranted. The smaller the net benefit and the lower the certainty for that benefit, the more likely a weak recommendation is warranted.

Domain: Confidence in the magnitude of estimates of effect of the interventions on important outcomes (overall quality of evidence for outcomes).
Comment: The higher the quality of evidence, the more likely a strong recommendation is warranted.

Domain: Confidence in values and preferences and their variability.
Comment: The greater the variability in values and preferences, or uncertainty about typical values and preferences, the more likely a weak recommendation is warranted.

Domain: Resource use.
Comment: The higher the costs of an intervention (the more resources consumed), the less likely a strong recommendation is warranted.
6.2.1 Balance of desirable and undesirable consequences

When deciding about the balance between desirable and undesirable outcomes ("trade-offs"), one considers two domains:

1. best estimates of the magnitude of desirable effects and undesirable effects (summarized in evidence profiles)
2. importance of outcomes – the typical values that patients or a population apply to those outcomes (“weight” of outcomes).
6.2.1.1 Estimates of the magnitude of the desirable and undesirable effects

Large relative effects of an intervention consistently pointing in the same direction (towards desirable or towards undesirable effects) are more likely to warrant a strong recommendation. Conversely, large relative effects of an intervention pointing in opposite directions (large desirable effects accompanied by large undesirable ones) will lead to weak recommendations.

Large absolute effects are also more likely to lead to a strong recommendation than small absolute effects. Baseline risk (control event rate) can influence the balance of desirable and undesirable outcomes. Large baseline risk differences will result in large differences in absolute effects of interventions. The strength of a recommendation and its direction, therefore, will likely differ in high- and low-risk groups.
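A purely illustrative calculation (all numbers are assumed, not taken from any trial) shows how the same relative effect translates into very different absolute effects, and therefore a different balance of consequences, at different baseline risks:

    # Illustrative only: the same relative risk produces different absolute
    # effects at different baseline risks (all numbers are assumptions).
    def absolute_risk_reduction(baseline_risk, relative_risk):
        return baseline_risk * (1 - relative_risk)

    relative_risk = 0.75  # an assumed 25% relative risk reduction
    for baseline_risk in (0.20, 0.02):  # a high-risk and a low-risk group
        arr = absolute_risk_reduction(baseline_risk, relative_risk)
        print(f"baseline {baseline_risk:.0%}: ARR = {arr:.1%}, NNT = {round(1 / arr)}")
    # baseline 20%: ARR = 5.0%, NNT = 20
    # baseline 2%: ARR = 0.5%, NNT = 200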
Examples

Large gradient between the desirable and undesirable effects (higher likelihood of a strong recommendation)

1. The very large gradient between the benefits of low dose aspirin (reductions in death and recurrent myocardial infarction) and the undesirable consequences of minimal side effects and costs makes a strong recommendation very likely.

Small gradient between the desirable and undesirable effects (higher likelihood of a weak recommendation)

1. Consider the choice of immunomodulating agents, namely cyclosporine or tacrolimus, in kidney transplant recipients. Tacrolimus results in better graft survival (a highly valued outcome), but at the important cost of a higher incidence of diabetes (the long-term complications of which can be devastating).

2. Patients with atrial fibrillation typically are more stroke averse than bleeding averse. If, however, the risk of stroke is sufficiently low, the trade-off between stroke reduction and increase in bleeding risk with anticoagulants is closely balanced.
6.2.1.2 Best estimates of values and preferences

Without considering the associated values and preferences, assessing large vs. small magnitude of effects may be misleading. Balancing the magnitude of desirable and undesirable outcomes requires considering the weight (importance) of those outcomes, which is determined by values and preferences.

Ideally, to inform estimates of typical patient values and preferences, guideline panels will conduct or identify systematic reviews of relevant studies of patient values and preferences. There is, however, a paucity of empirical examinations of patients’ values and preferences.

Well-resourced guideline panels will usually complement such studies with consultation with individual patients and patients’ groups. The panel should discuss whose values these people represent, namely representative patients, a defined subset of patients, or representatives of the general population.

Less well-resourced panels, without systematic reviews of values and preferences or consultation with patients and patient groups, must rely on unsystematic reviews of the available literature and their experience of interactions with patients. How well such estimates correspond to true typical values and preferences is likely to be uncertain.

Whatever the source of estimates of typical values and preferences, explicit, transparent statements of the panel’s choices are imperative (see 6.4.3 Providing transparent statements about assumed values and preferences).
6.3.2 Confidence in best estimates of magnitude of effects (quality of evidence)

For all outcomes considered, the GRADE process requires a rating describing the quality of evidence. Ultimately, guideline authors will form their recommendations based on their confidence in all effect estimates for each outcome considered critical to their recommendation and the quality of evidence. Quality of evidence ratings are determined by the eight criteria already discussed: five criteria result in rating down the quality of evidence (study limitations, inconsistency, indirectness, imprecision, and publication bias), whereas the remaining three criteria lead to an increase in evidence quality (large magnitude of effect, dose-response gradient, and situations in which all plausible biases or confounders would increase our confidence in the estimated effect).
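The following sketch is only a way of picturing this bookkeeping; GRADE ratings are judgments, not arithmetic, and the numeric mapping used here is an assumption made for illustration.

    # Illustrative sketch only: start from the study design (randomized trials
    # start as "high", observational studies as "low"), move down for the five
    # downgrading criteria and up for the three upgrading criteria.
    LEVELS = ["very low", "low", "moderate", "high"]

    def rate_quality(study_design, down=0, up=0):
        """down/up: total levels moved for the eight criteria (assumed, judgment-based inputs)."""
        start = 3 if study_design == "randomized trial" else 1
        return LEVELS[max(0, min(3, start - down + up))]

    print(rate_quality("randomized trial", down=1))    # e.g. imprecision -> "moderate"
    print(rate_quality("observational study", up=2))   # e.g. large effect + dose-response -> "high"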
Typically, a strong recommendation is associated with high, or at least moderate, confidence in the effect estimates for critical outcomes. If one has high confidence in effects on some critical outcomes (typically benefits), but low confidence in effects on other outcomes considered critical (often long-term harms), then a weak recommendation is likely warranted. Even when an apparently large gradient exists in the balance of desirable vs. undesirable outcomes, panels will be appropriately reluctant to offer a strong recommendation if their confidence in effect estimates for some critical outcomes is low.

For some questions, direct evidence about the effects on some critical outcomes may be lacking (e.g. quality of life has not been measured in any study). In such instances, even if well measured surrogates are available, confidence in estimates of effects on patient-important outcomes is very likely to be low.

Low confidence in effect estimates may, rarely, be tied to strong recommendations. In general, GRADE discourages guideline panels from making strong recommendations when their confidence in estimates of effect for critical outcomes is low or very low. GRADE has identified five paradigmatic situations in which strong recommendations may be warranted despite low or very low quality of evidence (Table 6.3). These situations can be conceptualized as ones in which a panel would have a low level of regret if subsequent evidence showed that their recommendation was misguided.

Table 6.3. Paradigmatic situations in which a strong recommendation may be warranted despite low or very low confidence in effect estimates

1. Condition: When low quality evidence suggests benefit in a life-threatening situation (evidence regarding harms can be low or high).
Examples: Fresh frozen plasma or vitamin K in a patient receiving warfarin with an elevated INR and an intracranial bleed; only low quality evidence supports the benefits of limiting the extent of the bleeding. Amphotericin B vs. itraconazole in life-threatening disseminated blastomycosis; high quality evidence suggests that amphotericin B is more toxic than itraconazole, and low quality evidence suggests that it reduces mortality in this context.

2. Condition: When low quality evidence suggests benefit and high quality evidence suggests harm or a very high cost.
Example: Head-to-toe CT/MRI screening for cancer. Low quality evidence of benefit of early detection, but high quality evidence of possible harm and/or high cost (strong recommendation against this strategy).

3. Condition: When low quality evidence suggests equivalence of two alternatives, but high quality evidence of less harm for one of the competing alternatives.
Example: Helicobacter pylori eradication in patients with early stage gastric MALT lymphoma who are H. pylori positive. Low quality evidence suggests that initial H. pylori eradication results in similar rates of complete response in comparison with the alternatives of radiation therapy or gastrectomy; high quality evidence suggests less harm/morbidity.

4. Condition: When high quality evidence suggests equivalence of two alternatives and low quality evidence suggests harm in one alternative.
Example: Hypertension in women planning conception and in pregnancy. Strong recommendations for labetalol and nifedipine and strong recommendations against angiotensin converting enzyme (ACE) inhibitors and angiotensin receptor blockers (ARBs): all agents have high quality evidence of equivalent beneficial outcomes, with low quality evidence for greater adverse effects with ACE inhibitors and ARBs.

5. Condition: When high quality evidence suggests modest benefits and low/very low quality evidence suggests the possibility of catastrophic harm.
Example: Testosterone in males with or at risk of prostate cancer. High quality evidence for moderate benefits of testosterone treatment in men with symptomatic androgen deficiency to improve bone mineral density and muscle strength; low quality evidence for harm in patients with or at risk of prostate cancer.

INR – international normalized ratio; CT – computed tomography; MRI – magnetic resonance imaging; MALT – mucosa-associated lymphoid tissue.
6.3.3 Confidence in values and preferences

Uncertainty concerning values and preferences, or their variability among patients, may lower the strength of a recommendation.

As noted above, systematic study of patients’ values and preferences is very limited. Thus, panels will often be uncertain about typical values and preferences. The greater the uncertainty, the more likely they will make a weak recommendation. Given the sparse systematic study of patients’ values and preferences, one could argue that large uncertainty always exists about the patients’ perspective. On the other hand, clinicians’ experience with patients may provide considerable additional insight. Indeed, on occasion, panels will, on the basis of clinical experience, be confident regarding typical patients’ values and preferences. Pregnant women’s strong aversion to even a small risk of important fetal abnormalities may be one such situation.

Large variability in values and preferences may also make a weak recommendation more likely. In such situations, it is less likely that a single recommendation would apply uniformly across all patients, and the right course of action is likely to differ between patients. Again, systematic research about variability in values and preferences is sparse. On the other hand, clinical experience may leave a panel confident that values and preferences differ widely among patients.
Examples

1. A hopeful patient may place more emphasis on a small chance of benefit, whereas a pessimistic, risk-averse patient may place more emphasis on avoiding the risks associated with a potentially beneficial therapy. Some patients may believe that even if the risk of an adverse event is low, they will be the person who suffers such an adverse effect. For instance, in patients with idiopathic pulmonary fibrosis, evidence for the benefit of steroids warrants only low confidence, whereas we can be very confident of a wide range of adverse effects associated with steroids. The hopeful patient with pulmonary fibrosis may be enthusiastic about use of steroids, whereas the risk-averse patient is likely to decline.

2. Thromboprophylaxis reduces the incidence of venous thromboembolism in immobile, hospitalized, severely ill medical patients. Careful thromboprophylaxis has minimal side effects and relatively low cost while being very effective at preventing deep venous thrombosis and its sequelae. People’s values and preferences are such that virtually all patients admitted to a hospital would, if they understood the choice they were making, opt to receive some form of thromboprophylaxis. Those making recommendations can thus offer a strong recommendation for thromboprophylaxis for patients in this setting.

3. A systematic review and meta-analysis describes a relative risk reduction (RRR) of approximately 80% in recurrent DVT for prophylaxis beyond 3 months up to one year. This large effect supports a strong recommendation for warfarin. Furthermore, the relatively narrow 95% confidence interval (approximately 74% to 88%) suggests that warfarin provides a RRR of at least 74%, and further supports a strong recommendation. At the same time, warfarin is associated with an inevitable burden of keeping dietary intake of vitamin K relatively constant, monitoring the intensity of anticoagulation with blood tests, and living with the increased risk of both minor and major bleeding. It is likely, however, that most patients would prefer avoiding another DVT and would accept the risk of a bleeding episode. As a result, almost all patients at high risk of recurrent DVT would choose to take warfarin for 3 to 12 months, suggesting the appropriateness of a strong recommendation. Thereafter, there may be an appreciable number of patients who would reject life-long anticoagulation.
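To make the trade-off in example 3 concrete, a purely illustrative calculation follows; the 10% baseline risk of recurrent DVT is an assumption chosen for the example, not a figure from the review.

    # Illustrative only: translating the relative risk reduction (RRR) and its
    # confidence interval into absolute terms for an assumed baseline risk.
    baseline_risk = 0.10  # assumed risk of recurrent DVT without prophylaxis
    for label, rrr in [("point estimate", 0.80), ("lower CI bound", 0.74), ("upper CI bound", 0.88)]:
        arr = baseline_risk * rrr
        print(f"{label}: RRR {rrr:.0%} -> {arr:.1%} fewer recurrences, NNT ~{round(1 / arr)}")
    # point estimate: RRR 80% -> 8.0% fewer recurrences, NNT ~12
    # lower CI bound: RRR 74% -> 7.4% fewer recurrences, NNT ~14
    # upper CI bound: RRR 88% -> 8.8% fewer recurrences, NNT ~11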
6.3.4 Resource use (cost)

Panels may or may not consider resource use in their judgments about the direction and strength of recommendations. Reasons for not considering resource use include: a lack of reliable data; the intervention is not useful, so the effort of calculating resource use can be spared; the desirable effects so greatly outweigh any undesirable effects that resource considerations would not alter the final judgment; or the panel has elected (or been instructed) to leave resource considerations up to other decision makers. Panels should be explicit about a decision not to consider resource utilization and the reason for that decision.

If they elect to include resource utilization when making a recommendation, but have not included resource use as a consequence when preparing an evidence profile, they should be explicit about what types of resource use they considered when making the recommendation and whatever logic or evidence was used in their judgments.

Cost may be considered just another potentially important outcome – like mortality, morbidity, and quality of life – associated with alternative ways of managing patient problems. In addition to these clinical outcomes, however, an intervention may increase or decrease costs. The GRADE approach recommends that important or critical resource use be considered alongside other relevant outcomes in evidence profiles and summary of findings tables. It is important to use natural units when presenting resource use data, as these can be applied in any setting.

Special considerations when incorporating resource use (cost) in recommendations:
● What are the differences between costs and other outcomes?
● Which perspective to take?
● Which resource implications to include?
● How to make judgments about the quality of the evidence?
● How to present these implications?
● What is the potential usefulness of a formal economic model?
● How to consider resource use in formulating recommendations?
6.3.4.1 Differences between costs and other outcomes

There are several differences between costs and other outcomes:

1. With costs, the issue of who pays and who gains is most prominent.
2. Attitudes about the extent to which costs should influence the decision differ depending on who bears the cost.
3. Costs tend to vary widely across jurisdictions and over time.
4. People have different perspectives on the envelope in which they are considering opportunity costs.
5. Resource allocation is a far more political issue than consideration of other outcomes.

1. With costs, the issue of who pays and who gains is most prominent.
For most outcomes other than costs, it is clear that the patient and, secondarily, the patient’s family gain the advantages and have to live with the disadvantages (this is not true of all outcomes – with vaccinations the entire community benefits from the herd effect, and widespread use of antibiotics may have down-stream adverse consequences of drug resistance). Health care costs are often borne by society as a whole. Even within a society, who bears the cost may differ depending on the patient’s age or situation.

2. Attitudes about the extent to which costs should influence the decision differ depending on who bears the cost.
If costs are borne by the government or a third-party payer, some would argue that the physician’s responsibility to the patient means that costs should not influence the decision. On the other hand, a clinician’s responsibility when caring for a patient is discharged in a broader context: resources that are used for an intervention cannot be used for something else and can affect the ability of the health system to best meet the needs of those it serves.

3. Costs tend to vary widely across jurisdictions, or even within jurisdictions, and over time.
Costs of drugs are largely unrelated to the costs of production of those drugs, and relate more to marketing decisions and national policies. Hospitals or health maintenance organizations may, for instance, negotiate special arrangements with pharmaceutical companies for prices substantially lower than are available to patients or other providers. Costs can also vary widely over time (e.g. when a drug comes off patent or a new, cheaper technology becomes available). The large variability in costs over time and across jurisdictions requires that guideline panels formulate health care questions as specifically as possible when bringing cost into the equation. The choice of comparator can be a particular problem in economic analyses. If the choice of the comparator is inappropriate (for instance, no treatment rather than an alternative though less effective intervention), conclusions may be misleading. Even when resource use remains the same, the resource implications may vary widely across jurisdictions. A year’s supply of a very expensive drug may pay a nurse’s salary in the United States, six nurses’ salaries in Poland, and 30 nurses’ salaries in China. Thus, what one can buy with the resources saved if one foregoes purchase of the drug (the “opportunity cost”) – and the health benefits achieved with those expenditures – will differ to a large extent.

4. People have different perspectives on the envelope in which they are considering opportunity costs.
A hospital pharmacy with a fixed budget considering purchase of an expensive new drug will have a clear idea of what that purchase will mean in terms of other medications the pharmacy cannot afford. People often assume the envelope is public health spending – funding a new drug or program will constrain resources for other public health expenditures. However, one may not be sure that refraining from that purchase really means that equivalent resources will be available for the health care system. Further, one may ask whether public health care spending is the correct envelope to consider.

5. Resource allocation is a far more political issue than consideration of other outcomes.
Whether or not the guideline panel explicitly considers resource allocation issues, those politics may bear on a guideline panel’s function through conflict of interest.
Despite these differences, approaches to cost (resource use) are similar to other outcomes:
● guideline panels need to consider only important resource implications
● decision makers require an estimate of the difference between treatment and control
● guideline panels must make explicit judgments about the quality of evidence regarding incremental resource use.
6.3.4.2 Perspective

GRADE suggests that a broad perspective is desirable.

A recommendation could be intended for a very narrow audience, such as a single hospital pharmacy, an individual hospital or a health maintenance organization. Alternatively, it could be intended for a health region, a country or an international audience. Regardless of how narrow or broad the intended audience, guideline groups that choose to incorporate resource implications must be explicit about the perspective they are taking. Alternatively, a guideline may choose to take a societal perspective and include all important resource implications, regardless of who bears the costs.

In a publicly funded health system the patient perspective would consider only resource implications that directly affect individual patients (e.g. out of pocket costs) and would ignore most of the costs generated (e.g. costs borne by the government). In European health care systems in which, for the most part, governments bear the cost of health care, expenses borne directly by patients will be minimal. A pharmacy perspective would ignore down-stream cost savings resulting from adverse events (e.g. stroke or myocardial infarction) prevented by a drug. A hospital perspective would ignore out-patient costs, either incurred or prevented. In the private sector, where disenrollment and loss of insurance can shift the burden of costs from one system to another, estimates of resource use should include the down-stream costs of all treated patients, not just those who remain in a particular health plan.

An even broader perspective, that of society, would include indirect costs or savings (e.g. lost wages). These are difficult to estimate and controversial because they assume that lost productivity will not be replaced by an individual who otherwise would be unemployed or underemployed, and they implicitly place a lower value on individuals not working (e.g. the retired). Taking a health systems perspective has another advantage: a comprehensive display of the resource use associated with alternative management strategies allows an individual or group – a patient, a pharmacy, or a hospital – to examine the relative merits of the alternatives from their particular perspective.

Clinicians seeing patients who are not covered by either public or private insurance may need to help these individuals make decisions taking into account their out of pocket costs. This is particularly true when clinical advantages and disadvantages are closely balanced and there are substantial out of pocket costs. In these circumstances, if a guideline panel has used the GRADE approach and made evidence profiles available to the guideline users, clinicians can review evidence summaries and ensure that the patients’ decision to accept the recommended management strategy is consistent with their values and preferences – either through communicating the information directly to the patient, or by finding out what the patients’ situation and values and preferences are.
6.3.4.3 Resource implications considered
Evidence profiles and summary of findings tables should always present resource use, not just monetary values, as monetary values for the same resource will vary depending on the setting.

We suggest that guideline developers document best estimates of resource use, not best estimates of costs. Costs are a function of the resources expended and the cost per unit of resource. Given the wide variability in costs per unit, reporting only total costs across broad categories of resource expenditure leaves users without the information required to judge whether estimates of unit costs apply to their setting. It is therefore recommended that natural units be used to estimate resource use: for example, the number of days in hospital required, since the cost per night will vary depending on the setting.

Users of guidelines will be best informed if the guideline developers specify the resources consumed by alternate management strategies, because they can:

● judge whether the resource use reflects practice patterns in their setting
● focus on the items of most relevance to them
● ascertain whether the unit costs apply in their setting.

Unless resource use is specified, users in settings other than that on which the analysts focus cannot estimate the associated incremental costs of the intervention.
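As a purely illustrative sketch (the quantities below are hypothetical and not drawn from the handbook), reporting resource use in natural units lets each user attach unit costs from their own setting:

incremental cost = (difference in resource use) × (local unit cost)

Setting A: 2 extra hospital nights × $400 per night = $800 per patient
Setting B: 2 extra hospital nights × $1,200 per night = $2,400 per patient

The resource estimate (two extra nights) can travel across settings; the monetary value cannot.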
6.3.4.4 Confidence in the estimates of resource use (quality of the evidence about cost)

Evidence of resource use may come from different sources than evidence of health benefits. This may be the case because trials of interventions do not fully report resource use, because the trial situation may not fully reflect the circumstances (and thus the resource use) that we would expect in clinical practice, because the relevant resource use may extend beyond the duration of the trial, and because resource use may vary substantially across settings.

For resource use that is reported in the context of trials, the criteria for quality assessment are identical to those for other outcomes. Just as for other outcomes of a trial, the quality of evidence may differ across different resources. For example, drug use may be relatively easy to estimate, whereas use of health professionals' time may be more difficult, and the estimate of drug use may therefore be of higher quality.
6.3.4.5 Presentation of resource use
A balance sheet (e.g. an evidence profile) should inform judgments about whether the net benefits are worth the incremental costs. Balance sheets efficiently present the raw information required to make informed, explicit judgments concerning resource use in guideline recommendations. However, when complex trade-off decisions involving several outcomes need to be made, judgments may remain implicit or be described only qualitatively.

Pooling resource estimates from different studies is seldom done, as it can be quite controversial, and should be carefully considered. However, authors can consider presenting pooled estimates of resource use when they are confident that the outcome in question has a common meaning (e.g. number of nights stayed in hospital) across the studies involved in the analysis. Even in this case, it is recommended that authors adjust for geographical and temporal differences in cost.

6.3.4.6 Economic model

Formal economic modeling may – or may not – be helpful.

Formal economic modeling expresses results as the cost per unit of benefit achieved: cost per natural unit, such as cost per stroke prevented (cost-effectiveness analysis); cost per quality-adjusted life year gained (cost-utility analysis); or costs and benefits both valued in monetary terms (cost-benefit analysis). These summaries can be helpful for informing judgments. Unfortunately, many published cost-effectiveness analyses have a high probability of being flawed or biased, and are setting-specific. When estimates of harms, benefits and resources used are based on low quality evidence, the transparency of the economic model will be reduced and the model may be misleading.
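For orientation only (the numbers below are hypothetical, not taken from the handbook), the summary statistic such analyses report is an incremental cost-effectiveness ratio of the form

ICER = (cost of intervention − cost of comparator) / (effect of intervention − effect of comparator)

so that an intervention costing $2,000 more per patient and preventing 0.01 additional strokes per patient would be reported as $2,000 / 0.01 = $200,000 per stroke prevented.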
Should guideline panels consider developing their own formal economic model? Creating an economic model may be advisable if:

● guideline groups have the necessary expertise and resources
● the difference in resources consumed by the alternative management strategies is large, and there is therefore substantial uncertainty about whether the net benefits of an intervention are worth the incremental costs
● the quality of the available evidence regarding resource consumption is high and it is likely that a full economic model would help inform a decision
● implementing an intervention requires large capital investments, such as building new facilities or purchasing new, expensive equipment.

Modeling – while necessary for taking into account complexities and uncertainties in calculating cost per unit of benefit – reduces transparency. Any model is only as good as the data on which it is based. When estimates of benefits, harms, or resources used come from low quality evidence, the results of any economic modeling will be highly speculative.

Although criteria to assess the credence to give to results from statistical models of cost-effectiveness or cost-utility are available, these models generally include a large number of assumptions and evidence of varying quality for the estimates included in the model. For these reasons, the GRADE working group recommends not including cost-effectiveness or cost-utility models in evidence profiles. These models may, however, inform the judgments of a guideline panel, or those of governments or third party payers considering whether to include an intervention among their programs' benefits.
6.3.4.7 Consideration of resource use in recommendations

A guideline panel may choose explicitly to consider or not to consider resource use in its recommendations. A guideline panel may legitimately choose to leave considerations of resource use aside, and offer a recommendation solely on the basis of the other advantages and disadvantages of the alternatives being considered. Resource allocation must then be considered at the level of the ultimate decision-maker – be it the patient and healthcare professional, an organization (e.g. a hospital pharmacy or a health maintenance organization), a third party payer, or a government. Guideline panels should be explicit about the decision to consider or not to consider resource utilization.

If a guideline panel considers resource use it should, prior to bringing cost into the equation, first decide on the quality of evidence regarding the other outcomes, and weigh up the advantages and disadvantages. Decisions regarding the importance of resource use issues will flow from this first step. For example, resource implications may be irrelevant if evidence of net health benefits is lacking. If the advantages of an intervention far outweigh its disadvantages, resource use is less likely to be important. Resource use usually becomes important when advantages and disadvantages are closely balanced.

The GRADE approach suggests that panels considering resource use should offer only a single recommendation that takes resource use into account. Panels should refrain from issuing two recommendations – one not taking resource use into account and a second doing so. Although this would have the advantage of explicitness, on which GRADE places a very high value, the GRADE working group is concerned that those with interests in the dissemination of an intervention would use only the recommendation ignoring resource implications as a weapon in their battle for funds (public funds, in particular).
6.4 Presentation of recommendations

6.4.1 Wording of recommendations

Wording of a recommendation should offer clinicians as many indicators as possible for its understanding and interpretation.

Recommendations should always answer the initial clinical question. Therefore, they should specify the patients or population (characterized by the disease and other identifying factors) for whom the recommendation is intended, and the recommended intervention as specifically and in as much detail as needed. Unless it is obvious, they should also specify the comparator. Sometimes the recommendation may include a reference to the setting (e.g. primary or tertiary care, high- or low-income countries, etc.).

In general, it seems preferable to present recommendations in favor of a particular management approach rather than against an alternative. For instance, in considering the addition of aspirin to clopidogrel in patients who have had a stroke, it would be preferable to state: "In patients who have had a stroke, we suggest clopidogrel alone vs. adding aspirin to clopidogrel" rather than: "In patients who have had a stroke and are using clopidogrel, we suggest not adding aspirin". However, when a useless or harmful therapy is in wide use, recommendations against a management approach are appropriate. For instance, "In patients undergoing cardiac surgery who were not previously receiving beta blockers, we suggest not initiating perioperative beta blocker therapy".

Recommendations in the passive voice may lack clarity; therefore, GRADE suggests that guideline developers present recommendations in the active voice.

For strong recommendations, the GRADE working group has suggested adopting terminology such as "we recommend...", "clinicians should...", "clinicians should not...", "Do..." or "Don't...".

For weak recommendations, the GRADE working group has suggested less definitive wording, such as "we suggest...", "clinicians might...", "We conditionally recommend..." or "We make a qualified recommendation that...".

Wording strong and weak recommendations is particularly important when guidelines are developed by international organizations and/or are intended for patients and clinicians in different regions, cultures, traditions, and usages of language. It is also crucial to consider wording explicitly and precisely when translating recommendations into different languages. Whatever terminology guideline panels choose to use to communicate the dichotomous nature of a recommendation, it is essential that they inform their users what the terms imply by providing explanations as in Table 5.9.

Misinterpretation is possible however the strength of recommendations is expressed. We suggest guideline developers consider using both words and symbols (which may be less confusing than numbers or letters) to express the strength of recommendations.
6.4.2 Symbolic representation

A variety of presentations of the quality of evidence and strength of recommendations may be appropriate. Most guideline panels have used letters and numbers to summarize their recommendations. Because of the highly variable use of numbers and letters by different organizations, this presentation may be confusing. Symbolic representations of the quality of evidence and strength of recommendations are appealing in that they are not burdened with this historical confusion. On the other hand, clinicians seem to be very comfortable with numbers and letters, which are particularly suitable for verbal communication, so there may be good reasons why organizations have chosen to use them.

The GRADE working group has decided to offer preferred symbolic representations, but users of guidelines based on the GRADE approach will often see numbers and letters being used to express the quality of evidence and strength of a recommendation.

Table 6.4. Suggested representations of quality of evidence and strength of recommendations

Quality of Evidence / Symbol / Letter (varies)
High ⨁⨁⨁⨁ A
Moderate ⨁⨁⨁◯ B
Low ⨁⨁◯◯ C
Very low ⨁◯◯◯ D

Strength of Recommendation / Symbol / Number
Strong for an intervention ↑↑ 1
Weak for an intervention ↑? 2
Weak against an intervention ↓? 2
Strong against an intervention ↓↓ 1
6.4.3 Providing transparent statements about assumed values and preferences

Ideally, recommendations should be accompanied by a statement presenting assumptions about the values and preferences that underlie them. For instance, a guideline addressing issues of thrombosis prevention and treatment in pregnancy noted: "Our recommendations reflect a belief that most women will place a low value on avoiding the pain, cost, and inconvenience of heparin therapy to avoid the small risk of even a minor abnormality in their child associated with warfarin prophylaxis".

In addition to, or in place of, making such general statements, guideline panels may provide statements associated with individual recommendations, especially those that are particularly sensitive to values and preferences. In such cases authors should place statements about the underlying values and preferences with the recommendation statement rather than in the accompanying text. This prominent positioning of the statements will make it less likely that users of guidelines miss the importance of the values and preference judgments.

Consider, for instance, two groups that were part of a broader guideline effort and made apparently contradictory recommendations regarding aspirin vs. clopidogrel in patients with atherosclerotic vascular disease, despite using the same underlying evidence from a trial that enrolled both patients with threatened stroke and those with peripheral vascular disease. One group, focusing on stroke prevention, recommended clopidogrel over aspirin, stating: "This recommendation places a relatively high value on a small absolute risk reduction in stroke rates, and a relatively low value on minimizing drug expenditures". The other group, focusing on peripheral vascular disease, recommended aspirin over clopidogrel, stating: "This recommendation places a relatively high value on avoiding large resource expenditures to achieve small reductions in vascular events". These recommendations suggest opposite courses of action. Both are appropriate given the stated values and preferences, which were made explicit in qualifying statements accompanying each recommendation.

Another way to frame values and preferences statements that panels may want to consider is in terms of patients who do not share the values and preferences underlying the recommendation. For instance, one may say: "For most healthy patients with achalasia undergoing an invasive procedure, we suggest minimally invasive surgical myotomy rather than pneumatic dilatation. Patients who prefer to avoid surgery and the high rates of gastroesophageal reflux disease seen after surgery, and who are willing to accept a higher initial failure rate and long-term recurrence rate, can reasonably choose pneumatic dilatation".
6.5 The Evidence-to-Decision framework

Ultimately, guideline panels must integrate these determinants of direction and strength to make a strong or weak recommendation for or against an intervention. Table 6.2 presents the generic Evidence-to-Decision (EtD) table that groups making recommendations may use to facilitate decision making, record judgements, and document the process of going from evidence to the decision. Table 6.3 presents an example of the EtD framework used in the development of recommendations about the use of ASA in patients with atrial fibrillation (PDF version).
Table 6.5. The Evidence-to-Decision framework

The generic framework is a table with four columns – Criteria, Judgements, Research evidence and Additional considerations – with one row per criterion. The judgement options for each criterion, and the illustrative research evidence included in the generic table, are reproduced below; the research evidence and additional considerations are completed when the framework is applied.

Problem
Is the problem a priority?
Judgement options: No / Probably no / Uncertain / Probably yes / Yes / Varies

Benefits & harms of the options
Is there important uncertainty about how much people value the main outcomes?
Judgement options: Important uncertainty or variability / Possibly important uncertainty or variability / Probably no important uncertainty or variability / No important uncertainty or variability / No known undesirable outcomes
Research evidence – the relative importance or values of the main outcomes of interest:
Outcome 1: CRITICAL; certainty of the evidence (GRADE) ⨁⨁⨁⨁ HIGH
Outcome 2: CRITICAL; certainty of the evidence (GRADE) ⨁⨁⨁◯ MODERATE

What is the overall certainty of this evidence?
Judgement options: No included studies / Very low / Low / Moderate / High
Research evidence – Summary of findings: intervention C
Outcome 1: 61 per 1000 without the intervention; 37 per 1000 (25 to 49) with the intervention; difference 25 fewer per 1000 (from 12 fewer to 37 fewer); relative effect RR 0.6 (95% CI 0.4 to 0.8)
Outcome 2: 108 per 1000 without the intervention; 99 per 1000 (80 to 134) with the intervention; difference 9 fewer per 1000 (from 26 more to 28 fewer); relative effect RR 0.92 (95% CI 0.74 to 1.24)

Are the desirable anticipated effects large?
Judgement options: No / Probably no / Uncertain / Probably yes / Yes / Varies

Are the undesirable anticipated effects small?
Judgement options: No / Probably no / Uncertain / Probably yes / Yes / Varies

Are the desirable effects large relative to undesirable effects?
Judgement options: No / Probably no / Uncertain / Probably yes / Yes / Varies

Resource use
Are the resources required small?
Judgement options: No / Probably no / Uncertain / Probably yes / Yes / Varies

Is the incremental cost small relative to the net benefits?
Judgement options: No / Probably no / Uncertain / Probably yes / Yes / Varies

Equity
What would be the impact on health inequities?
Judgement options: Increased / Probably increased / Uncertain / Probably reduced / Reduced / Varies

Acceptability
Is the option acceptable to key stakeholders?
Judgement options: No / Probably no / Uncertain / Probably yes / Yes / Varies

Feasibility
Is the option feasible to implement?
Judgement options: No / Probably no / Uncertain / Probably yes / Yes / Varies

Evidence to Decisions Framework: explanations
Purpose of the framework
The purpose of this framework is to help panels developing guidelines move from evidence to recommendations. It is intended to:
● Inform panel members' judgements about the pros and cons of each option (intervention) that is considered
● Ensure that important factors that determine a recommendation (criteria) are considered
● Provide a concise summary of the best available research evidence to inform judgements about each criterion
● Help structure discussion and identify reasons for disagreements
● Make the basis for recommendations transparent to guideline users

Development of the framework
The framework is being developed as part of the DECIDE project using an iterative process informed by the GRADE approach for going from evidence to clinical recommendations, a review of relevant literature, brainstorming, feedback from stakeholders, application of the framework to examples, a survey of policymakers, user testing, and trials. DECIDE (Developing and Evaluating Communication Strategies to Support Informed Decisions and Practice Based on Evidence) is a 5-year project (running from January 2011 to 2015) co-funded by the European Commission under the Seventh Framework Programme. DECIDE's primary objective is to improve the dissemination of evidence-based recommendations by building on the work of the GRADE Working Group to develop and evaluate methods that address the targeted dissemination of guidelines.
Description of the framework
The framework includes a table with the following columns:
● Criteria (factors that should be considered) for health system or public health recommendations
● Judgements that the panel members must make in relation to each criterion, which may include draft judgements suggested by the people who have prepared the framework
● Research evidence to inform each of those judgements, which may include links to more detailed summaries of the evidence
● Additional considerations to inform or justify each judgement

The framework also includes the following conclusions that the panel members must reach, which may include draft conclusions suggested by the people who have prepared the framework:
● The balance of consequences of the option being considered in relation to the alternative (comparison)
● The type of recommendation (against the option, for considering the option under specified conditions, or for the option)
● The recommendation in concise, clear and actionable text
● The justification for the recommendation, flowing from the judgements in relation to the criteria
● Any important subgroup considerations that may be relevant to guideline users
● Key implementation considerations (in addition to any that are specified in the recommendation), including strategies to address any concerns about the acceptability and feasibility of the option
● Suggestions for monitoring and evaluation if the option is implemented, including any important indicators that should be monitored and any needs for a pilot study or impact evaluation
● Any key research priorities to address important uncertainties in relation to any of the criteria
Flexibility
The framework is flexible. Organisations may elect to modify the terminology (and language) that is used, the criteria, the response options and the guidance for using the framework to ensure that the framework is fit for purpose.

Use of the framework
Suggestions for how to use the framework are provided in: Framework for going from evidence to a recommendation – Guidance for health system and public health recommendations, including suggestions for preparing frameworks, supporting use of the framework by guideline panels, and using the framework to support well-informed decisions by guideline users.

The final recommendation made by the guideline panel is a consensus based on the judgements of the panel members, informed by the evidence presented in the framework and the panel members' expertise and experience.

Explanations of the criteria in the framework

Why these criteria?
The criteria included in the framework are ones that have emerged from our literature review, brainstorming, feedback from stakeholders, application of the framework to examples, a survey of policymakers and user testing. It is possible that we will make further modifications based on continuing feedback, applications of the framework and user testing. Guideline developers may also want to make modifications, such as adding or removing criteria that are or are not important for them to consider. However, there is clear and consistent support for routinely including all of these criteria and, up to now, a lack of clear and consistent support for including other potential criteria.
Detailed judgements
The judgements that need to be made are sometimes complex. Guideline panels are likely to find it helpful to make and record detailed judgements for some criteria using tables for detailed judgements. This includes, for example, detailed judgements about the size of the effect for each outcome, the certainty of the evidence of the relative importance of the outcomes and resource use, and important subgroup considerations. Some criteria could be split into two or more separate criteria, and some panels may elect to do this in order to highlight key considerations that are of particular importance for their guidelines. For example, there are several reasons why an option may not be acceptable to key stakeholders, and these could potentially be considered as separate criteria.

From whose perspective?
Guideline panels should explicitly state the perspective that they are taking when making recommendations. This is especially important for determining which costs (resource use) to consider. It can also influence which outcomes and whose values are considered. For example, out-of-pocket costs are important from the perspective of an individual patient, whereas costs to the government are important from the perspective of the government. Health system and public health decisions are made on behalf of a population and a broad perspective is required. However, because of their mandate, some panels might take the perspective of the ministry of health or health department, whereas other panels might take a societal perspective (including all costs, regardless of who pays). Other perspectives (the distribution of the benefits, harms and costs) should be taken when considering the acceptability of the option to key stakeholders.

Large or small compared to what?
Some of the criteria imply a comparison; for example, the size of effects or resource requirements compared to what? The comparisons or standards that are used are likely to be different for different organisations, guideline panels and jurisdictions. Some organisations or guideline panels may elect to specify the comparisons or standards that they will use. In the absence of such specified comparisons, guideline panel members should consider what their comparisons or standards are when they disagree, for example, about whether resource requirements are large. When the comparison being used is the source of their disagreement, they should agree on an appropriate comparison and include this as an additional consideration in the framework when it is relevant.
Guidance for making judgements
Suggestions for how to make judgements in relation to each criterion are provided in Framework for going from evidence to a recommendation – Guidance for health system and public health recommendations.

For each criterion there are four or five response options, from those that favour a recommendation against the option on the left to ones that favour a recommendation for the option on the right. In addition, most of the criteria include varies as a response option for situations when there is important variation across the different settings for which the guidelines are intended and those differences are substantial enough that they might lead to different recommendations for different settings.

Questions to consider for each criterion and their relationship to a recommendation
For each criterion we suggest one or more detailed questions to consider when making a judgement and explain the relationship between the criterion and the recommendation.
Criteria, questions and explanations

Is the problem a priority?
Questions: Are the consequences of the problem serious (i.e. severe or important in terms of the potential benefits or savings)? Is the problem urgent? Is it a recognised priority (e.g. based on a national health plan)? Are a large number of people affected by the problem?
Explanation: The more serious a problem is, the more likely it is that an option that addresses the problem should be a priority (e.g., diseases that are fatal or disabling are likely to be a higher priority than diseases that only cause minor distress). The more people who are affected, the more likely it is that an option that addresses the problem should be a priority.

Is there important uncertainty about how much people value the main outcomes?
Questions: How much do those affected by the option value each of the outcomes in relation to the other outcomes (i.e. what is the relative importance of the outcomes)? Is there evidence to support those value judgements, or is there evidence of variability in those values that is large enough to lead to different decisions?
Explanation: The more likely it is that differences in values would lead to different decisions, the less likely it is that there will be a consensus that an option is a priority (or the more important it is likely to be to obtain evidence of the values of those affected by the option). Values in this context refer to the relative importance of the outcomes of interest (how much people value each of those outcomes). These values are sometimes called 'utility values'.

What is the overall certainty of the evidence of effectiveness? (1)
Questions: What is the overall certainty of this evidence of effects, across all of the outcomes that are critical to making a decision?
Explanation: The less certain the evidence is for critical outcomes (those that are driving a recommendation), the less likely it is that an option should be recommended (or the more important it is likely to be to conduct a pilot study or impact evaluation, if it is recommended).

How substantial are the desirable anticipated effects?
Questions: How substantial (large) are the desirable anticipated effects (including health and other benefits) of the option (taking into account the severity or importance of the desirable consequences and the number of people affected)?
Explanation: The larger the benefit, the more likely it is that an option should be recommended.

How substantial are the undesirable anticipated effects?
Questions: How substantial (large) are the undesirable anticipated effects (including harms to health and other harms) of the option (taking into account the severity or importance of the adverse effects and the number of people affected)?
Explanation: The greater the harm, the less likely it is that an option should be recommended.

Do the desirable effects outweigh the undesirable effects?
Questions: Are the desirable effects large relative to the undesirable effects?
Explanation: The larger the desirable effects in relation to the undesirable effects, taking into account the values of those affected (i.e. the relative value they attach to the desirable and undesirable outcomes), the more likely it is that an option should be recommended.

How large are the resource requirements?
Questions: How large an investment of resources would the option require or save?
Explanation: The greater the cost, the less likely it is that an option should be a priority. Conversely, the greater the savings, the more likely it is that an option should be a priority.

How large is the incremental cost relative to the net benefit?
Questions: Is the cost small relative to the net benefits (benefits minus harms)?
Explanation: The greater the cost per unit of benefit, the less likely it is that an option should be a priority.

What would be the impact on health inequities?
Questions: Would the option reduce or increase health inequities?
Explanation: Policies or programmes that reduce inequities are more likely to be a priority than ones that do not (or ones that increase inequities).

Is the option acceptable to key stakeholders?
Questions: Are key stakeholders likely to find the option acceptable (given the relative importance they attach to the desirable and undesirable consequences of the option; the timing of the benefits, harms and costs; and their moral values)?
Explanation: The less acceptable an option is to key stakeholders, the less likely it is that it should be recommended, or, if it is recommended, the more likely it is that the recommendation should include an implementation strategy to address concerns about acceptability. Acceptability might reflect who benefits (or is harmed) and who pays (or saves), and when the benefits, adverse effects, and costs occur (and the discount rates of key stakeholders; e.g. politicians may have a high discount rate for anything that occurs beyond the next election). Unacceptability may be due to some stakeholders:
● Not accepting the distribution of the benefits, harms and costs
● Not accepting costs or undesirable effects in the short term for desirable effects (benefits) in the future
● Attaching more value (relative importance) to the undesirable consequences than to the desirable consequences or costs of an option (because of how they might be affected personally or because of their perceptions of the relative importance of consequences for others)
● Morally disapproving (i.e. in relation to ethical principles such as autonomy, nonmaleficence, beneficence or justice)

Is the option feasible to implement?
Questions: Can the option be accomplished or brought about?
Explanation: The less feasible (capable of being accomplished or brought about) an option is, the less likely it is that it should be recommended (i.e. the more barriers there are that would be difficult to overcome).

(1) The "certainty of the evidence" is an assessment of the likelihood that the effect will be substantially different from what the research found.
Explanations of the conclusions in the framework
Suggestions for how to make judgements in relation to each conclusion are provided in: Framework for going from evidence to a recommendation – Guidance for health system and public health recommendations. For each conclusion, we suggest one or more questions to consider when making a judgement and explain what is needed.

Overall judgement across all criteria
Question: What is the overall balance between all the desirable and undesirable consequences?
Explanation: An overall judgement whether the desirable consequences outweigh the undesirable consequences, or vice versa (based on all the research evidence and additional information considered in relation to all the criteria). Consequences include health and other benefits, adverse effects and other harms, resource use, and impacts on equity.

Type of recommendation
Question: Based on the balance of the consequences in relation to all of the criteria in the framework, what is your recommendation?
Explanation: A recommendation based on the balance of consequences and your judgements in relation to all of the criteria, for example:
● Not to implement the option
● To consider the option only in the context of rigorous research
● To consider the option only with specified monitoring and evaluation
● To consider the option only in specified contexts
● To implement the option

Recommendation (text)
Question: What is your recommendation in plain language?
Explanation: A concise, clear and actionable recommendation.

Justification
Question: What is the justification for the recommendation, based on the criteria in the framework that drove the recommendation?
Explanation: A concise summary of the reasoning underlying the recommendation.

Subgroup considerations
Question: What, if any, subgroups were considered, and what, if any, specific factors (based on the criteria in the framework) should be considered in relation to those subgroups when implementing the option?
Explanation: A concise summary of the subgroups that were considered and any modifications of the recommendation in relation to any of those subgroups.

Implementation considerations
Question: What should be considered when implementing the option, including strategies to address concerns about acceptability and feasibility?
Explanation: Key considerations, including strategies to address concerns about acceptability and feasibility, when implementing the option.

Monitoring and evaluation considerations
Question: What indicators should be monitored? Is there a need to evaluate the impacts of the option, either in a pilot study or an impact evaluation carried out alongside or before full implementation of the option?
Explanation: Any important indicators that should be monitored if the option is implemented.

Research priorities
Question: Are there any important uncertainties in relation to any of the criteria that are a priority for further research?
Explanation: Any research priorities.
3.1 Steps for considering the relative importance of Explanations of terms used in summaries of findings
outcomes
3.2 Influence of perspective Term Explanation
3.3 Using evidence in rating the importance of Outcomes These are all the outcomes (potential benefits or harms) that are considered to be important to those affected by the
outcomes intervention, and which are important to making a recommendation or decision. Consultation with those affected by an
3.4 Surrogate (substitute) outcomes intervention (such as patients and their carers) or other members of the public may be used to select the important
outcomes. A review of the literature may also be carried out to inform the selection of the important outcomes. The
4. Summarizing the evidence importance (or value) of each outcome in relation to the other outcomes should also be considered. This is the relative
4.1 Evidence Tables importance of the outcome.
4.2 GRADE Evidence Profile 95% Confidence A confidence interval is a range around an estimate that conveys how precise the estimate is. The confidence interval
4.3 Summary of Findings table Interval (CI) is a guide to how sure we can be about the quantity we are interested in. The narrower the range between the two
5. Quality of evidence numbers, the more confident we can be about what the true value is; the wider the range, the less sure we can be. The
width of the confidence interval reflects the extent to which chance may be responsible for the observed estimate (with a
5.1 Factors determining the quality of evidence wider interval reflecting more chance). 95% Confidence Interval (CI) means that we can be 95 percent confident that
5.1.1 Study design the true size of effect is between the lower and upper confidence limit. Conversely, there is a 5 percent chance that the
true effect is outside of this range.
5.2 Factors that can reduce the quality of the
evidence Relative Effect or Here the relative effect is expressed as a risk ratio (RR). Risk is the probability of an outcome occurring. A risk
RR (Risk Ratio): The ratio of the risk in the intervention group to the risk in the control group. For example, if the risk in the intervention group is 1% (10 per 1000) and the risk in the control group is 10% (100 per 1000), the relative effect is 10/100, or 0.10. If the RR is exactly 1.0, there is no difference in the occurrence of the outcome between the intervention and the control group. If the RR is greater than 1.0, the intervention increases the risk of the outcome; for a good outcome (for example, the birth of a healthy baby) this indicates a desirable effect of the intervention, whereas for a bad outcome (for example, death) it indicates an undesirable effect. If the RR is less than 1.0, the intervention decreases the risk of the outcome; this indicates a desirable effect for a bad outcome (for example, death) and an undesirable effect for a good outcome (for example, the birth of a healthy baby).

Certainty of the evidence: An assessment of how good an indication the research provides of the likely effect, i.e. the likelihood that the effect will be substantially different from what the research found. By "substantially different" we mean a large enough difference that it might affect a decision. This assessment is based on an overall judgement of the reasons for there being more or less certainty, using the GRADE approach. In the context of decisions, these considerations include the applicability of the evidence in a specific context. Other terms may be used synonymously with certainty of the evidence, including quality of the evidence, confidence in the estimate, and strength of the evidence. Definitions of the categories used to rate the certainty of the evidence (high, moderate, low, and very low) are provided in the table below.

Definitions for ratings of the certainty of the evidence
High: This research provides a very good indication of the likely effect. The likelihood that the effect will be substantially different is low.
Moderate: This research provides a good indication of the likely effect. The likelihood that the effect will be substantially different is moderate.
Low: This research provides some indication of the likely effect. However, the likelihood that it will be substantially different (a large enough difference that it might have an effect on a decision) is high.
Very low: This research does not provide a reliable indication of the likely effect. The likelihood that the effect will be substantially different (a large enough difference that it might have an effect on a decision) is very high.
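To make the arithmetic in the RR definition above concrete, here is a short worked sketch in Python; the 1% and 10% risks are the illustrative figures from the definition, not data from any study.

# Risk ratio (RR) from the risks in the two groups, using the illustrative figures above:
# 10 events per 1000 in the intervention group, 100 per 1000 in the control group.
intervention_risk = 10 / 1000   # 1%
control_risk = 100 / 1000       # 10%

rr = intervention_risk / control_risk
print(f"RR = {rr:.2f}")  # RR = 0.10: the intervention decreases the risk of the outcome

# An RR greater than 1.0 would indicate an increased risk; an RR of exactly 1.0 indicates no difference.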
7. The GRADE approach for diagnostic tests and strategies

Recommendations concerning diagnostic testing share the fundamental logic of recommendations for therapeutic and other interventions, such as screening. However, diagnostic questions also present unique challenges.

While some tests naturally report positive and negative results (e.g., pregnancy, HIV infection), other tests report their results as ordinal (e.g., Glasgow coma scale or mini-mental status examination) or continuous variables (e.g., metabolic measures), usually with increasing likelihood of disease or adverse events as the test results become more extreme. For simplicity, in this discussion we generally assume a diagnostic approach that ultimately categorizes test results as positive or negative. This also recognizes that many tests ultimately lead to dichotomized decisions to treat or not to treat.

Clinicians and researchers often administer diagnostic tests as a package or strategy composed of several tests. Thus, one can often think of evaluating or recommending a diagnostic strategy rather than a single test.

Examples
1. In managing patients with a diagnosis of cervical intraepithelial neoplasia, a precursor of cervical cancer, clinicians may proceed to treatment directly based on visual inspection with acetic acid (VIA), or apply a strategy of testing for human papilloma virus and VIA.
2. A testing strategy may use an initial sensitive but non-specific test which, if positive, is followed by a more specific test (e.g., testing for HIV includes the use of an ELISA test followed by quantitative HIV RNA determination for those with positive results of the ELISA test; but one could ask why quantitative HIV RNA determination alone would not be appropriate).
7.1. Questions about diagnostic tests

The format of the question asked by authors of systematic reviews or guideline developers follows the same principles as the format for management questions:
- Should TEST A vs. TEST B be used in SOME PATIENTS/POPULATION?
- Should TEST A vs. TEST B be used for SOME PURPOSE?
7.1.1. Establishing the purpose of a test

Guideline panels should be explicit about the purpose of the test in question. Researchers and clinicians apply medical tests that are usually referred to as "diagnostic" – including signs and symptoms, imaging, biochemistry, pathology, and psychological testing – for a number of purposes. These applications include identifying physiological derangements, establishing prognosis, monitoring illness and treatment response, screening and diagnosis.
7.1.2. Establishing the role of a test

Guideline panels and authors of systematic reviews should also clearly establish the role of a diagnostic test or strategy. This process should begin with determining the standard diagnostic pathway – or pathways – for the target patient presentation and identifying the associated limitations. Knowing those limitations, one can identify particular shortcomings for which the alternative diagnostic test or strategy offers a putative remedy. The purpose of a test under consideration may be (i) replacement (e.g., of tests with greater burden, invasiveness, cost, or inferior accuracy), (ii) triage (e.g., to minimize use of an invasive or expensive test), or (iii) add-on (e.g., to further enhance diagnostic accuracy beyond the existing diagnostic pathway) (Table 7.1) [Bossuyt 2006; PMID: 16675820].
Table 7.1. Possible roles of new diagnostic tests
Replacement: A new test might substitute for an old one because it is more accurate, less invasive, less risky or uncomfortable for patients, organizationally or technically less challenging, quicker to yield results or more easily interpreted, or less costly.
Triage: A new test is added before the existing diagnostic pathway, and only patients with a particular result on the triage test continue the testing pathway; triage tests are not necessarily more accurate but are usually simpler and less costly.
Add-on: A new test is added after the existing diagnostic pathway and may be used to limit the number of either false positive or false negative results after the existing diagnostic pathway; add-on tests are usually more accurate but otherwise less attractive than existing tests.
7.1.3. Clear clinical questions

Clearly establishing the role or purpose of a test or test strategy will lead to the identification of sensible clinical questions that, similar to other management problems, have four components: patients, diagnostic intervention (strategy), comparison diagnostic intervention (strategy), and the outcomes of interest.

Examples
1: In patients suspected of coronary artery disease (patients), should multi-slice spiral computed tomography (CT) of coronary arteries (intervention) be used as a replacement for conventional invasive coronary angiography (comparison) to lower complications, with acceptable rates of false negatives associated with coronary events and false positives leading to unnecessary treatment and complications (outcomes)?
This example illustrates one common rationale for a new test – test replacement (coronary CT instead of conventional angiography) to avoid complications associated with a more invasive and expensive alternative for a condition that can effectively be treated. In this situation, the new test would only need to replicate the results of the existing test to demonstrate greater patient net benefit. This assumes that the new test similarly categorizes patients at the same stage of the disease and that the consequences of the test result, i.e. management decisions and outcomes, are similar.
2: In patients suspected of cow's milk allergy (CMA), should skin prick tests rather than an oral food challenge with cow's milk be used for the diagnosis and management of IgE-mediated CMA?
3: In adults cared for in a non-specialized clinical setting, should serum or plasma cystatin C rather than serum creatinine concentration be used for the diagnosis and management of renal impairment?
7.2. Gold standard and reference test

The concept of diagnostic accuracy relies on the presence of a so-called "gold standard", i.e. a clearly stated definition of the target disease (i.e. the construct of a disease). However, the term "gold standard" is ambiguous and not consistently defined. Moreover, constructs of diseases are constantly changing with progress in understanding biology (e.g. in oncology, with a more molecular understanding of the underlying pathologies, or in Alzheimer's dementia). We will use the term "gold standard" here as representing the "perfect" approach to defining or diagnosing the disease or condition of interest, even if the approach is theoretical and based on convention. Following from this definition, diagnostic test accuracy (e.g. sensitivity and specificity) as a measurement property is not associated with a "gold standard". We will use the term "reference standard" or reference test for the test or test strategy that is the current best and accepted approach to making a diagnosis, against which a comparison (with an index test) may be made.
7.3. Estimating impact on patients
It follows that recommendations regarding the use of medical tests require inferences about the consequences of falsely identifying patients as having or not having the disease. If a test fails to improve patient-important outcomes there is no reason to use it, whatever its accuracy. Given the uncertainties about both reference and gold standards and the relation between diagnosis and patient or population consequences, the best way to assess a diagnostic test or strategy would be a test-treat randomized controlled trial in which investigators allocate patients to experimental or control diagnostic approaches and measure patient-important outcomes (mortality, morbidity, symptoms, quality of life and resource use).

Figure 1. Generic study designs that guideline developers can use to evaluate the impact of testing.
Example (panel a): Randomized controlled trials (RCTs) explored a diagnostic strategy guided by the use of B-type natriuretic peptide (BNP) – designed to aid diagnosis of heart failure – compared with no use of BNP in patients presenting to the emergency department with acute dyspnea. As it turned out, the group randomized to receive BNP spent a shorter time in the hospital at lower cost, with no increased mortality or morbidity.

Example (panel b): Consistent evidence from well designed studies demonstrates fewer false negative results with non-contrast helical CT than with intravenous pyelography (IVP) in the diagnosis of suspected acute urolithiasis. However, the stones in the ureter that CT detects but IVP "misses" are smaller, and hence are likely to pass more easily. Since RCTs evaluating the outcomes in patients treated for smaller stones are not available, the extent to which reduction in cases that are missed (false negatives) and follow-up of incidental findings unrelated to renal calculi with CT have important health benefits remains uncertain.

Two generic ways in which one can evaluate a test or diagnostic strategy: a) Patients are randomized to a new test or strategy or, alternatively, to an old test or strategy. Those with a positive test (cases detected) are randomized (or were previously randomized) to receive the best available management (the second step of randomization for management is not shown in this figure). Investigators evaluate and compare patient-important outcomes in all patients in both groups. b) Patients receive both a new test and a reference test (often, however, the reference is the old or comparator test or strategy). Investigators can then calculate the accuracy of the test compared to the reference test (first step). To make judgments about the patient-importance of this information, patients with a positive test (or strategy) in either group are (or have been in previous studies) submitted to treatment or no treatment; investigators then evaluate and compare patient-important outcomes in all patients in both groups (second step).
When diagnostic intervention studies (RCTs or observational studies) comparing alternative diagnostic strategies with assessment of direct patient-important outcomes are available, guideline panels can use the GRADE approach described for other interventions.

If studies measuring the impact of testing on patient-important or population-important outcomes are not available, guideline panels must focus on other studies, such as diagnostic test accuracy studies, and make inferences about the likely impact of using alternative tests on patient-important outcomes. In the latter situation, diagnostic accuracy can be considered a surrogate outcome for patient-important benefits and harms.

Key questions when using test accuracy as a surrogate are (a worked sketch of the underlying arithmetic follows this list):
● what outcomes can those labeled as cases and those labeled as not having a disease expect based on the knowledge about the best available management?
● will there be a reduction in false negatives (cases missed) or false positives and corresponding increases in true positives and true negatives?
● how similar (or different) are people to whom the test is applied and classified accurately by the alternative testing strategies to those evaluated in studies?
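The judgments behind these questions rest on simple arithmetic that links a test's sensitivity and specificity to the expected numbers of correctly and incorrectly classified patients at an assumed prevalence. The following is a short sketch in Python; the prevalence, sensitivity and specificity values are hypothetical illustrations, not results from any study cited in this handbook.

def classifications_per_1000(prevalence, sensitivity, specificity, n=1000):
    """Expected true/false positives and negatives per n patients tested."""
    with_condition = prevalence * n
    without_condition = n - with_condition
    tp = sensitivity * with_condition        # correctly identified cases
    fn = with_condition - tp                 # cases missed (false negatives)
    tn = specificity * without_condition     # correctly reassured
    fp = without_condition - tn              # falsely labelled as having the condition
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}

# Illustrative values only: prevalence 30%, sensitivity 90%, specificity 80%.
print(classifications_per_1000(prevalence=0.30, sensitivity=0.90, specificity=0.80))
# Expected counts per 1000 tested: TP 270, FN 30, TN 560, FP 140.

Comparing such tabulations for two testing strategies makes explicit how many additional cases would be detected or missed, which is the starting point for the judgments listed above.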
7.4. Indirect evidence and impact on patient-important outcomes

A recommendation associated with a diagnostic question follows from an evaluation of the balance between the desirable and undesirable consequences of the diagnostic test or strategy. It should be based on a systematic review addressing the clinical question as well as information about management after applying the diagnostic test.

Inferring from accuracy data that a diagnostic test or strategy improves patient-important outcomes usually requires access to effective management. Alternatively, even when no effective treatment is available, using an accurate test may be beneficial if it reduces adverse effects, cost, or anxiety by excluding an ominous diagnosis, or if confirming a diagnosis improves patient well-being through the prognostic information it imparts. Before drawing such inferences, judgments about the confidence in the diagnostic accuracy information are required.
7.5. Judgment about the quality of the underlying evidence

As described above, when studies of the kind shown in Figure 1a are available, the approach to assessing the confidence in effect estimates (quality of evidence) described for other interventions in earlier chapters of this handbook should be used. The rest of this chapter focuses on the situation when such direct data on patient-important outcomes are lacking and the body of evidence is derived from diagnostic test accuracy (DTA) studies. We therefore provide guidance for assessing the confidence in estimates for those synthesizing information from DTA studies, e.g. authors of systematic reviews. Summary of findings (SoF) tables and GRADE evidence profiles provide transparent accounts of this information by summarizing numerical information and ratings of the confidence in these estimates.
7.5.1. Initial study design

In a typical test accuracy study, a consecutive series of patients suspected of a particular condition is subjected to the index test (the test being evaluated) and then all patients receive a reference or gold standard (the best available method to establish the presence of the target condition). While in the GRADE approach appropriate accuracy studies (see below) start as high quality evidence about diagnostic accuracy, these studies are vulnerable to limitations and often lead to low quality evidence to support guideline recommendations, mostly owing to indirectness of evidence associated with diagnostic accuracy being only a surrogate for patient-important outcomes.
7.5.2. Factors that determine and can decrease the quality of evidence

Table 7.2. Factors that decrease the quality of evidence for studies of diagnostic accuracy and how they differ from evidence for other interventions

Study design – different criteria for accuracy studies: Cross-sectional or cohort studies in patients with diagnostic uncertainty and direct comparison of test results with an appropriate reference standard (the best possible alternative test strategy) are considered high quality and can move to moderate, low or very low depending on other factors.

Risk of bias (limitations in study design and execution) – different criteria for accuracy studies:
- Representativeness of the population that was intended to be sampled.
- Independent comparison with the best alternative test strategy.
- All enrolled patients should receive the new test and the best alternative test strategy.
- Diagnostic uncertainty should be given.
- Is the reference standard likely to correctly classify the target condition?

Indirectness (patient population, diagnostic test, comparison test and indirect comparisons of tests) – similar criteria: The quality of evidence can be lowered if there are important differences between the populations studied and those for whom the recommendation is intended (in prior testing, the spectrum of disease or co-morbidity); if there are important differences in the tests studied and the diagnostic expertise of those applying them in the studies compared to the settings for which the recommendations are intended; or if the tests being compared are each compared to a reference (gold) standard in different studies and not directly compared in the same studies. In addition, panels assessing diagnostic tests often face an absence of direct evidence about impact on patient-important outcomes. They must make deductions from diagnostic test studies about the balance between the presumed influences on patient-important outcomes of any differences in true and false positives and true and false negatives, in relationship to test complications and costs. Therefore, accuracy studies typically provide low quality evidence for making recommendations due to indirectness of the outcomes, similar to surrogate outcomes for treatments.

Important inconsistency in study results – similar criteria: For accuracy studies, unexplained inconsistency in sensitivity, specificity or likelihood ratios (rather than relative risks or mean differences) can lower the quality of evidence (the standard relations among these measures are sketched below the table).

Imprecise evidence – similar criteria: For accuracy studies, wide confidence intervals for estimates of test accuracy, or for true and false positive and negative rates, can lower the quality of evidence.

High probability of publication bias – similar criteria: A high risk of publication bias (e.g., evidence only from small studies supporting a new test, or asymmetry in a funnel plot) can lower the quality of evidence.

Upgrading for dose effect, large effects, and residual plausible bias and confounding – similar criteria: For all of these factors, methods have not been properly developed for accuracy studies. However, a dose effect may be apparent (e.g., increasing levels of anticoagulation measured by the INR increase the likelihood of vitamin K deficiency or of exposure to vitamin K antagonists), and a very large likelihood of disease (not of patient-important outcomes) associated with test results may increase the quality of evidence. However, there is some disagreement about whether and how dose effects play a role in assessing the quality of evidence in DTA studies.
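For readers less familiar with the accuracy measures named in the table, the standard relations between sensitivity, specificity and likelihood ratios can be illustrated as follows; this is a brief Python sketch with hypothetical input values, not figures taken from any study in this handbook.

# Standard relations between sensitivity, specificity and likelihood ratios.
sensitivity = 0.90   # hypothetical value
specificity = 0.80   # hypothetical value

lr_positive = sensitivity / (1 - specificity)   # how strongly a positive result raises the odds of the condition
lr_negative = (1 - sensitivity) / specificity   # how strongly a negative result lowers the odds of the condition

print(f"LR+ = {lr_positive:.2f}, LR- = {lr_negative:.3f}")  # LR+ = 4.50, LR- = 0.125

Unexplained variation across studies in any of these quantities is what the inconsistency criterion in Table 7.2 refers to.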
7.5.2.1. Risk of bias

Several instruments for the evaluation of risk of bias in DTA studies are available. The Cochrane Collaboration suggests a selection of the items from the QUADAS [Whiting 2003; PMID 14606960] and QUADAS-2 [Whiting 2011; PMID 22007046] instruments. Authors of systematic reviews and guideline panels can use the criteria from the QUADAS list (Table 7.3) to assess the risk of bias within and across studies.

Serious limitations in a body of evidence indicating risk of bias, if found, will likely lead to downgrading the quality of evidence by one or two levels.

Table 7.3. Quality criteria of diagnostic accuracy studies derived from QUADAS (Reitsma 2009; http://srdta.cochrane.org/)
1. Was the spectrum of patients representative of the patients who will receive the test in practice? (representative spectrum)
2. Is the reference standard likely to classify the target condition correctly? (acceptable reference standard)
3. Is the time period between reference standard and index test short enough to be reasonably sure that the target condition did not change between the two tests? (acceptable delay between tests)
4. Did the whole sample, or a random selection of the sample, receive verification using the intended reference standard? (partial verification avoided)
5. Did patients receive the same reference standard irrespective of the index test result? (differential verification avoided)
6. Was the reference standard independent of the index test (i.e. the index test did not form part of the reference standard)? (incorporation avoided)
7. Were the reference standard results interpreted without knowledge of the results of the index test? (index test results blinded)
8. Were the index test results interpreted without knowledge of the results of the reference standard? (reference standard results blinded)
9. Were the same clinical data available when test results were interpreted as would be available when the test is used in practice? (relevant clinical information)
10. Were uninterpretable/intermediate test results reported? (uninterpretable results reported)
11. Were withdrawals from the study explained? (withdrawals explained)
Table 7.4. Quality criteria of diagnostic accuracy studies derived from QUADAS-2

Domain: Patient Selection
- Description: Describe methods of patient selection. Describe included patients (previous testing, presentation, intended use of index test, and setting).
- Signaling questions (yes, no, or unclear): Was a consecutive or random sample of patients enrolled? Was a case-control design avoided? Did the study avoid inappropriate exclusions?
- Risk of bias (high, low, or unclear): Could the selection of patients have introduced bias?

Domain: Index Test
- Description: Describe the index test and how it was conducted and interpreted.
- Signaling questions: Were the index test results interpreted without knowledge of the results of the reference standard? If a threshold was used, was it pre-specified?
- Risk of bias: Could the conduct or interpretation of the index test have introduced bias?

Domain: Reference Standard
- Description: Describe the reference standard and how it was conducted and interpreted.
- Signaling questions: Is the reference standard likely to correctly classify the target condition? Were the reference standard results interpreted without knowledge of the results of the index test?
- Risk of bias: Could the reference standard, its conduct, or its interpretation have introduced bias?

Domain: Flow and Timing
- Description: Describe any patients who did not receive the index tests or reference standard or who were excluded from the 2 x 2 table (refer to flow diagram). Describe the interval and any interventions between index tests and the reference standard.
- Signaling questions: Was there an appropriate interval between index tests and reference standard? Did all patients receive a reference standard? Did all patients receive the same reference standard? Were all patients included in the analysis?
- Risk of bias: Could the patient flow have introduced bias?
7.5.2.2. Indirectness of the evidence

Judging indirectness of the evidence presents additional and probably greater challenges for authors of systematic reviews of diagnostic test accuracy and for guideline panels making recommendations about diagnostic tests. First, as with therapeutic interventions, indirectness must be assessed in relation to the population, setting, the intervention (the new or index test) and the comparator (another investigated test or the reference standard). For instance, a judgment of indirectness of the population can result from a different test setting (patients seen in an emergency department may differ from patients seen in a general practitioner's office), from differences among the patients included in the studies of interest, or because the target condition of the population in the studies is not the same as in the question asked.

If the clinical question is about the choice between two tests, neither of which is a reference standard, one needs to assess whether the two tests were compared directly against each other and the reference test in the same study, or in separate studies in which each test was compared separately against the reference standard. For example, a systematic review comparing the diagnostic accuracy of two tests for renal insufficiency – serum creatinine and serum cystatin C – identified a number of studies that performed serum tests for both creatinine and cystatin C and the reference standard in the same patients (Table 7.5).
Table 7.5. Diagnostic accuracy SoF table: cystatin C vs. creatinine in diagnosis of renal failure
Unlike for management questions, if only diagnostic accuracy information is available, the assessment of indirectness requires additional judgments about how the correct and incorrect classification of subjects as having or not having a target condition relates to patient-important outcomes. While authors of systematic reviews will frequently skip this assessment because their interest may relate only to the review of the diagnostic accuracy, guideline panels must always make this judgment – either implicitly or, better, explicitly and transparently.
7.5.2.3. Inconsistency, imprecision, publication bias and upgrading for dose effect, large estimates of accuracy and residual plausible confounding

Although these criteria are applicable to a body of evidence from studies of diagnostic test accuracy, the methods to determine whether a particular criterion is met are less well established compared with the evidence about the effects of therapeutic interventions. Further theoretical and empirical work is required to provide guidance on how to assess those criteria.
7.5.3. Overall confidence in estimates of effects

Tables 7.6 and 7.7 show the assessment of the confidence in the estimates and the SoF table of all critical outcomes for the comparison of computed tomography (CT) angiography with invasive angiography (the reference standard) in patients suspected of coronary artery disease.

Table 7.6. Quality assessment of diagnostic accuracy studies – example: should multi-slice spiral computed tomography instead of conventional coronary angiography be used for diagnosis of coronary artery disease?
Table 7.7. Summary of findings of all critical outcomes for the comparison of computed tomography (CT) angiography with invasive angiography (the reference standard) in patients suspected of coronary artery disease.
The original accuracy studies were well planned and executed, the results are precise, and one does not suspect relevant publication bias. However, there are problems with inconsistency. Reviewers addressing the relative merits of CT versus invasive angiography for diagnosis of coronary disease found important heterogeneity in the results for the proportion of invasive angiography-negative patients with a positive CT test result (which determines specificity) and in the results for the proportion of angiography-positive patients with a negative CT test result (which determines sensitivity) that they could not explain (Figure 2). This heterogeneity was also present for other measures of diagnostic test accuracy (i.e. positive and negative likelihood ratios and diagnostic odds ratios). Unexplained heterogeneity in the results across studies reduced the quality of evidence for all outcomes.
Figure 2. Example for heterogeneity in diagnostic test results
Sensitivity and specificity of multi-slice coronary CT compared with coronary angiogram (from reference 4). This heterogeneity also existed for likelihood ratios and diagnostic odds ratios.
8. Criteria for determining whether the GRADE approach was used
One of the aims of the GRADE Working Group is to reduce unnecessary confusion arising from multiple systems for grading quality of evidence and strength of recommendations. To avoid adding to this confusion by having multiple variations of the GRADE system, we suggest that the criteria below should be met when saying that the GRADE approach was used. Also, while users may believe there are good reasons for modifying the GRADE system, we discourage the use of "modified" GRADE approaches that differ substantially from the approach described by the GRADE Working Group.

However, we encourage and welcome constructive criticism of the GRADE approach, suggestions for improvements, and involvement in the GRADE Working Group. Like most scientific approaches to advancing healthcare, the GRADE approach will continue to evolve in response to new research and to meet the needs of authors of systematic reviews, guideline developers and other users.

Checklist: Suggested criteria for stating that the GRADE system was used
1. Definition of quality of evidence: The quality of evidence (confidence in the estimated effects) should be defined consistently with the definitions (for guidelines or for systematic reviews) used by the GRADE Working Group.
2. Criteria for assessing the quality of evidence: Explicit consideration should be given to each of the eight GRADE criteria for assessing the quality of evidence (risk of bias, directness of evidence, consistency and precision of results, risk of publication bias, magnitude of the effect, dose-response gradient, and influence of residual plausible confounding), although different terminology may be used.
3. Quality of evidence for each outcome: The quality of evidence (confidence in the estimated effects) should be assessed for each important outcome and expressed using four categories (e.g. high, moderate, low, very low) or, if justified, three categories (e.g. high, moderate, and low [low and very low being reduced to one category]) based on consideration of the above factors (see point 2), with a suggested interpretation of each category that is consistent with the interpretation used by the GRADE Working Group.
4. Summaries of evidence: Evidence tables or detailed narrative summaries of evidence, transparently describing judgements about the factors in point 2 above, should be used as the basis for judgements about the quality of evidence and the strength of recommendations. Ideally, full evidence profiles suggested by the GRADE Working Group should be used and these should be based on systematic reviews. At a minimum, the evidence that was assessed and the methods that were used to identify and appraise that evidence should be clearly described. In particular, reasons for downgrading and upgrading the quality of evidence should be described transparently.
5. Criteria for determining the strength of a recommendation: Explicit consideration should be given to each of the four GRADE criteria for determining the strength of a recommendation (the balance of desirable and undesirable consequences, quality of evidence, values and preferences of those affected, and resource use) and a general approach should be reported (e.g. if and how costs were considered, whose values and preferences were assumed, etc.).
6. Strength of recommendation terminology: The strength of recommendation for or against a specific management option should be expressed using two categories (weak and strong) and the definitions/interpretation for each category should be consistent with those used by the GRADE Working Group. Different terminology to express weak and strong recommendations may be used (e.g. an alternative wording for weak recommendations is "conditional"), although the interpretation and implications should be preserved.
7. Reporting of judgements: Ideally, decisions about the strength of the recommendations should be transparently reported.
9. Glossary of terms and concepts
This glossary is partially based on the glossary of the Cochrane Collaboration and the Users' Guides to the Medical Literature, with permission.

Absolute risk reduction (ARR): Synonym of the risk difference (RD). The difference in the risk between two groups. For example, if one group has a 15% risk of contracting a particular disease, and the other has a 10% risk of getting the disease, the risk difference is 5 percentage points.
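The arithmetic in this entry can be illustrated with a brief Python sketch; the 15% and 10% risks are the illustrative figures above, not study data.

# Risk difference (absolute risk reduction) for the illustrative risks in this entry.
risk_group_1 = 0.15   # 15% risk of contracting the disease
risk_group_2 = 0.10   # 10% risk

risk_difference = risk_group_1 - risk_group_2
print(f"ARR/RD = {risk_difference:.2f}")   # 0.05, i.e. 5 percentage points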
Baseline risk: Synonym of control group risk.

Bias: A systematic error or deviation in results or inferences from the truth. In studies of the effects of health care, the main types of bias arise from systematic differences in the groups that are compared (selection bias), the care that is provided or exposure to other factors apart from the intervention of interest (performance bias), withdrawals or exclusions of people entered into a study (attrition bias), or how outcomes are assessed (detection bias). Systematic reviews of studies may also be particularly affected by reporting bias, where a biased subset of all the relevant data is available.

Burden: Burdens are the demands that patients or caregivers (e.g. family) may dislike, such as having to take medication or the inconvenience of going to the doctor's office.

Case series: A study reporting observations on a series of individuals, usually all receiving the same intervention, with no control group.
Case report: A study reporting observations on a single individual. Also called: anecdote, case history, or case study.

Case-control study: An observational study that compares people with a specific disease or outcome of interest (cases) to people from the same population without that disease or outcome (controls), and which seeks to find associations between the outcome and prior exposure to particular risk factors. This design is particularly useful where the outcome is rare and past exposure can be reliably measured. Case-control studies are usually retrospective, but not always.

Categorical data: Data that are classified into two or more non-overlapping categories. Gender and type of drug (aspirin, paracetamol, etc.) are examples of categorical variables.

Clinical practice guideline (CPG): A systematically developed statement to assist practitioner and patient decisions about appropriate health care for specific clinical circumstances.

Cohort study: An observational study in which a defined group of people (the cohort) is followed over time. The outcomes of people in subsets of this cohort are compared, to examine people who were exposed or not exposed (or exposed at different levels) to a particular intervention or other factor of interest. A prospective cohort study assembles participants and follows them into the future. A retrospective (or historical) cohort study identifies subjects from past records and follows them from the time of those records to the present.

Comparison: The intervention against which a new intervention is compared; the control group.
Confidence interval (CI): A measure of the uncertainty around the main finding of a statistical analysis. Estimates of unknown quantities, such as the RR comparing an experimental intervention with a control, are usually presented as a point estimate and a 95% confidence interval. This means that if someone were to keep repeating a study in other samples from the same population, 95% of the calculated confidence intervals from those studies would include the true underlying value. Conceptually easier than this definition is to think of the CI as the range in which the truth plausibly lies. Wider intervals indicate less precision; narrow intervals, greater precision. Alternatives to 95%, such as 90% and 99% confidence intervals, are sometimes used.
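As a purely illustrative sketch (not part of the handbook's methods), the widely used log-scale normal approximation for a 95% confidence interval around a risk ratio can be computed as below; the 2x2 counts are hypothetical.

```python
import math

# Hypothetical 2x2 table:
#                 events   total
# intervention      15      100
# control           30      100
a, n1 = 15, 100   # events and total in the intervention group
c, n2 = 30, 100   # events and total in the control group

rr = (a / n1) / (c / n2)                        # point estimate of the risk ratio
se_log_rr = math.sqrt(1/a - 1/n1 + 1/c - 1/n2)  # standard error of ln(RR)
z = 1.96                                        # multiplier for a 95% interval

lower = math.exp(math.log(rr) - z * se_log_rr)
upper = math.exp(math.log(rr) + z * se_log_rr)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
# A larger study would give a narrower interval, i.e. greater precision.
```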
Confounder: A factor that is associated with both an intervention (or exposure) and the outcome of interest. For example, if people in the experimental group of a controlled trial are younger than those in the control group, it will be difficult to decide whether a lower risk of death in one group is due to the intervention or the difference in ages. Age is then said to be a confounder, or a confounding variable. Randomisation is used to minimise imbalances in confounding variables between experimental and control groups. Confounding is a major concern in non-randomised studies.

Consumer (healthcare consumer): Someone who uses, is affected by, or who is entitled to use a health related service.

Context: The conditions and circumstances that are relevant to the application of an intervention, for example the setting (in hospital, at home, in the air); the time (working day, holiday, night-time); the type of practice (primary, secondary, tertiary care; private practice, insurance practice, charity); whether routine or emergency. Also called clinical situation.

Continuous data: Data with a potentially infinite number of possible values within a given range. Height, weight and blood pressure are examples of continuous variables.

Control: In a controlled trial a control is a participant in the arm that acts as a comparator for one or more experimental interventions. Controls may receive placebo, no treatment, standard treatment, or an active intervention, such as a standard drug. In an observational study a control is a person in the group without the disease or outcome of interest.

Control group risk: The observed risk of the event in the control group; synonym of baseline risk. The control group risk for an outcome is calculated by dividing the number of people with the outcome in the control group by the total number of participants in the control group.

Critical appraisal: The process of assessing and interpreting evidence by systematically considering its validity, results, and relevance.

Desirable effect: A desirable effect of adherence to a recommendation can include beneficial health outcomes, less burden and savings.

Dose response gradient: The relationship between the quantity of treatment given and its effect on outcome.

Effect size (ES): A generic term for the estimate of effect of treatment for a study. Sometimes the term is used to refer to the standardized mean difference. To facilitate understanding we suggest the interpretation of the effect size offered by Cohen (Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed; 1988). According to this interpretation, an effect size or standardized mean difference of around:
● 0.2 is considered a small effect
● 0.5 is considered a moderate effect
● 0.8 or higher is considered a large effect.
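As a rough illustration of how these thresholds might be applied, the short sketch below computes a standardized mean difference from hypothetical summary data (all numbers invented for the example) and labels it using the categories above.

```python
import math

# Hypothetical pain-score (0-100) summary data for two trial arms
mean_treatment, sd_treatment, n_treatment = 42.0, 18.0, 120
mean_control,   sd_control,   n_control   = 50.0, 20.0, 115

# Pooled standard deviation, then the standardized mean difference (Cohen's d)
pooled_sd = math.sqrt(((n_treatment - 1) * sd_treatment**2 +
                       (n_control   - 1) * sd_control**2) /
                      (n_treatment + n_control - 2))
smd = (mean_treatment - mean_control) / pooled_sd

# Classify the magnitude using Cohen's rough thresholds quoted above
magnitude = ("small" if abs(smd) < 0.5 else
             "moderate" if abs(smd) < 0.8 else
             "large")
print(f"SMD = {smd:.2f} ({magnitude} effect)")
```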
Effectiveness: The extent to which an intervention produces a beneficial result under the usual, real-world conditions of health care practice. Clinical trials that assess effectiveness are sometimes called pragmatic or management trials.

Efficacy: The extent to which an intervention produces a beneficial result under ideal conditions. Clinical trials that assess efficacy are sometimes called explanatory trials.
Estimate of effect: The observed relationship between an intervention and an outcome expressed as, for example, a number needed to treat, odds ratio, risk difference, risk ratio, relative risk reduction, standardised mean difference, or weighted mean difference.

External validity: The extent to which results provide a correct basis for generalisations to other circumstances. For instance, a meta-analysis of trials of elderly patients may not be generalizable to children. Also called generalizability or applicability.

Follow-up: The observation over a period of time of study/trial participants to measure outcomes under investigation.
Hazard ratio (HR): A measure of effect produced by a survival analysis, representing the rate at which one group experiences the outcome of interest relative to another group. For example, if the hazard ratio for death for a treatment is 0.5, then we can say that treated patients are likely to die at half the rate of untreated patients.
Intention to treat analysis (ITT): A strategy for analysing data from a randomised controlled trial. All participants are included in the arm to which they were allocated, whether or not they received (or completed) the intervention given to that arm. Intention-to-treat analysis prevents bias caused by the loss of participants, which may disrupt the baseline equivalence established by randomisation and which may reflect non-adherence to the protocol. The term is often misused in trial publications when some participants were excluded.
Internal validity: The extent to which the design and conduct of a study are likely to have prevented bias. Variation in methodological quality can explain variation in the results of studies. More rigorously designed (better quality) trials are more likely to yield results that are closer to the truth.

Intervention: The process of intervening on people, groups, entities, or objects in an experimental study. In controlled trials, the word is sometimes used to describe the regimens in all comparison groups, including placebo and no-treatment arms.

Mean difference (MD): The 'difference in means' is a standard statistic that measures the absolute difference between the mean value in the two groups in a clinical trial. It estimates the amount by which the treatment changes the outcome on average. It can be used as a summary statistic in meta-analysis when outcome measurements in all trials are made on the same scale. Previously referred to as weighted mean difference (WMD).

Meta-analysis: The statistical combination of results from two or more separate studies.

Minimally important difference (MID): The smallest difference in score in the outcome of interest that informed patients or informed proxies perceive as important, either beneficial or harmful, and that would lead the patient or clinician to consider a change in the management.

Number needed to treat (NNT): An estimate of how many people need to receive a treatment before one person would experience a beneficial outcome. For example, if you need to give a stroke prevention drug to 20 people before one stroke is prevented, then the number needed to treat to benefit for that stroke prevention drug is 20. It is estimated as the reciprocal of the risk difference.

Number needed to harm (NNH): A number needed to treat to benefit associated with a harmful effect. It is an estimate of how many people need to receive a treatment before one more person would experience a harmful outcome or one fewer person would experience a beneficial outcome.
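The reciprocal relationship stated above can be written out for the stroke-prevention example: an NNT of 20 corresponds to a risk difference of 5 percentage points, and vice versa.

$$ NNT = \frac{1}{RD} = \frac{1}{0.05} = 20 $$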
Observational study: A study in which the investigators do not seek to intervene, and simply observe the course of events. Changes or differences in one characteristic (e.g. whether or not people received the intervention of interest) are studied in relation to changes or differences in other characteristic(s) (e.g. whether or not they died), without action by the investigator. There is a greater risk of selection bias than in experimental studies.
Odds ratio (OR): The ratio of the odds of an event in one group to the odds of an event in another group. In studies of treatment effect, the odds in the treatment group are usually divided by the odds in the control group. An odds ratio of one indicates no difference between comparison groups. For undesirable outcomes an OR that is less than one indicates that the intervention was effective in reducing the risk of that outcome. When the risk is small, the value of the odds ratio is similar to the risk ratio, and when events in the control group are infrequent the OR and HR can be assumed to approximate the RR.
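A small sketch with made-up counts can make the last point concrete: with common events the OR exaggerates the RR, whereas with rare events the two measures nearly coincide.

```python
def risk_ratio_and_odds_ratio(a, n1, c, n2):
    """Return (RR, OR) for a 2x2 table: a/n1 events in one group, c/n2 in the other."""
    risk1, risk2 = a / n1, c / n2
    rr = risk1 / risk2
    or_ = (risk1 / (1 - risk1)) / (risk2 / (1 - risk2))
    return rr, or_

# Common events (hypothetical counts): OR overstates the effect relative to RR
print(risk_ratio_and_odds_ratio(40, 100, 60, 100))   # RR = 0.67, OR = 0.44
# Rare events: OR and RR are nearly identical
print(risk_ratio_and_odds_ratio(4, 1000, 6, 1000))   # RR ~ 0.67, OR ~ 0.67
```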
Optimal information size (OIS): The number of patients generated by a conventional sample size calculation for a single trial.
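As a rough illustration of what a "conventional sample size calculation" involves, the sketch below uses the standard normal-approximation formula for comparing two proportions; the control risk, anticipated treatment risk, alpha and power are assumed values chosen only for the example.

```python
import math

def sample_size_two_proportions(p_control, p_treatment, z_alpha=1.96, z_beta=0.84):
    """Approximate per-group sample size for comparing two proportions
    (defaults correspond to two-sided alpha = 0.05 and 80% power)."""
    p_bar = (p_control + p_treatment) / 2
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p_control - p_treatment) ** 2
    return math.ceil(n)

# Example: control risk 20%, anticipated treatment risk 15% (a 25% relative risk reduction)
per_group = sample_size_two_proportions(0.20, 0.15)
print(per_group, "patients per group;", 2 * per_group, "in total")
```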
Outcome: A component of a participant's clinical and functional status after an intervention has been applied, that is used to assess the effectiveness of an intervention.

Point estimate: The results (e.g. mean, weighted mean difference, odds ratio, risk ratio or risk difference) obtained in a sample (a study or a meta-analysis) which are used as the best estimate of what is true for the relevant population from which the sample is taken.

Population: The group of people being studied, usually by taking samples from that population. Populations may be defined by any characteristics, e.g. geography, age group, certain diseases.

Precision: A measure of the likelihood of random errors in the results of a study, meta-analysis or measurement. The less random error, the greater the precision. Confidence intervals around the estimate of effect from each study are one way of expressing precision, with a narrower confidence interval meaning more precision.

Quality of evidence: The extent to which one can be confident that an estimate of effect is correct.

Randomised controlled trial (RCT): An experimental study in which two or more interventions are compared by being randomly allocated to participants. In most trials one intervention is assigned to each individual, but sometimes assignment is to defined groups of individuals (for example, in a household) or interventions are assigned within individuals (for example, in different orders or to different parts of the body).

Relative risk (RR): Synonym of risk ratio. The ratio of risks in two groups. In intervention studies, it is the ratio of the risk in the intervention group to the risk in the control group. A risk ratio of one indicates no difference between comparison groups. For undesirable outcomes, a risk ratio that is less than one indicates that the intervention was effective in reducing the risk of that outcome.

Relative risk reduction (RRR): The proportional reduction in risk in one treatment group compared to another. It is one minus the risk ratio. If the risk ratio is 0.25, then the relative risk reduction is 1 - 0.25 = 0.75, or 75%.
Review Manager (RevMan): Software used for preparing and maintaining Cochrane systematic reviews. RevMan allows you to write and manage systematic review protocols, as well as complete reviews, including text, tables, and study data. It can perform meta-analysis of the data entered, and present the results graphically.
Risk: The proportion of participants experiencing the event of interest. Thus, if out of 100 participants the event (e.g. a stroke) is observed in 32, the risk is 0.32. The control group risk is the risk amongst the control group. The risk may sometimes be referred to as the event rate.

Standardised mean difference (SMD): The difference between two estimated means divided by an estimate of the standard deviation. It is used to combine results from studies using different ways of measuring the same continuous variable, e.g. pain. By expressing the effects as a standardised value, the results can be combined since they have no units. Standardised mean differences are sometimes referred to as a d index.
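Written as a formula (with the pooled standard deviation as one common choice of standardiser, as in the effect size sketch earlier):

$$ SMD = \frac{\bar{x}_1 - \bar{x}_2}{SD_{pooled}} $$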
Statistically significant: A result that is unlikely to have happened by chance. The usual threshold for this judgement is that the results, or more extreme results, would occur by chance with a probability of less than 0.05 if the null hypothesis was true. Statistical tests produce a p-value used to assess this.

Strength of a recommendation: The degree of confidence that the desirable effects of adherence to a recommendation outweigh the undesirable effects.

Surrogate outcome: Outcome measure that is not of direct practical importance but is believed to reflect an outcome that is important; for example, blood pressure is not directly important to patients but it is often used as an outcome in clinical trials because it is a risk factor for stroke and heart attacks. Surrogate outcomes are often physiological or biochemical markers that can be relatively quickly and easily measured, and that are taken as being predictive of important clinical outcomes. They are often used when observation of clinical outcomes requires long follow-up. Also called: intermediary outcomes or surrogate endpoints.

Systematic review: A review of a clearly formulated question that uses systematic and explicit methods to identify, select, and critically appraise relevant research, and to collect and analyse data from the studies that are included in the review. Statistical methods (meta-analysis) may or may not be used to analyse and summarise the results of the included studies.

Undesirable effect: An undesirable effect of adherence to a recommendation can include harms, more burden, and costs.
10. Articles about GRADE

The following is a collection of published documents about the GRADE approach.

Introductory series published in the BMJ (2008)
1. GRADE: an emerging consensus | LINK | PDF | PubMed
2. What is "quality of evidence" and why is it important to clinicians? | LINK | PDF | PubMed
3. Going from evidence to recommendations | LINK | PDF | PubMed
4. Grading quality of evidence and strength of recommendations for diagnostic tests and strategies | LINK | PDF | PubMed
5. Incorporating considerations of resources use into grading recommendations | LINK | PDF | PubMed
6. Use of GRADE grid to reach decisions when consensus is elusive | LINK | PDF | PubMed

Series of articles with examples from the field of allergy published in Allergy (2010)
1. Overview of the GRADE approach and grading quality of evidence about interventions | LINK | PDF | PubMed
2. GRADE approach to grading quality of evidence about diagnostic tests and strategies | LINK | PDF | PubMed
3. GRADE approach to developing recommendations | LINK | PDF | PubMed

Series of detailed articles for authors of guidelines and systematic reviews published in JCE (2011-2014)
1. Introduction: GRADE evidence profiles and summary of findings tables | LINK | PDF | PubMed
2. Framing the question and deciding on important outcomes | LINK | PDF | PubMed
3. Rating the quality of evidence | LINK | PDF | PubMed
4. Rating the quality of evidence: study limitations (risk of bias) | LINK | PDF | PubMed
5. Rating the quality of evidence: publication bias | LINK | PDF | PubMed
6. Rating the quality of evidence: imprecision | LINK | PDF | PubMed
7. Rating the quality of evidence: inconsistency | LINK | PDF | PubMed
8. Rating the quality of evidence: indirectness | LINK | PDF | PubMed
9. Rating up the quality of evidence | LINK | PDF | PubMed
10. Considering resource use and rating the quality of economic evidence | LINK | PDF | PubMed
11. Making an overall rating of confidence in effect estimates for a single outcome and for all outcomes | LINK | PDF | PubMed
12. Preparing Summary of Findings tables for binary outcomes | LINK | PDF | PubMed
13. Preparing Summary of Findings tables for continuous outcomes | LINK | PDF | PubMed
14. Going from evidence to recommendations: the significance and presentation of recommendations | LINK | PDF | PubMed
15. Going from evidence to recommendations: determinants of a recommendation's direction and strength | LINK | PDF | PubMed
16.
17.
18.
19.
20.

Reproducibility of the GRADE approach (2013)
The GRADE approach is reproducible in assessing the quality of evidence of quantitative evidence syntheses | PDF | PubMed
11. Additional resources
Resources for authors of systematic reviews

The Cochrane Handbook
The Cochrane Handbook includes two principal chapters which provide information on how to create Summary of Findings tables using the information from Cochrane systematic reviews and GRADEing the evidence.
Part 2 Chapter 11: Presenting results and 'Summary of findings' tables
Part 2 Chapter 12: Interpreting results and drawing conclusions

General evidence-based medicine resources

The Cochrane Library
The Cochrane Library contains high-quality, independent evidence to inform healthcare decision-making. It includes reliable evidence from Cochrane and other systematic reviews, clinical trials, and more. Cochrane reviews bring you the combined results of the world's best medical research studies, and are recognised as the gold standard in evidence-based health care.

The Cochrane Handbook
The Cochrane Handbook for Systematic Reviews of Interventions (the Handbook) provides guidance to authors for the preparation of Cochrane Intervention reviews (including Cochrane Overviews of reviews). The Handbook is updated regularly to reflect advances in systematic review methodology and in response to feedback from users.

Users' Guides to the Medical Literature
A complete set of Users' Guides to find, evaluate and use medical literature, originally published as a series in the Journal of the American Medical Association (JAMA). Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice (Interactive) presents the sophisticated concepts of evidence-based medicine (EBM) in unique ways that can be used to determine diagnosis, decide optimal therapy, and predict prognosis. It also offers in-depth expansion of methodology, statistics, and cost issues that emerge in medical research.

Guideline specific resources

Improving the use of research evidence in guideline development (SERIES)
A series of 16 papers published in Health Research Policy and Systems in 2006, Volume 4, Issues 12 to 28, about guideline development. Topics are: Guidelines for guidelines; Priority setting; Group composition and consultation process; Managing conflicts of interest; Group processes; Determining which outcomes are important; Deciding what evidence to include; Synthesis and presentation of evidence; Grading evidence and recommendations; Integrating values and consumer involvement; Incorporating considerations of cost-effectiveness, affordability and resource implications; Incorporating considerations of equity; Adaptation, applicability and transferability; Reporting guidelines; Disseminating and implementing guidelines; and Evaluation.

The AGREE instrument
The purpose of the Appraisal of Guidelines Research & Evaluation (AGREE) Instrument is to provide a framework for assessing the quality of clinical practice guidelines.

GRADE Working Group
The Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group began in the year 2000 as an informal collaboration of people with an interest in addressing the shortcomings of present grading systems in health care. Our aim is to develop a common, sensible approach to grading quality of evidence and strength of recommendations.

Guidelines Advisory Committee
The Guidelines Advisory Committee (GAC) is an independent partnership of the Ontario Medical Association and the Ontario Ministry of Health and Long-Term Care (MOHLTC). The GAC's mission is to promote better health for the people of Ontario by encouraging physicians and other practitioners to use evidence-based clinical practice guidelines and clinical practices based on best available evidence. We identify, evaluate, endorse and summarize guidelines for use in Ontario.

National Guideline Clearinghouse
The National Guideline Clearinghouse (NGC) is a comprehensive database of evidence-based clinical practice guidelines and related documents. NGC is an initiative of the Agency for Healthcare Research and Quality (AHRQ), U.S. Department of Health and Human Services.

National Library of Guidelines
The National Library of Guidelines is a collection of guidelines for the NHS. It is based on the guidelines produced by NICE and other national agencies. The main focus of the Library is on guidelines produced in the UK, but where no UK guideline is available, guidelines from other countries are included in the collection.
12. The GRADE Working Group
The Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group began in the year 2000 as an informal collaboration of more than 60 methodologists, clinicians, systematic reviewers, and guideline developers representing various organizations, with the goal of addressing the shortcomings of present grading systems in health care. The aim was to develop a common, sensible approach to grading the quality of evidence and the strength of recommendations. Based on shared experience, a critical review of other systems, working through examples, and applying the system in guidelines, the Working Group has developed the GRADE approach as a common, transparent and sensible method for grading the quality of evidence and the strength of recommendations.

Several organizations are now using or endorsing the GRADE approach in its original format or with minor modifications:
[INSERT LIST OF ORGANIZATIONS]
