ScribeLab Writer
How to Critically Appraise Studies in a Systematic Review: RoB 2, ROBINS-I V2, GRADE, CASP and More

Written by Dr. Alina Grace

Published June 16, 2026 · 34 min read

Every systematic review that goes through peer review faces the same scrutiny at the same point: the risk-of-bias section. Editors and reviewers do not just check whether you appraised your studies. They check whether you used the right tool for the right study design, applied it at the right level, and connected the results to your certainty-of-evidence judgments. Pick the wrong tool, and even a technically sound review can be returned with comments that require months of additional work.

This happens more often than most researchers expect. The Newcastle-Ottawa Scale (NOS) is applied to randomized controlled trials. CASP is used as the primary appraisal framework for a meta-analysis, so no formal domain-level judgment is produced. RoB 2 is applied once per trial instead of once per reported outcome. Risk-of-bias results are never connected to the GRADE Summary of Findings table. Each of these errors signals to a peer reviewer that the appraisal was conducted without a clear understanding of each tool's intended purpose. This guide covers every major critical appraisal tool used in systematic reviews, the evidence supporting each, a decision tree for choosing among them, and the November 2025 update to ROBINS-I, which changes the standard for non-randomized studies. For research teams who need a second independent reviewer or support completing GRADE certainty ratings and Summary of Findings tables, ScribeLab Writer's systematic review service covers the full appraisal stage from tool selection through to robvis visualization and manuscript submission.

Quick Answer:

The tool you use depends on the study designs in your included evidence. RCTs use RoB 2 (Sterne et al., 2019). Non-randomized studies of interventions use ROBINS-I V2 (November 2025 update). Diagnostic accuracy studies use QUADAS-2 (Whiting et al., 2011). Qualitative studies use the CASP Qualitative checklist. Cohort and case-control studies that do not assess intervention effects use the Newcastle-Ottawa Scale, with the caveat that the NOS has not been formally validated. Umbrella reviews appraising included systematic reviews use AMSTAR 2 (Shea et al., 2017) or ROBIS. Risk-of-bias results then feed into GRADE domain-by-domain certainty ratings at the outcome level, not the study level. The tool must be pre-specified in your protocol before you begin.

What Is Critical Appraisal and Why Does It Shape Your Entire Review?

Critical appraisal in a systematic review is the structured assessment of each included study for internal validity (the risk that the study's methods introduced bias into its results) and applicability (whether the study population, setting, and outcomes are relevant to the clinical question being reviewed). It is not a quality score, it is not a pass/fail filter, and it is not an optional extra that can be added after the review is otherwise complete.

The reason critical appraisal matters at every level of the review is the concept of downstream consequence. A high risk-of-bias rating does not automatically exclude a study from a systematic review. The study stays in. But its rating directly affects the GRADE certainty of evidence for each outcome it contributes to. A body of evidence in which every included RCT has concerns in the measurement-of-outcome domain will usually be downgraded on the GRADE risk-of-bias domain, lowering the certainty from high to moderate or from moderate to low. That downgrading then appears in the Summary of Findings table and determines how strongly the review can recommend a clinical action.

What this means for you as a reviewer is that appraisal is not something you do once in the middle of the review and then move on from. It runs through the methods, results, GRADE table, and discussion. It is also one of the areas most frequently criticized in peer review, specifically because reviewers know that a weak or misapplied appraisal can make a biased body of evidence appear more trustworthy than it is. The most common reasons systematic reviews get rejected by journals consistently include inadequate or misapplied critical appraisal.

The Master Decision Tree: Matching Study Design to Appraisal Tool

The single most important question before you begin an appraisal is: what study designs are included in your evidence? The answer determines which tool you use. No single tool covers all designs.

Table 1: Master Decision Tree for Critical Appraisal Tool Selection

| Study Design | Correct Tool and Citation | Overall Judgment Levels | Access |
| --- | --- | --- | --- |
| Randomized controlled trial (RCT) | RoB 2 (Sterne et al., BMJ 2019;366:l4898) | Low / Some concerns / High | riskofbias.info |
| Non-randomized study of intervention (NRSI) | ROBINS-I V2 (updated November 2025). V1: Sterne et al., BMJ 2016;355:i4919 | Low / Moderate / Serious / Critical / No information | riskofbias.info/welcome/robins-i-v2 |
| Observational cohort or case-control (not assessing intervention effects) | Newcastle-Ottawa Scale (NOS), Wells et al., Ottawa Hospital Research Institute. Note: not formally validated. For intervention effects, use ROBINS-I V2. | 0–9 stars (cohort: Selection + Comparability + Outcome; case-control: Selection + Comparability + Exposure) | ohri.ca |
| Diagnostic accuracy study | QUADAS-2 (Whiting et al., Ann Intern Med 2011;155(8):529–536) | Per-domain: Low / High / Unclear risk of bias, plus Applicability concerns on 3 of 4 domains | quadas.net |
| Comparative diagnostic accuracy | QUADAS-C (Yang et al., Ann Intern Med 2021;174(11):1592–1599). Used alongside QUADAS-2, not instead of it. | Same 4-domain structure as QUADAS-2 | via quadas.net |
| Qualitative study | CASP Qualitative Checklist (Oxford Centre for Triple Value Healthcare) | Yes / No / Can't tell per item; generates a profile, not a domain-level risk-of-bias judgment | casp-uk.net |
| Systematic review in umbrella review (methodological quality) | AMSTAR 2 (Shea et al., BMJ 2017;358:j4008) | High / Moderate / Low / Critically low; no numeric score by design; 7 critical domains | amstar.ca |
| Systematic review in umbrella review (risk of bias in conclusions) | ROBIS (Whiting et al., J Clin Epidemiol 2016;69:225–234) | Low / High / Unclear; 3 phases, 4 Phase-2 domains | bristol.ac.uk/robis |
| Animal study (in vivo) | SYRCLE (Hooijmans et al., Lab Anim 2014;48(3):207–215) | Per-domain (adapted from original Cochrane risk-of-bias structure) | syrcle.ru.nl |
| Cross-sectional or survey study | AXIS Tool (Downes et al., BMJ Open 2016;6:e011458). Alternative: JBI critical appraisal checklist. | Per-item (20 questions): Yes / No / Unsure | axis.scienz.tech |

This table is the framework for your methods section. State which tool you used for each design category, cite the tool's primary publication, and note that the tool was pre-specified in the protocol.
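If you track included studies in a spreadsheet or script, the decision tree can be encoded as a simple lookup so that every study is routed to a tool before appraisal starts. The sketch below is a minimal illustration, not an official selector: the design labels (such as "nrsi" and "observational_exposure") and the function names are hypothetical, and the tool names simply mirror Table 1.

```python
# Hypothetical design labels mapped to the tools named in Table 1.
TOOL_BY_DESIGN = {
    "rct": "RoB 2",
    "nrsi": "ROBINS-I V2",
    "observational_exposure": "Newcastle-Ottawa Scale",
    "diagnostic_accuracy": "QUADAS-2",
    "comparative_diagnostic_accuracy": "QUADAS-2 + QUADAS-C",
    "qualitative": "CASP Qualitative Checklist",
    "systematic_review": "AMSTAR 2 or ROBIS",
    "animal": "SYRCLE",
    "cross_sectional": "AXIS or JBI checklist",
}


def select_tool(design: str) -> str:
    """Return the appraisal tool for one study design label."""
    key = design.strip().lower()
    if key not in TOOL_BY_DESIGN:
        # No single tool covers all designs, so refuse to guess.
        raise ValueError(f"No tool mapped for design: {design!r}")
    return TOOL_BY_DESIGN[key]


def tools_for_review(designs):
    """Map every design present in a review to its tool."""
    return {d: select_tool(d) for d in designs}


print(tools_for_review(["rct", "nrsi", "qualitative"]))
```

Raising an error for an unmapped design is a deliberate choice here: it forces the team to decide on a tool explicitly rather than defaulting to whichever checklist is closest to hand.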

RoB 2: The Standard Tool for Randomized Controlled Trials

RoB 2 is the current Cochrane and international standard for assessing risk of bias in randomized trials. It was published by Sterne and colleagues in 2019 after more than a decade of development within the Cochrane Bias Methods Group and replaced the original Cochrane Risk of Bias tool (RoB 1, 2008).

Citation: Sterne JAC, Savović J, Page MJ, Elbers RG, Blencowe NS, Boutron I, et al. RoB 2: a revised tool for assessing risk of bias in randomized trials. BMJ. 2019 Aug 28;366:l4898. DOI: 10.1136/bmj.l4898. Open access.

The Five Domains of RoB 2

Each domain contains a set of signaling questions answered as Yes, Probably Yes, Probably No, No, or No Information. A published algorithm maps those answers to a proposed domain-level judgment of Low, Some concerns, or High. The algorithm is available in the guidance document and in the Excel templates at riskofbias.info.

Domain 1: Bias arising from the randomization process. This covers whether the allocation sequence was truly random, whether it was adequately concealed from those responsible for enrolling and assigning participants to treatment groups, and whether baseline imbalances suggest the randomization failed. A study with unconcealed allocation that shows major baseline imbalances should receive a High rating on this domain.

Domain 2: Bias due to deviations from intended interventions. This domain asks whether participants, carers, or the people delivering the interventions knew which intervention was assigned, and whether that knowledge caused what was actually given or done to differ from what the trial protocol intended. For example, that knowledge can prompt care providers to add co-interventions for control-group patients, or lead participants to seek additional treatments outside the trial. (Blinding of outcome assessors is handled in Domain 4.) RoB 2 requires you to specify which target effect (called the estimand) you are measuring before completing this domain, because the signaling questions differ depending on your answer. The estimand is simply the precise quantity the analysis is designed to estimate: "what exact effect am I trying to measure?" RoB 2 recognizes two options. The first is the effect of being assigned to the intervention regardless of whether participants actually took it (the intention-to-treat estimand, relevant for pragmatic trials measuring real-world effectiveness). The second is the effect of actually receiving and adhering to the intervention as prescribed (the per-protocol estimand, relevant when compliance matters and efficacy under ideal conditions is the question). For the intention-to-treat estimand, the risk-of-bias concern is whether unblinding caused additional co-interventions that distorted the comparison between arms. For the per-protocol estimand, the concern is whether the analysis handled non-adherence correctly without introducing selection bias.

Domain 3: Bias due to missing outcome data. Missing data becomes a risk-of-bias concern when data are missing for reasons related to the true outcome (informative missingness). The signaling questions ask whether outcome data were available for all randomized participants, and if not, whether the proportions and reasons differed between arms.

Domain 4: Bias in measurement of the outcome. This domain asks whether the method of outcome measurement was appropriate, whether outcome assessors were blinded, and whether outcomes were measured in a way that could have been influenced by knowledge of the intervention received. Patient-reported outcomes in unblinded trials are frequently rated High or Some concerns here unless an objective measure was prespecified.

Domain 5: Bias in the selection of the reported result. This domain captures selective outcome reporting, meaning the difference between what was pre-specified and what was actually reported. This requires comparing the published results against a pre-registered protocol or a trial registration record.

How RoB 2 Is Applied: Per Outcome, Not Per Trial

This is the structural feature of RoB 2 that researchers most often misunderstand. RoB 2 is applied per result, meaning per outcome per analysis, not once for the entire trial. A trial may receive a low risk of bias for its primary outcome and some concerns for a secondary outcome if the secondary outcome was not pre-specified. The appraisal table therefore needs one row for each trial-outcome combination, not one row per trial.
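To make the per-result structure concrete, here is a minimal sketch of how an appraisal table could be held in code, with one row per trial-outcome pair. The trial names, outcomes, and judgments are hypothetical, and the field names (d1 to d5) are an assumed shorthand for the five RoB 2 domains.

```python
from collections import Counter, namedtuple

# One row per result (trial x outcome), not one row per trial.
RoB2Row = namedtuple("RoB2Row", "trial outcome d1 d2 d3 d4 d5")

# Hypothetical judgments, for illustration only.
rows = [
    RoB2Row("Trial A", "Mortality", "Low", "Low", "Low", "Low", "Low"),
    RoB2Row("Trial A", "Quality of life", "Low", "Low", "Low",
            "Some concerns", "Some concerns"),
    RoB2Row("Trial B", "Mortality", "Low", "Some concerns", "Low", "Low", "Low"),
]


def duplicated_results(rows):
    """Return (trial, outcome) pairs that were assessed more than once."""
    counts = Counter((r.trial, r.outcome) for r in rows)
    return [pair for pair, n in counts.items() if n > 1]


def outcomes_per_trial(rows):
    """Count how many assessed results each trial contributes."""
    return dict(Counter(r.trial for r in rows))


print(outcomes_per_trial(rows))
print(duplicated_results(rows))
```

A trial that contributes two outcomes appears twice, which is exactly what a reviewer expects to see in the supplementary appraisal table.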

The Three RoB 2 Variants

The base tool covers individually randomized parallel-group trials. Separate extended templates are available for cluster-randomized trials (which add a domain for bias arising from the timing of identification and recruitment of participants) and crossover trials (which add a domain for bias arising from period and carryover effects). These variants are hosted at riskofbias.info along with their guidance documents and Excel templates.

The Overall Risk-of-Bias Judgment

The worst-performing domain determines the overall judgment (Low, Some concerns, or High). If all domains are rated Low, the overall judgment is Low. If any domain is rated Some concerns and none are rated High, the overall judgment is Some concerns. If any domain is rated High, or if combinations of Some concerns judgments are considered collectively sufficient to indicate a High risk of bias, the overall judgment is High.
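The rule described above can be sketched in a few lines. This is an illustration of the stated logic only, not the official RoB 2 algorithm or template: the function name and the `escalate_multiple_concerns` flag are assumptions, with the flag standing in for the reviewer's judgment that several Some concerns ratings together amount to High.

```python
SEVERITY = {"Low": 0, "Some concerns": 1, "High": 2}


def overall_rob2(domain_judgments, escalate_multiple_concerns=False):
    """Derive an overall judgment from domain-level judgments.

    escalate_multiple_concerns is a hypothetical flag: set it when the
    reviewers judge that multiple 'Some concerns' ratings collectively
    indicate a high risk of bias.
    """
    for judgment in domain_judgments:
        if judgment not in SEVERITY:
            raise ValueError(f"Unknown judgment: {judgment!r}")
    # The worst-performing domain drives the overall judgment.
    worst = max(domain_judgments, key=SEVERITY.__getitem__)
    n_concerns = sum(1 for j in domain_judgments if j == "Some concerns")
    if worst == "Some concerns" and escalate_multiple_concerns and n_concerns > 1:
        return "High"
    return worst


print(overall_rob2(["Low", "Low", "Some concerns", "Low", "Low"]))
```

The escalation step is left as an explicit, reviewer-controlled switch because it is a judgment call, and it should be documented in the appraisal notes whenever it is used.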

How Long Does RoB 2 Take?

Minozzi and colleagues (2020, Journal of Clinical Epidemiology, 126:37-44) measured a mean application time of 28 minutes per study outcome (70 outcomes from 70 RCTs) before implementation of a structured guidance document, with slight interrater reliability for the overall judgment (Fleiss' kappa 0.16; 95% CI 0.08 to 0.24). Only the randomization-process domain reached moderate agreement (kappa 0.45; 95% CI 0.37 to 0.53). A follow-up study (Minozzi et al., 2022) reported a mean of 168.5 minutes per study when all outcome-level assessments were conducted before a structured guidance document was introduced. These data illustrate why pre-specifying your tool and conducting a pilot appraisal exercise before the full review begins is not a bureaucratic formality but a genuine time-management necessity.
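If you run a pilot appraisal with several reviewers, you can compute the same agreement statistic on your own pilot data. The sketch below implements the standard Fleiss' kappa formula in plain Python. The input layout (one row per assessed result, one column per judgment category, cells holding the number of raters choosing that category) and the example counts are assumptions for illustration, not data from the cited studies.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] = number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("Each item needs the same number (>= 2) of raters")
    n_categories = len(counts[0])
    total = n_items * n_raters
    # Proportion of all ratings falling in each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Observed agreement for each item, then averaged.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1:
        raise ValueError("Kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical pilot: 4 raters, columns = Low / Some concerns / High.
pilot = [[4, 0, 0], [2, 2, 0], [0, 3, 1], [1, 1, 2]]
print(round(fleiss_kappa(pilot), 3))
```

A low value on the pilot is a signal to agree on decision rules and re-pilot before appraising the full set of studies.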

ROBINS-I V2: The 2025 Update for Non-Randomized Studies

Non-randomized studies of interventions (NRSI) include cohort studies, before-and-after studies, interrupted time series, and other designs that assess the effect of an intervention without random allocation. These studies are common in systematic reviews of public health interventions, complex health system interventions, and educational interventions where randomization is not feasible. Appraising them requires a tool that accounts for confounding, selection bias, and information bias in ways that RoB 2 does not.

ROBINS-I V1 (2016): Sterne JAC, Hernán MA, Reeves BC, Savović J, Berkman ND, Viswanathan M, Henry D, Altman DG, Ansari MT, Boutron I, Carpenter JR, Chan AW, Churchill R, Deeks JJ, Hróbjartsson A, Kirkham J, Jüni P, Loke YK, Pigott TD, Ramsay CR, Regidor D, Rothstein HR, Sandhu L, Santaguida PL, Schünemann HJ, Shea B, Shrier I, Tugwell P, Turner L, Valentine JC, Waddington H, Waters E, Wells GA, Whiting PF, Higgins JPT. ROBINS-I: a tool for assessing risk of bias in non-randomized studies of interventions. BMJ. 2016;355:i4919. DOI: 10.1136/bmj.i4919.

ROBINS-I V2: What Changed in November 2025

ROBINS-I V2 was developed by the team led by Jonathan Sterne and Julian Higgins, funded in part by Medical Research Council grant MR/M025209/1, and involving the Cochrane Bias Methods Group and the Cochrane Non-Randomized Studies Methods Group. The revised draft was posted on 30 November 2025 at riskofbias.info, following its introduction in a Cochrane webinar delivered in October 2025. As of this writing, V2 is scoped to follow-up (cohort) study designs, with extensions for other non-randomized designs under development.

Three key structural changes in V2:

First, V2 introduces published algorithms that map the signaling-question answers onto the proposed domain judgments, removing the interpretive ambiguity that characterized V1. In V1, reviewers were required to make a judgment call that was not always consistently derivable from the signaling questions alone. V2 makes the mapping explicit.

Second, V2 introduces "strong" and "weak" yes/no response levels, allowing reviewers to distinguish between cases where evidence clearly supports a judgment and cases where the judgment is plausible but uncertain. This refinement enables more nuanced domain ratings.

Third, V2 introduces explicit guidance for bias due to immortal time, a specific type of time-related confounding that was not addressed in ROBINS-I V1 and that is a particularly common source of bias in pharmacoepidemiology and healthcare database studies.

The Domains of ROBINS-I

ROBINS-I V1 assessed seven domains: (1) bias due to confounding; (2) bias in selection of participants into the study; (3) bias in classification of interventions; (4) bias due to deviations from intended interventions; (5) bias due to missing data; (6) bias in measurement of outcomes; (7) bias in selection of the reported result. ROBINS-I V2 reduces and restructures this into six domains, with revised domain scope and new signaling questions aligned to the algorithms.

Overall Judgments in ROBINS-I

The V1 overall judgment options are: Low (comparable in bias to a well-conducted RCT), Moderate (some risk, but unlikely to seriously alter the estimated effect), Serious (risk that may substantially alter the estimated effect), Critical (risk sufficient to render the study meaningless for the review question), and No information. These translate directly into GRADE: a study rated Moderate or Serious on ROBINS-I should trigger downgrading on the "risk of bias" GRADE domain for the outcomes it contributes. V2 carries forward the same judgment structure.

The Excel template and full guidance documents for both V1 and V2 are available at riskofbias.info/welcome/robins-i-v2.

Newcastle-Ottawa Scale: What It Covers and Its Critical Limitations

The Newcastle-Ottawa Scale (NOS) was developed by Wells and colleagues at the Ottawa Hospital Research Institute for use in cohort studies and case-control studies. It is one of the most widely used observational-study appraisal tools in published systematic reviews, and it is also one of the most frequently misapplied.

NOS for cohort studies has three domains: Selection (up to 4 stars), Comparability (up to 2 stars), and Outcome (up to 3 stars), for a maximum of 9 stars. NOS for case-control studies has three domains: Selection (up to 4 stars), Comparability (up to 2 stars), and Exposure (up to 3 stars).

The critical limitation that competitors' guides rarely state plainly: NOS was never formally validated. Stang (2010) in the European Journal of Epidemiology (DOI: 10.1007/s10654-010-9491-z) published the definitive critique, identifying that NOS items lack clear definitions, cutoffs for converting the star ratings to a dichotomous quality judgment are arbitrary, and the scale has not been tested for construct validity against actual study quality. A follow-up by Stang, Jonas, and Poole (2018, DOI: 10.1007/s10654-018-0443-3) documented widespread quotation errors in the 1,250 reviews that cited the 2010 critique by December 2016.

The practical consequence: When your review includes NOS, you must acknowledge this limitation explicitly. You cannot present a NOS star rating as equivalent to the domain-level judgments of a RoB 2 assessment. For non-randomized intervention studies specifically, ROBINS-I V2 is the methodologically appropriate replacement. NOS should be reserved for observational studies assessing exposure-outcome relationships rather than intervention effects.

QUADAS-2: Critical Appraisal for Diagnostic Accuracy Studies

Systematic reviews of diagnostic tests assess how well an index test (the test being evaluated) agrees with a reference standard (the gold-standard diagnosis). Appraising the studies in these reviews requires a tool designed specifically for this question, because the sources of bias in diagnostic accuracy studies are fundamentally different from those in intervention trials.

Citation: Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, Leeflang MM, Sterne JA, Bossuyt PM; QUADAS-2 Group. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011 Oct 18;155(8):529-536. DOI: 10.7326/0003-4819-155-8-201110180-00009.

The Four Domains of QUADAS-2

Domain 1: Patient selection. Asks whether patients were enrolled consecutively or randomly, whether any inappropriate exclusions were made, and whether a case-control design was avoided. A case-control design (recruiting known-positive and known-negative patients separately) inflates sensitivity and specificity estimates and represents a high risk of bias.

Domain 2: Index test. Asks whether the index test results were interpreted without knowledge of the results of the reference standard, and whether a threshold was pre-specified. If the index test assessor knew the reference standard result, the accuracy estimate would be inflated.

Domain 3: Reference standard. Asks whether the reference standard is likely to classify the target condition correctly and whether results were interpreted without knowledge of the index test result. A weak reference standard produces estimates of accuracy that reflect the comparison between the index test and the reference standard, not the true underlying condition.

Domain 4: Flow and timing. Asks whether there was an appropriate interval between the index test and reference standard, whether all patients received the same reference standard, and whether all patients were included in the analysis. Partial verification (only some patients receive the reference standard) and differential verification (different reference standards for different patient groups) both introduce bias.

All four domains are rated for risk of bias (low, high, or unclear). The first three domains are additionally rated for applicability concerns (concerns about whether the study matches the review question). Applicability concerns are not risk-of-bias ratings but affect how the evidence can be used clinically.
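The asymmetry between the two sets of ratings (four risk-of-bias ratings, three applicability ratings) is easy to get wrong in a data extraction form. The following sketch checks one study's QUADAS-2 record against that structure. The domain keys and the function name are hypothetical, chosen only to mirror the domain names above.

```python
ROB_DOMAINS = ("patient_selection", "index_test",
               "reference_standard", "flow_and_timing")
# Flow and timing has no applicability rating.
APPLICABILITY_DOMAINS = ROB_DOMAINS[:3]
ALLOWED = {"low", "high", "unclear"}


def validate_quadas2(risk_of_bias, applicability):
    """Return a list of problems found in one study's QUADAS-2 record."""
    problems = []
    for domain in ROB_DOMAINS:
        if risk_of_bias.get(domain) not in ALLOWED:
            problems.append(f"risk of bias missing or invalid for {domain}")
    for domain in APPLICABILITY_DOMAINS:
        if applicability.get(domain) not in ALLOWED:
            problems.append(f"applicability missing or invalid for {domain}")
    for domain in sorted(set(applicability) - set(APPLICABILITY_DOMAINS)):
        problems.append(f"applicability is not rated for {domain}")
    return problems


# Hypothetical record for one study.
rob = {"patient_selection": "high", "index_test": "low",
       "reference_standard": "low", "flow_and_timing": "unclear"}
app = {"patient_selection": "low", "index_test": "low",
       "reference_standard": "low"}
print(validate_quadas2(rob, app))
```

Keeping the two dictionaries separate also makes it harder to report an applicability concern as if it were a risk-of-bias rating.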

QUADAS-C (Yang et al., 2021, Annals of Internal Medicine, DOI: 10.7326/M21-2234) is the extension for comparative diagnostic accuracy studies (comparing two tests head-to-head). It uses the same four-domain structure as QUADAS-2 and is applied in addition to, not instead of, QUADAS-2.

CASP Checklists: When to Use Them and When Not To

The Critical Appraisal Skills Programme (CASP), developed by the Oxford Centre for Triple Value Healthcare, provides free downloadable checklists for nine study designs: systematic reviews, RCTs, cohort studies, case-control studies, qualitative studies, diagnostic test studies, economic evaluations, clinical prediction rules, and cross-sectional studies. All checklists are freely available at casp-uk.net/casp-tools-checklists/.

What CASP does: CASP generates a yes/no/can't-tell profile for each study, which shows which methodological criteria the study does and does not meet. This profile is descriptive, not a domain-level risk-of-bias judgment in the sense that RoB 2 produces.

When CASP is appropriate: qualitative systematic reviews, mixed-methods reviews, and scoping reviews where formal risk-of-bias assessment using RoB 2 or ROBINS-I is optional or excluded. For qualitative studies, the CASP Qualitative checklist is the most widely used appraisal tool in nursing and health services research. For mixed-methods reviews, it provides a consistent framework across diverse study designs.

When CASP is not appropriate: as the primary appraisal tool in a quantitative systematic review where meta-analysis is planned and where GRADE certainty ratings are required. CASP does not generate the domain-level judgments that map onto GRADE domains, and using it instead of RoB 2 or ROBINS-I in a meta-analysis will be flagged by peer reviewers.

AMSTAR 2: Appraising Included Systematic Reviews in Umbrella Reviews

When your review includes other systematic reviews as the unit of analysis (an umbrella review or overview of reviews), you need a tool to assess the methodological quality of each included review. AMSTAR 2 (A Measurement Tool to Assess Systematic Reviews, version 2) is the current standard for this purpose.

Citation: Shea BJ, Reeves BC, Wells G, Thuku M, Hamel C, Moran J, Moher D, Tugwell P, Welch V, Kristjansson E, Henry DA. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomized or non-randomized studies of healthcare interventions, or both. BMJ. 2017 Sep 21;358:j4008. DOI: 10.1136/bmj.j4008.

AMSTAR 2 has 16 items, of which 7 are considered critical domains: whether the protocol was registered before the review began (item 2); whether the literature search was adequate (item 4); whether the exclusion of individual studies was justified (item 7); whether risk of bias was assessed in the individual included studies (item 9); whether meta-analytic pooling used appropriate methods (item 11); whether risk of bias was considered when interpreting the results of the review (item 13); and whether the presence and likely impact of publication bias were assessed (item 15).

Overall confidence ratings: High, Moderate, Low, or Critically low. One critical flaw drops confidence to Low, and more than one critical flaw drops it to Critically low, regardless of how well other items were handled. AMSTAR 2 deliberately produces no numeric score because the 16 items do not carry equal weight, and summing them would imply equivalence that does not exist. The checklist and guidance are available at amstar.ca.
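The rating scheme can be sketched as a small function that takes counts of critical flaws and non-critical weaknesses instead of a summed score. This follows the scheme described in Shea et al. (2017) as summarized here; the function and parameter names are hypothetical, and the output should be treated as a starting point for reviewer judgment, not as a replacement for it.

```python
def amstar2_confidence(critical_flaws: int, noncritical_weaknesses: int) -> str:
    """Overall confidence from counts of flaws, never from a summed score."""
    if critical_flaws < 0 or noncritical_weaknesses < 0:
        raise ValueError("Counts cannot be negative")
    if critical_flaws > 1:
        return "Critically low"
    if critical_flaws == 1:
        return "Low"
    if noncritical_weaknesses > 1:
        return "Moderate"
    # No critical flaws and at most one non-critical weakness.
    return "High"


print(amstar2_confidence(critical_flaws=0, noncritical_weaknesses=3))
print(amstar2_confidence(critical_flaws=1, noncritical_weaknesses=0))
```

Note that critical flaws are checked first: a review with one critical flaw is rated Low even if every other item is satisfied.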

AMSTAR 2 versus ROBIS: ROBIS (Whiting P et al., J Clin Epidemiol. 2016;69:225-234, DOI: 10.1016/j.jclinepi.2015.06.005) assesses risk of bias specifically in the conclusions of a systematic review, while AMSTAR 2 assesses methodological quality and yields an overall rating of confidence in the review's results. ROBIS is more appropriate when the specific concern is whether the review's conclusions may be misleading due to a biased process; AMSTAR 2 is more appropriate for assessing the overall trustworthiness of a review for inclusion in an umbrella analysis.

GRADE: How Your Appraisal Results Feed Certainty of Evidence

Critical appraisal generates domain-level risk-of-bias ratings for each study and each outcome. GRADE (Grading of Recommendations, Assessment, Development and Evaluations) is the framework for translating those ratings into certainty-of-evidence judgments for each body of evidence, organized by outcome.

The key structural point: GRADE operates at the outcome level, not the study level. You do not produce one GRADE certainty rating for each included trial. You produce one GRADE certainty rating for each outcome of interest, drawing on all the studies that contribute evidence to that outcome. This means your GRADE table has rows for outcomes, not for studies.

The five GRADE domains that can lower certainty (downgrading):

Risk of bias. Your RoB 2 or ROBINS-I ratings feed directly here. A body of RCTs where most are rated Some concerns or High gets downgraded. For ROBINS-I, Moderate or Serious ratings suggest downgrading; Critical suggests very serious downgrading. Cochrane Handbook Chapter 14, updated August 2023 in version 6.5, is the governing reference.

Inconsistency. High unexplained heterogeneity (discussed in detail in the companion article on meta-analysis heterogeneity) is addressed here. If the effect estimates vary widely across studies and subgroup differences cannot explain the variation, the certainty is downgraded.

Indirectness. The question is whether the population, intervention, comparator, or outcome in the included studies is sufficiently similar to the clinical question the review is answering. If your review question concerns elderly patients but all included trials were conducted in working-age adults, the evidence is indirect.

Imprecision. Wide confidence intervals that cross the line of no effect, or a pooled estimate based on very few events or participants, signal imprecision. The threshold for downgrading depends on whether the confidence interval crosses a clinical decision threshold.

Publication bias. If evidence suggests that unpublished studies with negative results exist (assessed through funnel plot asymmetry, Egger's test, or evidence of selective outcome reporting), certainty is downgraded.

The three GRADE domains that can raise certainty (upgrading, for non-randomized evidence):

A large magnitude of effect (e.g., relative risk greater than 2.0 or less than 0.5), a demonstrated dose-response gradient, and plausible confounding that would reduce the observed effect (i.e., the observed effect is likely an underestimate of the true effect given the expected direction of confounding) can each upgrade certainty for observational evidence.

The four GRADE certainty levels and their plain-language meanings:

High: We are very confident that the true effect lies close to that of the estimated effect.

Moderate: We are moderately confident in the effect estimate. The true effect is likely close to the estimate of the effect, but there is a possibility that it is substantially different.

Low: Our confidence in the effect estimate is limited. The true effect may be substantially different from the estimate of the effect.

Very low: We have very little confidence in the effect estimate. The true effect is likely to be substantially different from the estimated effect.
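The bookkeeping behind a GRADE rating for one outcome can be sketched as a starting level moved down by downgrades and up by upgrades. This is a simplified illustration under stated assumptions: it uses the conventional GRADE starting points (High for randomized evidence, Low for non-randomized evidence), the domain keys and function name are hypothetical, and real GRADE ratings are structured judgments that should be justified in footnotes, not produced by arithmetic alone.

```python
LEVELS = ("Very low", "Low", "Moderate", "High")
DOWNGRADE_DOMAINS = {"risk_of_bias", "inconsistency", "indirectness",
                     "imprecision", "publication_bias"}
UPGRADE_DOMAINS = {"large_effect", "dose_response", "opposing_confounding"}


def grade_certainty(randomized, downgrades, upgrades=None):
    """Certainty for ONE outcome; each domain moves it by 0, 1, or 2 levels."""
    upgrades = upgrades or {}
    unknown = (set(downgrades) - DOWNGRADE_DOMAINS) | (set(upgrades) - UPGRADE_DOMAINS)
    if unknown:
        raise ValueError(f"Unknown GRADE domains: {sorted(unknown)}")
    steps = list(downgrades.values()) + list(upgrades.values())
    if any(step not in (0, 1, 2) for step in steps):
        raise ValueError("Each domain moves certainty by 0, 1, or 2 levels")
    # Assumed starting points: High (index 3) for RCTs, Low (index 1) otherwise.
    start = 3 if randomized else 1
    score = start - sum(downgrades.values()) + sum(upgrades.values())
    # Clamp to the four defined levels.
    return LEVELS[max(0, min(3, score))]


# Hypothetical outcome: RCT evidence with serious risk-of-bias concerns.
print(grade_certainty(True, {"risk_of_bias": 1}))
```

Because the function is called once per outcome, it mirrors the point made above: the GRADE table has rows for outcomes, not for studies.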

These definitions come from Balshem H, Helfand M, Schünemann HJ, Oxman AD, Kunz R, Brozek J, Vist GE, Falck-Ytter Y, Meerpohl J, Norris S, Guyatt GH. GRADE guidelines: 3. Rating the quality of evidence. J Clin Epidemiol. 2011 Apr;64(4):401-406. DOI: 10.1016/j.jclinepi.2010.07.015.

Building the Summary of Findings Table

The Summary of Findings table is the output document that presents GRADE certainty ratings alongside the key results for each outcome. It is produced using GRADEpro GDT (gradepro.org), a free online platform that generates the table in a standardized format accepted by Cochrane and most journals. The table includes: a description of the patient and intervention; the comparison; the assumed risk in the control group; the corresponding risk in the intervention group (or relative effect); the number of participants and studies; the certainty rating; and a comments field for the most important limitations.

For PRISMA 2020 compliance, item 20 requires you to present the results of syntheses for the main outcomes, and item 22 requires you to present the certainty assessment for each outcome assessed. The Summary of Findings table supports both requirements.

Need support completing GRADE certainty ratings and Summary of Findings tables?

Assessing five GRADE domains per outcome across every included study, then producing a Summary of Findings table that accurately reflects your appraisal findings, is one of the most methodologically demanding stages of a systematic review. Cochrane MECIR standards require at least two independent reviewers throughout. ScribeLab Writer's systematic review team supports PhD students, MSN and DNP researchers, and faculty teams with risk-of-bias assessment, GRADE certainty ratings, and Summary of Findings tables ready for journal submission.

Visualizing Your Appraisal Results With robvis

Once you have completed your domain-level appraisal for all included studies, you need to present the results visually. The standard tool for this is robvis (Risk-of-Bias Visualization).

Citation: McGuinness LA, Higgins JPT. Risk-of-bias Visualization (robvis): An R package and Shiny web application for visualizing risk-of-bias assessments. Res Synth Methods. 2021 Jan;12(1):55-61. DOI: 10.1002/jrsm.1411.

robvis is freely available as an R package and as a browser-based Shiny web application at mcguinlu.github.io/robvis/. You do not need to know R to use it. The Shiny app accepts a template spreadsheet (available in the tool's interface for RoB 2, ROBINS-I, ROBINS-E, QUADAS-2, QUIPS, and a generic format) that you fill in with your domain-level judgments. The tool then generates two visualizations.

Traffic-light plot (rob_traffic_light): A matrix where each row is an included study and each column is a domain. Each cell is colored green (low risk of bias), yellow (some concerns), red (high risk of bias), or grey (no information). This plot shows the domain-level profile for every study simultaneously and is the standard figure in the risk-of-bias section of a published systematic review.

Weighted bar chart (rob_summary): Shows the proportion of studies in each risk-of-bias category for each domain, weighted by the study's contribution to the meta-analysis. This provides an at-a-glance summary of where bias is concentrated across the evidence base.

Both figures are exportable as publication-quality SVG or PNG files. Most journals accept these formats; for journals that require TIFF, export the PNG at 600 DPI or higher and then convert it.
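To make the weighting in the summary chart concrete, here is a minimal Python sketch of the arithmetic behind a weighted summary: for each domain, the share of total meta-analysis weight that falls in each judgment category. This is an illustration only, not the robvis implementation; the study names, weights, and judgments are invented, and `weighted_summary` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical RoB 2 style judgments (domains D1-D5) for three trials.
# "weight" stands in for each study's meta-analysis weight; all values are made up.
assessments = [
    {"study": "Trial A", "weight": 50.0,
     "domains": {"D1": "Low", "D2": "Low", "D3": "Some concerns", "D4": "Low", "D5": "Low"}},
    {"study": "Trial B", "weight": 30.0,
     "domains": {"D1": "Low", "D2": "High", "D3": "Low", "D4": "Low", "D5": "Some concerns"}},
    {"study": "Trial C", "weight": 20.0,
     "domains": {"D1": "Some concerns", "D2": "Low", "D3": "Low", "D4": "High", "D5": "Low"}},
]


def weighted_summary(rows):
    """Return {domain: {judgment: share of total weight}}."""
    total = sum(r["weight"] for r in rows)
    shares = defaultdict(lambda: defaultdict(float))
    for r in rows:
        for domain, judgment in r["domains"].items():
            # Each study contributes its share of the total weight to one category.
            shares[domain][judgment] += r["weight"] / total
    return {domain: dict(by_judgment) for domain, by_judgment in shares.items()}


summary = weighted_summary(assessments)
for domain in sorted(summary):
    print(domain, {j: round(p, 2) for j, p in summary[domain].items()})
```

With equal weights the same function gives the unweighted proportion of studies in each category, which is the other common way this chart is drawn.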

How to Pre-Specify Your Appraisal Tool in the Protocol

One of the most common reasons appraisal sections are criticized in peer review is that the tool was not pre-specified in the protocol. If your PROSPERO record or protocol document says "we will assess risk of bias" without naming the tool, a reviewer will question whether the tool was chosen after seeing the results, which introduces the possibility of outcome-dependent methodology.

Your protocol must name: the specific tool for each study design category; the number of reviewers who will complete appraisal independently; the plan for resolving disagreements (discussion, third reviewer adjudication, or both); and whether risk-of-bias results will be used as a sensitivity analysis criterion (e.g., a sensitivity analysis restricted to low-risk-of-bias studies).
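One way to check a draft protocol against these four elements is to write the appraisal plan down as structured data and test it for gaps. The sketch below is a hypothetical checklist helper, not part of PROSPERO or any appraisal tool; the class name, field names, and messages are assumptions made for illustration, and the two-reviewer threshold reflects the Cochrane standard mentioned in this guide.

```python
from dataclasses import dataclass


@dataclass
class AppraisalPlan:
    tools_by_design: dict          # study design -> named tool, e.g. {"RCT": "RoB 2"}
    independent_reviewers: int     # reviewers appraising independently
    disagreement_resolution: str   # e.g. "discussion, then third-reviewer adjudication"
    sensitivity_analysis: str      # how risk-of-bias results feed sensitivity analyses


def missing_elements(plan, included_designs):
    """List the pre-specification gaps a peer reviewer would be likely to flag."""
    problems = []
    for design in included_designs:
        if not plan.tools_by_design.get(design):
            problems.append(f"no tool named for design: {design}")
    if plan.independent_reviewers < 2:
        problems.append("fewer than two independent reviewers")
    if not plan.disagreement_resolution.strip():
        problems.append("no disagreement-resolution plan")
    if not plan.sensitivity_analysis.strip():
        problems.append("use of risk-of-bias results in sensitivity analysis not stated")
    return problems


draft = AppraisalPlan(
    tools_by_design={"RCT": "RoB 2"},
    independent_reviewers=2,
    disagreement_resolution="discussion, then third-reviewer adjudication",
    sensitivity_analysis="primary analysis repeated in low-risk-of-bias studies only",
)
gaps = missing_elements(draft, ["RCT", "NRSI"])
print(gaps)
```

Here the draft names a tool for RCTs but not for the non-randomized studies the review also plans to include, which is exactly the kind of gap that is easier to fix before registration than after peer review.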

The appraisal subsection of your methods section should then report whether the planned approach was followed. If any deviation occurred, state the reason transparently.

Common Critical Appraisal Mistakes That Lead to Peer Review Rejection

Using NOS for randomized controlled trials. NOS was designed for observational cohort and case-control studies. Applying it to RCTs produces a rating that is not interpretable in the same terms and is immediately recognizable as an error to any reviewer familiar with appraisal methodology. RCTs use RoB 2.

Treating appraisal as a pass/fail quality filter. Some researchers exclude studies rated as high risk of bias from the meta-analysis. This practice is not recommended. High-risk-of-bias studies should remain in the primary analysis, with a sensitivity analysis restricted to low-risk-of-bias studies. Excluding on the basis of appraisal introduces its own selection bias.

Applying RoB 2 once per trial instead of once per outcome. The most structurally important feature of RoB 2 is that it is applied at the level of the result (outcome by analysis), not the study. A trial rated Low for its primary outcome may be rated Some concerns for a secondary outcome because the secondary outcome was not pre-specified.

Not connecting appraisal results to GRADE. Completing the risk-of-bias assessment and then presenting GRADE certainty ratings that do not reflect those assessments is an internal inconsistency that will be identified in peer review. Every GRADE downgrading decision on the "risk of bias" domain must be traceable to specific domain-level judgments in your appraisal tables.

Using CASP as the primary appraisal tool for a quantitative meta-analysis. CASP is appropriate for qualitative and mixed-methods reviews but does not generate domain-level judgments that map onto GRADE. Using CASP for a meta-analysis suggests either that the wrong tool was chosen or that the GRADE certainty ratings were made without a proper foundation.

Failing to use the guidance documents alongside each tool. Every major appraisal tool has a companion guidance document that explains how to interpret and apply each signaling question. RoB 2's guidance runs to several dozen pages. Completing the tool without the guidance document produces unreliable ratings and reduces interrater reliability, as the Minozzi studies document.
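The per-outcome point above is easier to see in data form. The following Python sketch keys each assessment by trial and outcome rather than by trial alone, then derives an overall judgment from the five domain judgments. It is a simplified illustration with invented trials and outcomes: the RoB 2 guidance also allows an overall High judgment when several domains carry Some concerns in a way that substantially lowers confidence, and that reviewer judgment is deliberately not modeled here.

```python
DOMAINS = ("D1", "D2", "D3", "D4", "D5")
ALLOWED = {"Low", "Some concerns", "High"}


def overall_judgment(domain_judgments):
    """Simplified overall rule: any High -> High; else any Some concerns -> Some concerns."""
    values = [domain_judgments[d] for d in DOMAINS]
    unknown = set(values) - ALLOWED
    if unknown:
        raise ValueError(f"unexpected judgment(s): {sorted(unknown)}")
    if "High" in values:
        return "High"
    if "Some concerns" in values:
        return "Some concerns"
    return "Low"


# One record per (trial, outcome), not one record per trial. Hypothetical data.
results = {
    ("Trial A", "mortality at 12 months"): {
        "D1": "Low", "D2": "Low", "D3": "Low", "D4": "Low", "D5": "Low"},
    ("Trial A", "quality of life at 12 months"): {
        "D1": "Low", "D2": "Low", "D3": "Low", "D4": "Low", "D5": "Some concerns"},
}

overall = {key: overall_judgment(judgments) for key, judgments in results.items()}
for (trial, outcome), judgment in overall.items():
    print(f"{trial} | {outcome}: {judgment}")
```

The same trial appears twice with different overall judgments, which is the structure your appraisal tables and traffic-light plots should preserve.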

Quick Reference: All Appraisal Tools at a Glance

Table 2: Systematic Review Appraisal Tools Summary

| Tool | Study Design | Judgment Levels | Free? | Critical Limitation to Know |
|---|---|---|---|---|
| RoB 2 | Randomized controlled trials (RCTs) | Low / Some concerns / High | Yes | Applied per outcome, not per study. One trial may receive different ratings for different outcomes depending on reporting and measurement. |
| ROBINS-I V2 | Non-randomized studies of interventions (NRSI) | Low / Moderate / Serious / Critical / No information | Yes | V2 (November 2025) is currently scoped to follow-up cohort designs only. Use ROBINS-I V1 for other NRSI designs until V2 coverage expands. |
| NOS | Cohort and case-control studies (observational, not intervention effects) | 0–9 stars | Yes | Not formally validated (Stang, Eur J Epidemiol 2010). Do not apply to RCTs or non-randomized intervention studies. For NRSIs, use ROBINS-I V2. |
| QUADAS-2 | Diagnostic accuracy studies | Per-domain: Low / High / Unclear risk of bias, plus Applicability concerns on 3 of 4 domains | Yes | Not for intervention effects. For comparative accuracy studies (two tests head-to-head), apply QUADAS-C alongside QUADAS-2, not instead of it. |
| CASP | Qualitative, mixed-methods, and scoping reviews | Yes / No / Can't tell per item; generates a profile, not a domain-level risk-of-bias judgment | Yes | Not suitable as the sole appraisal tool for a quantitative meta-analysis. Does not produce the domain-level outputs required for GRADE downgrading. |
| AMSTAR 2 | Systematic reviews (in umbrella reviews; methodological quality) | High / Moderate / Low / Critically low | Yes | No numeric score by design: the 16 items are not equally weighted. One critical flaw caps the overall confidence rating at Low, and more than one makes it Critically low, regardless of other items. |
| ROBIS | Systematic reviews (in umbrella reviews; risk of bias in conclusions) | Low / High / Unclear; 3 phases, 4 Phase-2 domains | Yes | Assesses risk of bias within a review's conclusions specifically. Complementary to AMSTAR 2 in umbrella reviews: they answer different questions. |
| SYRCLE | Animal studies (in vivo) | Per-domain (adapted from original Cochrane risk-of-bias structure) | Yes | Limited validation data compared to RoB 2 or ROBINS-I. Use for in vivo animal studies only, not for in vitro or cell-based research. |
| AXIS | Cross-sectional and survey studies | Per-item (20 questions): Yes / No / Unsure | Yes | Less widely used in clinical systematic review literature than RoB 2 or ROBINS-I. The JBI critical appraisal checklist is an equally acceptable alternative. |
| QUIPS | Prognostic studies | Per-domain: Low / Moderate / High (6 domains) | Yes | Specific to prognosis questions only. Not applicable to diagnostic accuracy or intervention studies. For clinical prediction models, use PROBAST instead. |

International Context: How Critical Appraisal Standards Differ Across Settings

United States: Cochrane and AHRQ systematic review guidance both require explicit risk-of-bias assessment using validated tools. AHRQ's Evidence-based Practice Center (EPC) program specifies the use of study-design-appropriate tools and the production of evidence tables summarizing appraisal results. Most US nursing, medicine, and public health programs teach RoB 2 and GRADE as the core framework, consistent with the AACN Essentials (2021) requirement for evidence synthesis competency.

United Kingdom: The NICE Evidence Standards Framework for Digital Health Technologies and NICE Systematic Reviews methodology guidance both require RoB 2 for RCTs and GRADE certainty ratings. The National Institute for Health and Care Research (NIHR) and its Health Technology Assessment (HTA) program fund systematic reviews that follow Cochrane methodology, which mandates ROBINS-I (now V2) for non-randomized studies.

Australia: The NHMRC (National Health and Medical Research Council) Handbook series and the Australian Living Evidence Consortium both follow Cochrane and GRADE standards. QUADAS-2 is specifically mandated for diagnostic accuracy reviews submitted to the Medicare Services Advisory Committee (MSAC).

Saudi Arabia and UAE: Health Technology Assessment programs at the Saudi Health Council and the Health Technology Assessment division of the UAE Ministry of Health are increasingly adopting GRADE standards, consistent with their use of internationally accredited systematic review methodology for formulary and reimbursement decisions.


Frequently Asked Questions About Critical Appraisal in Systematic Reviews

Can I use RoB 2 for observational studies?

No. RoB 2 was designed specifically for randomized trials. Applying it to observational studies produces a rating that does not reflect the actual sources of bias in those designs (confounding, selection into the study, information bias). For non-randomized intervention studies, use ROBINS-I V2. For cohort and case-control observational studies assessing exposure-outcome associations (not intervention effects), use the Newcastle-Ottawa Scale, with the acknowledged limitation that it was not formally validated.

How do I handle a review that includes both RCTs and non-randomized studies?

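Use both tools. Apply RoB 2 to every RCT and ROBINS-I V2 to every NRSI. Report the two sets of appraisal results in separate tables. In conventional GRADE, evidence from non-randomized studies starts at low certainty (not high, as RCT evidence does). GRADE guidance on using ROBINS-I allows NRSI evidence to start at high certainty instead, because the tool judges each study against a hypothetical target trial; confounding and selection bias are then expected to lower the rating through the risk-of-bias domain, usually by two or more levels. Either way, non-randomized designs carry structural confounding concerns that the tool ratings quantify but do not eliminate. State in your protocol which starting point you use.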

What is the difference between risk of bias and study quality?

Risk of bias refers to the specific methodological features of a study that could lead its results to systematically deviate from the true effect. Study quality is a broader concept that can include reporting completeness, statistical rigor, sample size, and relevance. RoB 2 and ROBINS-I specifically assess risk of bias, not overall study quality. A study can be high quality in many respects (large sample, good reporting) while still having a high risk of bias (because of unblinded outcome assessment or missing data), and vice versa.

Does every systematic review need a GRADE Summary of Findings table?

Cochrane requires GRADE Summary of Findings tables for all intervention reviews. For reviews submitted to journals, GRADE is increasingly expected but not universally mandated. Reviews published in clinical guideline contexts almost always require GRADE, as the certainty ratings directly inform recommendation strength. If your journal does not require a formal GRADE table, you should still report certainty-of-evidence judgments in the results section and connect them explicitly to the risk-of-bias findings.

What is the minimum number of reviewers needed for a critical appraisal?

At least two reviewers working independently is the Cochrane standard. PRISMA 2020 item 8 requires you to report how many reviewers screened each record and whether they worked independently, and item 11 asks for the same details for the risk-of-bias assessment. Single-reviewer appraisal introduces the risk of errors and systematic bias that dual independent review catches. A common acceptable approach where resources are constrained is for one reviewer to complete all appraisals and a second reviewer to independently check a random 20 percent sample, with a full independent appraisal applied to any study that receives a High judgment to verify the rating.
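The resource-constrained approach can be made reproducible by drawing the verification sample with a fixed random seed and recording that seed in your methods. The Python sketch below is one possible way to do this, with invented study IDs and judgments; `select_for_second_review`, the default 20 percent fraction, and the seed value are assumptions for illustration.

```python
import math
import random


def select_for_second_review(judgments, fraction=0.20, seed=2026):
    """Pick studies for the second reviewer.

    judgments maps study ID -> the first reviewer's overall judgment.
    Returns a random sample of the given fraction plus every study judged High.
    """
    study_ids = sorted(judgments)                     # stable order before sampling
    n_sample = max(1, math.ceil(fraction * len(study_ids)))
    rng = random.Random(seed)                         # fixed seed so the draw can be repeated
    sampled = set(rng.sample(study_ids, n_sample))
    high = {s for s, j in judgments.items() if j == "High"}
    return sorted(sampled | high)


# Hypothetical first-reviewer judgments for ten studies.
first_pass = {f"S{i:02d}": "Low" for i in range(1, 11)}
first_pass["S03"] = "High"
first_pass["S08"] = "High"
first_pass["S05"] = "Some concerns"

to_check = select_for_second_review(first_pass)
print(to_check)
```

Report the sampling fraction, the seed, and the agreement rate on the checked studies so that readers can judge how much reassurance the partial check provides.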

Should I exclude high-risk-of-bias studies from the meta-analysis?

No. The standard practice is to include all studies in the primary analysis and to conduct a pre-specified sensitivity analysis restricted to low-risk-of-bias studies. Excluding studies based on risk of bias introduces its own selection bias (excluding evidence is itself a form of selective reporting) and reduces the power of the primary analysis. What the risk-of-bias assessment changes is the certainty of the pooled evidence, not the composition of the analysis.
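As a worked illustration of this answer, the Python sketch below pools four hypothetical trials with a fixed-effect inverse-variance model, first using all studies and then using only those rated Low. The effect sizes, standard errors, and ratings are invented, and the fixed-effect model is chosen only to keep the arithmetic short; many reviews would pre-specify a random-effects model instead.

```python
import math

# Hypothetical effect estimates (e.g. standardized mean differences) with standard errors.
studies = [
    {"id": "Trial A", "effect": -0.30, "se": 0.10, "rob": "Low"},
    {"id": "Trial B", "effect": -0.50, "se": 0.20, "rob": "High"},
    {"id": "Trial C", "effect": -0.20, "se": 0.10, "rob": "Low"},
    {"id": "Trial D", "effect": -0.60, "se": 0.25, "rob": "Some concerns"},
]


def pool_fixed_effect(rows):
    """Inverse-variance fixed-effect pooling: returns (estimate, standard error)."""
    weights = [1.0 / r["se"] ** 2 for r in rows]
    total = sum(weights)
    estimate = sum(w * r["effect"] for w, r in zip(weights, rows)) / total
    return estimate, math.sqrt(1.0 / total)


# Primary analysis keeps every study; the sensitivity analysis keeps Low only.
primary = pool_fixed_effect(studies)
low_only = pool_fixed_effect([s for s in studies if s["rob"] == "Low"])

for label, (est, se) in (("All studies", primary), ("Low risk of bias only", low_only)):
    print(f"{label}: {est:.3f} (95% CI {est - 1.96 * se:.3f} to {est + 1.96 * se:.3f})")
```

If the two estimates agree, the primary result is robust to risk of bias. If they diverge, the primary analysis still stands as pre-specified, but the divergence is evidence for downgrading on the GRADE risk-of-bias domain.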

What does ROBINS-I V2 change in practice?

ROBINS-I V2 introduces algorithms that take your signaling-question answers and map them to a proposed domain judgment, removing the interpretive ambiguity of V1. It also adds explicit guidance for immortal time bias and introduces a "strong/weak" response option for signaling questions. In practice, reviewers using V2 should expect the domain judgments to be more reproducible between reviewers because the derivation is now algorithmic rather than judgment-based. However, V2 is currently scoped to follow-up cohort designs; for other non-randomized designs, V1 remains the currently available standard. Confirm the current scope at riskofbias.info before finalizing your protocol.


Getting Your Critical Appraisal Right Before Submission

Critical appraisal is the section of your systematic review that reveals most clearly whether you have command of the evidence base you are synthesizing. A reviewer who sees RoB 2 applied per outcome, GRADE certainty ratings that trace back to specific domain-level findings, robvis traffic-light plots, and ROBINS-I V2 used for the non-randomized studies will conclude that the appraisal was conducted by researchers who understand what they are doing. That judgment affects how the rest of the review is read.

The resources you need: the RoB 2 tool and guidance documents at riskofbias.info; ROBINS-I V2 (November 2025 draft) at riskofbias.info/welcome/robins-i-v2; AMSTAR 2 at amstar.ca; QUADAS-2 at quadas.net; GRADEpro GDT at gradepro.org; and robvis at mcguinlu.github.io/robvis/.

Peer reviewers flag three specific errors in the methods section of a systematic review: applying the wrong appraisal tool, completing assessment at the study level when the protocol requires outcome-level assessment, and failing to connect risk-of-bias domain judgments to GRADE certainty ratings. They usually surface after months of searching, screening, and extraction are already complete. ScribeLab Writer's systematic review team, led by credentialed researchers with published systematic reviews in the biomedical literature, provides independent second-reviewer appraisal across RoB 2, ROBINS-I V2, QUADAS-2, and GRADE, producing appraisal tables and Summary of Findings tables for reviews targeting Tier 1 and Tier 2 journals. Submit your protocol details, study designs, and target journal through the inquiry form, and a member of the team will respond within 2-4 hours.

About the author

Dr. Alina Grace

Meta-Analysis & Synthesis Lead

PhD Epidemiology; MSc Evidence-Based Healthcare

Evidence synthesis lead specializing in PROSPERO-registered systematic reviews and meta-analysis.
