Heterogeneity in Meta-Analysis: What to Do When I² Is High

Most researchers treat I² as a pass/fail test for their meta-analysis. An I² of 78 percent triggers panic. An I² of 15 percent provides relief. Neither reaction is correct, because I² does not measure what most researchers think it measures. Borenstein and colleagues demonstrated in 2017 that I² is a relative statistic. It tells you the proportion of total observed variance attributable to between-study differences, not the actual magnitude of those differences (Borenstein et al., Research Synthesis Methods, 2017). Two meta-analyses can each show an I² of 85 percent while one has meaningful between-study variation and the other has trivial variation. The shared I² reflects how precise the included studies are, which is largely a matter of sample size, not how much the true effects actually differ.

High heterogeneity is the single most common reason peer reviewers request major revisions on a meta-analysis manuscript. Most of those revision requests are not asking you to redo the analysis. They are asking you to interpret it correctly. This guide covers every statistic used to assess heterogeneity, explains what each one does and does not tell you, and gives you a clear decision framework for what to do when heterogeneity is high. For research teams who need statistical support in interpreting and reporting heterogeneity, ScribeLab Writer's systematic review statistical analysis service covers meta-analysis support from model selection through to forest plot generation and manuscript preparation.

Quick Answer:

Three statistics assess heterogeneity in a meta-analysis. Cochran's Q tests whether between-study variation exceeds chance, using a p-value threshold of less than 0.10. I² measures the proportion of total variance due to between-study differences. It is a relative measure, not an absolute one, and high I² alone does not mean your pooled estimate is invalid. Tau² measures the actual variance of true effects across studies and is the more informative absolute measure. When heterogeneity is present, report all three plus a 95 percent prediction interval, which shows the range within which the true effect in a new study would be expected to fall. High unexplained heterogeneity calls for pre-specified subgroup analyses and sensitivity analyses, not abandonment of the meta-analysis. When clinical heterogeneity makes pooling inappropriate, use narrative synthesis under the SWiM reporting guideline.

What Heterogeneity Is and Why It Matters

Heterogeneity refers to variation between the results of studies included in a meta-analysis that exceeds what would be expected from sampling error alone. Three distinct types appear in the literature, and distinguishing between them matters for deciding how to respond.

Statistical heterogeneity is variation in the observed effect estimates across studies. This is what Q, I², and tau² measure. It is the starting point for any heterogeneity assessment.

Clinical heterogeneity is variation in the populations, interventions, comparators, or outcomes across studies. A meta-analysis pooling trials in elderly patients with high comorbidity burden alongside trials in young adults with a single diagnosis shows clinical heterogeneity, regardless of the statistics. Clinical heterogeneity is assessed by the reviewer's judgment, not by any statistical test.

Methodological heterogeneity is variation in study design and risk of bias. Pooling randomized controlled trials with quasi-experimental designs introduces methodological heterogeneity. Risk-of-bias assessment through critical appraisal identifies this source of variation.

These three types are related but not identical. Statistical heterogeneity can exist even when clinical and methodological heterogeneity appear minimal, because subtle differences between populations or interventions can produce real differences in true effects.

Cochran's Q: The Test for Heterogeneity

Cochran's Q tests the null hypothesis that all studies estimate the same true effect, that is, that all between-study variation is due to sampling error alone (Cochran, Biometrics, 1954). The test calculates the weighted sum of squared deviations of each study's effect estimate from the pooled estimate.

A statistically significant Q (conventionally at p less than 0.10, not p less than 0.05) suggests that true heterogeneity exists beyond chance. The 0.10 threshold compensates for the test's known limitations: Q has very low statistical power when few studies are included, making it prone to missing true heterogeneity in small meta-analyses, and it has excessive power when many large studies are included, making it almost certain to reach significance even when heterogeneity is trivially small.

Neither a non-significant Q nor a significant Q tells you how large the heterogeneity is or what to do about it. Q is a test, not a measure. The magnitude of heterogeneity is assessed using I², tau², and prediction intervals.
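
As a minimal sketch of the calculation described above, the Python below computes Q from study effect estimates and their within-study variances, together with an upper-tail chi-square p-value. The function names and the four-study data set are hypothetical and chosen only for illustration. In practice, your meta-analysis software reports these values directly.

```python
import math

def cochran_q(effects, variances):
    """Return (Q, df) from effect estimates and within-study variances."""
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    return q, len(effects) - 1

def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df (closed-form series)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (i + 0.5)
    return total

# Hypothetical log risk ratios and their variances for four studies.
effects = [-0.40, -0.10, -0.55, 0.05]
variances = [0.04, 0.02, 0.09, 0.03]
q, df = cochran_q(effects, variances)
print(f"Q = {q:.2f}, df = {df}, p = {chi2_sf(q, df):.3f}")
```

The p-value is compared against 0.10, as noted above. The output still says nothing about how large the heterogeneity is.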

I²: What It Measures and What It Does Not

The I² statistic, introduced by Higgins and colleagues, calculates what proportion of the total observed variance across studies is attributable to true between-study heterogeneity rather than sampling error alone (Higgins et al., BMJ, 2003;327:557–560).

The formula is: I² = ((Q − df) / Q) × 100 percent, where df is the number of studies minus one. When Q is smaller than df, the result is negative and is set to zero.
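
A short sketch of that formula in Python follows. The function name is hypothetical. The inputs are the Q and df values used in the example reporting paragraph later in this guide.

```python
def i_squared(q, df):
    """I² as a percentage; negative values are truncated to zero."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0

print(round(i_squared(42.6, 11), 1))  # 74.2
```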

The commonly cited benchmarks (25 percent low, 50 percent moderate, 75 percent high) come from the original Higgins paper and have been repeated so often that they are now treated as thresholds rather than rough descriptors. They are not thresholds. Higgins and colleagues presented them as tentative guidance, and the Cochrane Handbook explicitly cautions against mechanical application of these cutoffs.

Why I² alone is misleading: the Borenstein argument

Borenstein and colleagues demonstrated a fundamental limitation of I² that changes how you should interpret it (Borenstein et al., Research Synthesis Methods, 2017;8(1):5–18). Because I² is a proportion rather than an absolute value, its magnitude depends on the precision of the included studies. When studies are large and precise, almost all variance is attributable to true between-study differences. So I² is high even when tau² (the actual between-study variance) is small. When studies are small and imprecise, sampling error dominates, so I² can be low even when tau² is substantial.

The practical consequence: a meta-analysis of ten large trials with I² of 80 percent may have much smaller actual between-study variation than a meta-analysis of five small trials with I² of 40 percent. I² tells you the proportion; tau² tells you the magnitude. Both are necessary.
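
To make the point concrete, here is a small sketch using a standard approximation: I² is roughly tau² divided by the sum of tau² and a "typical" within-study variance. All numbers below are hypothetical and were picked to mirror the scenario just described. The function name is an assumption for illustration.

```python
def approx_i_squared(tau2, typical_within_variance):
    """Approximate I² (percent) as tau² / (tau² + typical within-study variance)."""
    return 100.0 * tau2 / (tau2 + typical_within_variance)

# Large, precise trials: small tau², very small sampling variance.
large_trials = approx_i_squared(tau2=0.01, typical_within_variance=0.0025)

# Small, imprecise trials: ten times more tau², but much larger sampling variance.
small_trials = approx_i_squared(tau2=0.10, typical_within_variance=0.15)

print(round(large_trials), round(small_trials))  # 80 40
```

The evidence base with the higher I² has one tenth of the between-study variance of the other, which is why tau² must be reported alongside I².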

The Cochrane Database of Systematic Reviews has documented a correlation of 0.93 between I² and tau², but only in the specific context of dichotomous outcomes. In other evidence bases, this correlation does not hold. Do not assume that high I² always means large tau².

Tau² and Tau: The Absolute Measure of Between-Study Variance

Tau² (written as τ²) is the between-study variance of the true effects. It represents the spread of the true underlying effects across the population of studies your meta-analysis is sampling from. Tau (τ) is its square root, the standard deviation of true effects, and is easier to interpret because it is on the same scale as the effect sizes.

Higgins and colleagues proposed rough benchmarks: tau of 0.1 represents small heterogeneity, 0.3 medium, and 0.5 large (relative to the typical effects seen in health interventions). These benchmarks are context-dependent. A tau of 0.3 on a log odds ratio scale may be inconsequential for a large preventive intervention and clinically decisive for a high-stakes surgical procedure.
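
One way to see what a given tau means is to translate it into an approximate range of true effects. The sketch below assumes the true effects are normally distributed on the log odds ratio scale and ignores uncertainty in the pooled estimate and in tau itself. The pooled odds ratio of 0.70 and the function name are hypothetical.

```python
import math
from statistics import NormalDist

def true_effect_range(pooled_log_or, tau, coverage=0.95):
    """Approximate range of true odds ratios implied by tau (log OR scale)."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    low = math.exp(pooled_log_or - z * tau)
    high = math.exp(pooled_log_or + z * tau)
    return low, high

low, high = true_effect_range(math.log(0.70), tau=0.3)
print(f"Roughly 95% of true odds ratios lie between {low:.2f} and {high:.2f}")
```

Whether a range of that width matters depends on the clinical context, which is the point of the caveat above.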

Estimation methods for tau²:

The DerSimonian-Laird (DL) method is the most widely used estimator. It is a moment-based approach that is computationally simple because it has a closed-form solution and needs no iteration. However, random-effects confidence intervals built on the DL estimate are known to be too narrow when fewer than 10 studies are included, leading to false precision.

Restricted maximum likelihood (REML) is now the Cochrane recommended default. REML produces more accurate confidence intervals, particularly for small meta-analyses. Cochrane made REML available to all RevMan users on 23 January 2024, and from 1 July 2025, all new protocol submissions to the Cochrane Central Editorial Service must use the new recommended methods, which include REML as the heterogeneity variance estimator, the Q-profile confidence interval for tau², and the Hartung-Knapp-Sidik-Jonkman (HKSJ) adjustment for the summary-effect confidence interval.

The Paule-Mandel and Sidik-Jonkman estimators are available in R and Stata and perform better than DL in specific contexts, but REML is the current recommended default for most applications.
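
For orientation, the DerSimonian-Laird estimator is simple enough to sketch in a few lines, which is part of why it became so common. The version below is an illustration only, with a hypothetical function name. REML requires iterative fitting and is best left to established software.

```python
def dersimonian_laird_tau2(effects, variances):
    """Method-of-moments (DL) estimate of tau², truncated at zero."""
    k = len(effects)
    if k < 2:
        return 0.0
    w = [1.0 / v for v in variances]  # fixed-effect weights
    sw = sum(w)
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw  # scaling constant
    return max(0.0, (q - (k - 1)) / c)

# Two hypothetical studies with equal precision.
print(dersimonian_laird_tau2([0.0, 2.0], [1.0, 1.0]))  # 1.0
```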

Prediction Intervals: The Number You Are Probably Not Reporting

The prediction interval is the single most clinically informative output from a random-effects meta-analysis, and it is also the most consistently under-reported.

A 95 percent confidence interval around a pooled effect tells you the precision of the average effect estimate. A 95 percent prediction interval tells you the range within which you would expect the true effect in a new, similar study to fall, given the between-study variation you have observed (IntHout et al., BMJ Open, 2016;6(7):e010247).

The distinction matters enormously in practice. A pooled relative risk of 0.75 with a 95 percent confidence interval of 0.65 to 0.87 suggests a beneficial effect with reasonable precision. If the 95 percent prediction interval is 0.45 to 1.25, the evidence is telling you something very different. In a new population, this intervention might produce anywhere from a 55 percent reduction to a 25 percent increase in risk. A clinician reading the confidence interval alone might implement the intervention confidently. A clinician reading the prediction interval would recognize that the effect varies substantially across contexts and populations.
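
The sketch below shows one common way a prediction interval is calculated: pooled estimate plus or minus a t critical value (k minus 2 degrees of freedom) times the square root of tau² plus the squared standard error of the pooled estimate. The standard error is back-calculated from the confidence interval above, assuming it was a normal-based interval on the log scale. The values of tau² (0.05) and k (12 studies) are hypothetical, since the example does not state them, and the critical value 2.228 is the tabulated t quantile for 10 degrees of freedom.

```python
import math

def prediction_interval(pooled, se_pooled, tau2, t_crit):
    """Prediction interval on the analysis scale (here, log risk ratio)."""
    half_width = t_crit * math.sqrt(tau2 + se_pooled ** 2)
    return pooled - half_width, pooled + half_width

log_rr = math.log(0.75)
se = (math.log(0.87) - math.log(0.65)) / (2 * 1.96)  # from the 95% CI
low, high = prediction_interval(log_rr, se, tau2=0.05, t_crit=2.228)
print(f"95% PI for RR: {math.exp(low):.2f} to {math.exp(high):.2f}")
```

Under these assumptions the interval comes out close to the 0.45 to 1.25 range discussed above. A different tau² or number of studies would shift it.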

Cochrane now supports prediction intervals in RevMan (added in 2024) and recommends reporting them alongside the pooled estimate. The current Cochrane Handbook Chapter 10, updated November 2024, treats prediction intervals as a standard output for random-effects analyses with two or more studies. They are not optional.

Struggling with high heterogeneity or a reviewer asking for subgroup analysis?
Interpreting I², tau², and prediction intervals correctly, selecting the right statistical model, and responding to reviewer comments on heterogeneity requires expertise in both research synthesis methodology and statistical reporting. ScribeLab Writer's systematic review statistical analysis team supports PhD students, MSN and DNP researchers, and faculty teams with meta-analysis modeling, heterogeneity assessment, subgroup analysis, and forest plot preparation for journal submission.

Fixed-Effect vs Random-Effects Models

The choice between a fixed-effect and a random-effects model is a methodological decision that must be made before seeing the data and stated in the protocol.

The fixed-effect model assumes that all studies estimate the same single true effect. Any variation in study results is entirely due to sampling error. This model is appropriate only when you have a strong prior reason to believe the studies are functionally identical in population, intervention, comparator, and outcome measurement.

The random-effects model assumes that the true effects vary across studies and that your included studies are a random sample from a distribution of true effects. This model is appropriate whenever meaningful clinical or methodological differences exist between studies, which is almost always the case in health research.

The Cochrane Handbook recommends random-effects models as the default for most systematic reviews, with transparent reporting of which estimator was used for tau².

Table 1: Fixed-Effect vs Random-Effects Model in Meta-Analysis

Feature	Fixed-Effect Model	Random-Effects Model
Core assumption	All studies estimate the same single true effect. Between-study variation is due to sampling error only.	Studies estimate different true effects drawn from a distribution. Between-study variance (tau²) is modeled.
When appropriate	Only when there is strong prior reason to believe all studies are functionally identical in population, intervention, comparator, and outcome.	Whenever meaningful clinical or methodological differences exist between studies. The default in most health research.
Weighting method	Inverse-variance weighting based on within-study variance only.	Inverse-variance weighting based on within-study variance plus tau² (between-study variance).
Recommended tau² estimator	Not applicable (no tau² in fixed-effect model).	REML (restricted maximum likelihood). Cochrane default since January 2024. Mandatory for new Cochrane protocols from July 2025.
Confidence interval adjustment	Standard inverse-variance CI. Can be too narrow when heterogeneity is present.	HKSJ adjustment recommended (Cochrane default 2024). Produces wider, more conservative CIs, especially important with fewer than 10 studies.
Prediction interval	Not applicable. Fixed-effect assumes a single true effect with no distribution of effects across contexts.	Required. Cochrane Handbook (2024) treats prediction intervals as standard output for all random-effects analyses.
Cochrane recommendation	Requires specific documented justification. Not the default.	Recommended as the default for most systematic reviews of health interventions.
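
The weighting row in Table 1 can be illustrated with a short sketch. Setting tau² to zero gives fixed-effect weights. Adding a positive tau² to every study's variance pulls the weights closer together, so small studies count for relatively more. The function name and the two-study data are hypothetical.

```python
import math

def pool(effects, variances, tau2=0.0):
    """Inverse-variance pooled estimate, its standard error, and the weights."""
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, math.sqrt(1.0 / total), weights

effects = [0.0, 1.0]      # one large study, one small study
variances = [0.01, 1.0]

for label, tau2 in (("fixed-effect", 0.0), ("random-effects", 1.0)):
    pooled, se, weights = pool(effects, variances, tau2)
    share = weights[1] / sum(weights)  # relative weight of the small study
    print(f"{label}: pooled = {pooled:.3f}, SE = {se:.3f}, small-study share = {share:.1%}")
```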

The HKSJ adjustment:

The Hartung-Knapp-Sidik-Jonkman (HKSJ) method applies an adjustment to the confidence interval of the summary effect in a random-effects analysis. It produces wider, more conservative confidence intervals, which is particularly important when fewer than ten studies are included, where DerSimonian-Laird confidence intervals are known to be too narrow. HKSJ is one of the new default methods in RevMan as of January 2024 and is recommended by the Cochrane Handbook for its better coverage properties.
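
A minimal sketch of the unmodified HKSJ calculation is shown below, assuming tau² has already been estimated. The variance of the pooled estimate is rescaled by a weighted residual term, and the interval uses a t critical value with k minus 1 degrees of freedom instead of the normal 1.96. The function name and data are hypothetical, the critical value must be supplied from a t table, and some implementations add safeguards that are not shown here.

```python
import math

def hksj_ci(effects, variances, tau2, t_crit):
    """Random-effects estimate with an HKSJ-adjusted confidence interval."""
    k = len(effects)
    w = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Weighted residual variance replaces the usual 1 / sum(w).
    scale = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, effects)) / (k - 1)
    se = math.sqrt(scale / sw)
    return mu, mu - t_crit * se, mu + t_crit * se

# Hypothetical log risk ratios from five studies; t(0.975, df = 4) = 2.776.
effects = [-0.35, -0.10, -0.60, 0.05, -0.25]
variances = [0.04, 0.03, 0.08, 0.05, 0.02]
mu, low, high = hksj_ci(effects, variances, tau2=0.03, t_crit=2.776)
print(f"pooled = {mu:.3f}, 95% CI {low:.3f} to {high:.3f}")
```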

What to Do When Heterogeneity Is High

High heterogeneity is not a reason to abandon the meta-analysis. It is a signal to investigate. The decision framework below reflects the current Cochrane Handbook guidance and the SWiM reporting guideline (Campbell et al., BMJ, 2020;368:l6890).

Step 1: Check for errors first. Before treating high I² as a real finding, verify that all effect estimates are on the same scale and in the same direction. Data entry errors, unit errors, and direction errors produce artificial heterogeneity that inflates I² and Q with no clinical meaning.

Step 2: Assess whether the heterogeneity has a clinical explanation. If the included studies vary in population, intervention intensity, follow-up duration, or outcome measurement tool, those differences are the likely source of statistical heterogeneity. Clinical heterogeneity that fully explains statistical heterogeneity does not require statistical modeling. It requires honest acknowledgment that the studies are not comparable enough to pool, and a narrative synthesis under SWiM is more appropriate than a pooled estimate.

Step 3: If unexplained, use random-effects and report prediction intervals. When heterogeneity is present but has no clear clinical explanation, use a random-effects model with REML estimation, report tau² and tau alongside I², and calculate and report the 95 percent prediction interval. The prediction interval is more informative than I² for communicating what the heterogeneity means clinically.

Step 4: Run pre-specified subgroup analyses. If your protocol pre-specified subgroup analyses for variables likely to explain heterogeneity (age group, intervention dose, setting, risk of bias level), run them. The Q-test for subgroup differences tests whether the pooled effect differs significantly between subgroups. If it does, and if the subgroup division was pre-specified, the heterogeneity has been at least partially explained.

Step 5: If a single study drives the heterogeneity, run a leave-one-out sensitivity analysis. Remove each study in turn and observe the change in I² and the pooled estimate. If removing one study substantially reduces I², that study is an outlier. Investigate why before excluding it. Outlier studies often differ on clinical characteristics that reveal important moderating variables.

Step 6: Consider whether to abandon pooling. Pooling is inappropriate when the clinical heterogeneity is so substantial that a pooled estimate would misrepresent each study's findings. The decision to abandon pooling should be made on clinical grounds, not on a statistical threshold. The SWiM guideline provides a structured reporting framework for narrative synthesis when pooling is not appropriate.

Subgroup Analysis

Subgroup analysis divides the included studies into pre-specified groups and calculates a separate pooled effect for each group. Its purpose is to investigate whether the overall effect differs across clinically meaningful categories.

The test for subgroup differences uses the Q-statistic applied between subgroup estimates rather than within them. A statistically significant result (p less than 0.05 for this test, not p less than 0.10) suggests the effect differs between subgroups.
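
As a sketch of how this test works, the code below treats each subgroup's pooled estimate as a single data point and computes Q across them. For two subgroups (one degree of freedom) the chi-square p-value has a simple closed form. The function names and subgroup values are hypothetical.

```python
import math

def subgroup_difference_q(estimates, variances):
    """Q between subgroup pooled estimates, with df = number of subgroups - 1."""
    w = [1.0 / v for v in variances]
    overall = sum(wi * e for wi, e in zip(w, estimates)) / sum(w)
    q_between = sum(wi * (e - overall) ** 2 for wi, e in zip(w, estimates))
    return q_between, len(estimates) - 1

def p_value_one_df(q):
    """Chi-square upper-tail probability, valid for df = 1 only."""
    return math.erfc(math.sqrt(q / 2.0))

# Hypothetical pooled log risk ratios (and variances) for two subgroups.
q, df = subgroup_difference_q([-0.45, -0.20], [0.01, 0.02])
print(f"Q between = {q:.2f}, df = {df}, p = {p_value_one_df(q):.2f}")
```

With more than two subgroups, take the p-value from a chi-square distribution with the corresponding degrees of freedom using your statistics software.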

Pre-specification is not optional. Subgroup analyses conducted after seeing the results are exploratory and must be labeled as such in the manuscript. Post-hoc subgroup analyses have a high rate of false positives and are a common reason peer reviewers request revisions or reject manuscripts.

Most methodologists recommend a minimum of ten studies per subgroup to produce meaningful results. With fewer studies, the subgroup analysis has inadequate power, and confidence intervals are too wide to interpret.

RevMan supports subgroup analysis through its subgroup function. R packages meta and metafor both support subgroup analysis with the test for subgroup differences. Stata's meta command includes subgroup analysis in its suite.

Meta-Regression

Meta-regression models the relationship between study-level characteristics (covariates) and the effect size. It is the analytical extension of subgroup analysis. Instead of dividing studies into discrete groups, it fits a regression line through the effect estimates.

Common covariates include publication year, mean age, proportion female, intervention dose or duration, follow-up length, and risk-of-bias score.

The ten-study-per-covariate rule: Meta-regression requires approximately ten studies per covariate to avoid overfitting. A meta-analysis with twelve studies should include at most one covariate in the meta-regression model. Violating this rule produces spurious associations.

Meta-regression is best treated as exploratory and hypothesis-generating, not confirmatory. The ecological fallacy risk is real: a study-level association between mean age and effect size does not prove that age modifies the effect at the patient level.
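
The sketch below shows the core of a single-covariate meta-regression as weighted least squares, plus the ten-studies-per-covariate check. It assumes the residual tau² is supplied as an input, whereas real software estimates it as part of the fit. Function names and data are hypothetical.

```python
import math

def max_covariates(n_studies):
    """Rule-of-thumb ceiling: one covariate per ten studies."""
    return n_studies // 10

def meta_regression(effects, variances, covariate, tau2=0.0):
    """Weighted least-squares fit of effect size on one study-level covariate."""
    w = [1.0 / (v + tau2) for v in variances]
    sw = sum(w)
    x_bar = sum(wi * x for wi, x in zip(w, covariate)) / sw
    y_bar = sum(wi * y for wi, y in zip(w, effects)) / sw
    sxx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, covariate))
    sxy = sum(wi * (x - x_bar) * (y - y_bar)
              for wi, x, y in zip(w, covariate, effects))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, math.sqrt(1.0 / sxx)

# Twelve hypothetical studies: log risk ratio against mean participant age.
ages = [42, 45, 48, 51, 53, 56, 58, 61, 63, 66, 68, 71]
effects = [-0.52, -0.47, -0.50, -0.41, -0.38, -0.40,
           -0.30, -0.33, -0.25, -0.22, -0.20, -0.15]
variances = [0.03] * 12

print("covariates allowed:", max_covariates(len(effects)))
intercept, slope, se_slope = meta_regression(effects, variances, ages, tau2=0.01)
print(f"slope per year of mean age = {slope:.4f} (SE {se_slope:.4f})")
```

A slope here describes how study-level averages relate to study-level effects, so the ecological-fallacy caveat above applies in full.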

Software: The metafor package in R (function rma) provides full meta-regression capabilities, including bubble plots. Stata's metareg command provides similar functionality. RevMan does not support meta-regression, only subgroup analysis.

Sensitivity Analysis

Sensitivity analysis tests whether the conclusions of your meta-analysis are robust to specific methodological decisions. Three types are standard.

Leave-one-out sensitivity analysis: Remove each included study in turn and re-calculate the pooled estimate. If the conclusion changes substantially when any single study is removed, the meta-analysis is not robust. Report whether removing any study changes the direction or statistical significance of the pooled estimate.

Restriction to low risk-of-bias studies: Restrict the analysis to studies rated as low risk of bias on the primary outcome using your appraisal tool. If the pooled estimate changes direction or loses significance, the overall conclusion is sensitive to the quality of the evidence. Risk of bias is one of the GRADE domains for downgrading certainty.

Pre-specified sensitivity groups: Restrict the analysis to peer-reviewed studies only, or to studies with more than twelve months of follow-up, or to studies with sample sizes above a pre-specified threshold. These analyses test whether specific inclusion decisions affect the conclusion.

All three sensitivity analysis types must be pre-specified in the protocol. Any sensitivity analysis conducted after the primary analysis is exploratory and requires transparent labeling.
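
The leave-one-out procedure described above can be sketched as a simple loop. This illustration uses the DerSimonian-Laird estimator only because it has a closed form. The function names and the four-study data set, which contains one deliberate outlier, are hypothetical.

```python
def summarize(effects, variances):
    """Return (random-effects pooled estimate, I² in percent) using DL tau²."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    return pooled, i2

def leave_one_out(effects, variances):
    """Re-run the synthesis with each study removed in turn."""
    rows = []
    for i in range(len(effects)):
        remaining_effects = effects[:i] + effects[i + 1:]
        remaining_variances = variances[:i] + variances[i + 1:]
        rows.append((i,) + summarize(remaining_effects, remaining_variances))
    return rows

effects = [0.10, 0.12, 0.09, 1.50]  # the last study is an outlier
variances = [0.01, 0.01, 0.01, 0.01]
for index, pooled, i2 in leave_one_out(effects, variances):
    print(f"without study {index}: pooled = {pooled:.3f}, I² = {i2:.1f}%")
```

A sharp drop in I² when one study is omitted identifies the study to investigate. It is not, by itself, a reason to exclude that study.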

How to Write the Heterogeneity Section in Your Manuscript

In the methods section, state which heterogeneity statistics you will report (Q, I², tau², prediction interval), which statistical model you will use (fixed-effect or random-effects), which tau² estimator you selected and why (REML is the Cochrane recommended default), and which subgroup and sensitivity analyses are pre-specified.

In the results section, report the statistics in this sequence: Q statistic (degrees of freedom, p-value), I² (with 95 percent confidence interval where software provides one), tau² and tau, and the 95 percent prediction interval for all random-effects analyses. This sequence maps to PRISMA 2020 items 13d (methods used to identify the presence and extent of statistical heterogeneity) and 20b (results of all statistical syntheses, including measures of statistical heterogeneity).

Example reporting paragraph:

"Statistical heterogeneity was substantial (Q = 42.6, df = 11, p less than 0.001; I² = 74.2 percent; tau² = 0.18; tau = 0.42). The 95 percent prediction interval for the random-effects pooled estimate of RR 0.72 (95 percent CI 0.61 to 0.85) was 0.39 to 1.32, indicating that in a new similar study, the true effect could plausibly range from a 61 percent reduction to a 32 percent increase in risk. Pre-specified subgroup analysis by intervention intensity revealed no statistically significant difference between subgroups (Q for subgroup differences = 2.1, df = 1, p = 0.15)."

In the discussion section, interpret what the heterogeneity means clinically. State whether the prediction interval changes the clinical recommendation compared with the point estimate alone. If sensitivity analyses changed the pooled estimate substantially, discuss the implications.

Table 2: Heterogeneity Statistics Reference: What Each Statistic Measures and When to Report It

Statistic	What It Measures	Key Limitation	Required to Report?
Cochran's Q	Tests whether between-study variation exceeds what sampling error alone would produce. Significance threshold: p less than 0.10.	Low power with few studies; excessive power with many large studies. Tests for the presence of heterogeneity only, not its magnitude.	Yes. PRISMA 2020 items 13d and 20b.
I²	Proportion of total observed variance attributable to between-study differences. Benchmarks: 25% low, 50% moderate, 75% high (descriptive, not thresholds).	Relative measure, not absolute. Depends on the study sample sizes. High I² does not always mean large tau². Never use alone to decide whether to pool.	Yes. PRISMA 2020 item 20b.
Tau² and tau	Tau² is the between-study variance of true effects (absolute measure). Tau is its square root; it is on the same scale as the effect sizes and is the more interpretable of the two.	Context-dependent: what counts as small or large tau varies by the typical effect size in the field. Estimator choice (REML vs DerSimonian-Laird) affects the value.	Yes. Required alongside I² under current Cochrane Handbook guidance (2024).
95% Prediction interval	The range within which the true effect in a new similar study is expected to fall, given observed between-study variation. The most clinically useful heterogeneity output.	Unreliable with fewer than three studies. Requires a random-effects model. Not applicable in fixed-effect analyses.	Yes, for all random-effects analyses. Cochrane Handbook (2024) treats this as standard output.
Q for subgroup differences	Tests whether the pooled effect differs significantly between pre-specified subgroups. Uses p less than 0.05 threshold (different from the overall Q test).	Low power with few studies per subgroup. A minimum of approximately 10 studies per subgroup for meaningful results. Only interpretable for pre-specified subgroups.	Yes, when subgroup analysis is conducted. PRISMA 2020 item 20c.

Common Heterogeneity Mistakes in Meta-Analysis Manuscripts

Treating I² benchmarks as thresholds. The 25/50/75 percent benchmarks are descriptive suggestions from the original Higgins paper, not decision cutoffs. Stating that I² of 52 percent indicates "moderate heterogeneity, which is acceptable" uses a guideline as a judgment, which peer reviewers will challenge.

Not reporting tau² alongside I². I² and tau² measure different things. A manuscript that reports only I² provides an incomplete picture of heterogeneity, which sophisticated reviewers will immediately flag.

Not reporting prediction intervals. The Cochrane Handbook (2024) treats prediction intervals as a standard output. A random-effects meta-analysis without a prediction interval is incomplete under current reporting standards.

Conducting post-hoc subgroup analyses and reporting them as confirmatory. Any subgroup analysis not pre-specified in the protocol is exploratory. Presenting it as a definitive finding is a form of outcome-reporting bias.

Abandoning meta-analysis because of high I². High I² is not a methodological ground for abandoning pooling. Clinical heterogeneity is. If the studies are clinically comparable enough to pool, a random-effects model with prediction intervals handles the statistical heterogeneity appropriately.

Using fixed-effect models without justification. Fixed-effect models assume all studies estimate the same true effect. This assumption is rarely justified in clinical research. The default should be random-effects with REML unless there is a specific, documented reason to assume a single common effect.

Heterogeneity Across Systematic Review Contexts

Cochrane reviews: REML is now mandatory for new Cochrane protocol submissions, effective July 2025. All Cochrane reviews must report tau² alongside I² and should include prediction intervals for all random-effects analyses. The revised Cochrane Handbook Chapter 10 (November 2024) is the governing reference.

Non-Cochrane journal submissions: Major clinical journals (The Lancet, BMJ, JAMA, NEJM) require PRISMA-2020-compliant reporting of heterogeneity. PRISMA item 13d requires describing the methods used to identify the presence and extent of statistical heterogeneity, and item 20b requires presenting the results of each synthesis, including measures of statistical heterogeneity. Manuscripts missing these items risk desk rejection at high-impact journals.

Nursing and DNP research: Systematic reviews and meta-analyses in nursing journals (Journal of Advanced Nursing, Worldviews on Evidence-Based Nursing, Nursing Research) follow Cochrane and PRISMA standards. The same heterogeneity reporting requirements apply, though nursing-focused reviews often use smaller evidence bases where the HKSJ adjustment is particularly important.

International research: The reporting standards are consistent across the US, UK, Australian, and Gulf-region research contexts because all four regions follow PRISMA 2020 and Cochrane methodology for systematic reviews. The statistical software differs by institution (RevMan in Cochrane-affiliated programs; R and Stata in most independent research programs), but the reporting requirements are identical.

Frequently Asked Questions About Heterogeneity in Meta-Analysis

What I² value indicates that a meta-analysis is too heterogeneous to pool?

No specific I² value determines whether pooling is appropriate. The decision depends on clinical judgment about whether the studies are comparable enough to pool, not on a statistical threshold. Meta-analyses with I² above 75% can still yield valid and useful pooled estimates under a random-effects model with prediction intervals, provided the clinical question justifies combining the studies. The prediction interval, not I², is the most informative guide to whether the pooled estimate is clinically useful.

Is it acceptable to use a fixed-effect model when heterogeneity is low?

Low heterogeneity does not justify choosing a fixed-effect model. The model choice should be based on whether you believe all studies estimate the same single true effect, not on the value of I². A random-effects model is defensible in nearly all clinical research contexts. A fixed-effect model requires a specific, pre-specified justification.

What is the difference between a subgroup analysis and meta-regression?

Subgroup analysis divides studies into discrete categories (e.g., high vs low dose) and compares the pooled effect across groups. Meta-regression fits a continuous regression line relating a study-level covariate to the effect size. Subgroup analysis is appropriate for categorical moderators with a clear theoretical basis. Meta-regression is appropriate for continuous moderators or when you want to model the relationship between a covariate and effect size across the full range of observed values.

Can I run meta-regression in RevMan?

No. RevMan does not support meta-regression. Use the metafor or meta package in R, Stata's metareg command, or Comprehensive Meta-Analysis (CMA) software for meta-regression analyses.

My prediction interval crosses the line of no effect, but my pooled estimate is statistically significant. What do I report?

Report both findings and interpret both clearly. The statistically significant pooled estimate shows that across the included studies, the average effect is distinguishable from no effect. The prediction interval crossing the line of no effect shows that in a new study with characteristics similar to those in your evidence base, the true effect might be null or even in the opposite direction. This combination is clinically important: it means the intervention works on average but with substantial variability across contexts. Your discussion should address whether this variability limits the strength of the clinical recommendation.

How do I report heterogeneity when I have fewer than three studies?

With fewer than three studies, Cochran's Q has very low power, and I² is highly unreliable. In this situation, report Q and I² but note their unreliability explicitly, do not interpret the p-value for Q as a definitive test of heterogeneity, and do not report a prediction interval (the t-distribution it relies on is unreliable with very few studies). Consider whether meta-analysis is appropriate at all with fewer than three studies, or whether narrative synthesis would better serve the clinical question.

What does it mean if removing one study in leave-one-out sensitivity analysis changes my conclusion?

It means your conclusion is sensitive to the inclusion of that study. This is not a flaw in your analysis. It is a finding that requires transparency. Report the sensitivity analysis results, describe how that study differs from the others (population, setting, risk of bias, intervention intensity), and acknowledge that the conclusion depends on whether that study's context generalizes to the target population. If the study has a high risk of bias, its influence on the overall conclusion is a GRADE downgrading consideration.

Reporting Heterogeneity That Survives Peer Review

A meta-analysis that correctly handles heterogeneity reports Q, I², tau², and prediction intervals for each random-effects synthesis. It uses REML as the default tau² estimator. It prespecifies subgroup analyses and labels any post hoc analyses as exploratory. It decides to pool or not pool on clinical grounds, not on a statistical threshold. And it presents the prediction interval alongside the pooled estimate so the reader can see what the evidence means for the next population, not just for the average of those already studied.

The goal is not to minimize heterogeneity or to explain it away. The goal is to quantify it accurately, communicate it clearly, and let the reader understand what the variation means for clinical practice or policy.

Reporting I² without tau² or prediction intervals, selecting a fixed-effect model without pre-specified justification, or omitting the REML estimator from the heterogeneity report are the statistical errors that reviewers at JAMA, The Lancet, and Systematic Reviews identify immediately, and that require reanalysis after the manuscript has already been submitted. ScribeLab Writer's systematic review statistical analysis team, led by credentialed researchers with published systematic reviews in the biomedical literature, covers model selection, heterogeneity statistics, prediction intervals, subgroup and sensitivity analysis, and forest plot preparation for reviews targeting Tier 1 and Tier 2 clinical journals. Submit your data, current analysis output, and target journal via the inquiry form, and a team member will respond within 2-4 hours.
