## Abstract

Neuroimaging relies on separate statistical inferences at tens of thousands of spatial locations. Such massively univariate analysis typically requires an adjustment for multiple testing in an attempt to maintain the family-wise error rate at a nominal level of 5%. First, we examine three sources of substantial information loss that are associated with the common practice under the massively univariate framework: (a) the hierarchical data structures (spatial units and trials) are not well maintained in the modeling process; (b) the adjustment for multiple testing leads to an artificial step of strict thresholding; (c) information is excessively reduced during both modeling and result reporting. These sources of information loss have far-reaching impacts on result interpretability as well as reproducibility in neuroimaging. Second, to improve inference efficiency, predictive accuracy, and generalizability, we propose a Bayesian multilevel modeling framework that closely characterizes the data hierarchies across spatial units and experimental trials. Rather than analyzing the data in a way that first creates multiplicity and then resorts to a post hoc solution to address them, we suggest directly incorporating the cross-space information into one single model under the Bayesian framework (so there is no multiplicity issue). Third, regardless of the modeling framework one adopts, we make four actionable suggestions to alleviate information waste and to improve reproducibility: (1) model data hierarchies, (2) quantify effects, (3) abandon strict dichotomization, and (4) report full results. We provide examples for all of these points using both demo and real studies, including the recent Neuroimaging Analysis Replication and Prediction Study (NARPS).

**DOI:** 10.52294/2e179dbf-5e37-4338-a639-9ceb92b055ea

**Keywords:** Bayesian multilevel modeling; data hierarchy; dichotomization; effect magnitude; information waste; multiple testing problem; result reporting

**INTRODUCTION**

*Statisticians classically asked the wrong question – and were willing to answer with a lie. They asked “Are the effects of A and B different?” and they were willing to answer “no.”*

*All we know about the world teaches us that the effects of A and B are always different – in some decimal place – for any A and B. Thus asking “are the effects different?” is foolish.*

John W. Tukey, “The Philosophy of Multiple Comparisons,” Statistical Science (1991)

Functional magnetic resonance imaging (FMRI) is a mainstay technique of human neuroscience, which allows the study of the neural correlates of many functions, including perception, emotion, and cognition. The basic spatial unit of FMRI data is a *voxel* ranging from 1 to 3 mm on each side. As data are collected across time when a participant performs tasks or remains at “rest,” FMRI datasets contain a time series at each voxel. Typically, tens of thousands of voxels are analyzed simultaneously. Such a “divide and conquer” approach through *massively univariate analysis* necessitates some form of multiple testing adjustment via procedures based on Bonferroni’s inequality or false discovery rate.

Conventional neuroimaging inferences follow the null hypothesis significance testing framework, where the decision procedure dichotomizes the available evidence into two categories at the end. Thus, one part of the evidence survives an adjusted threshold at the whole-brain level and is considered *statistically significant* (informally interpreted as a “true” effect); the other part is ignored (often misinterpreted as “not true”) and by convention omitted and hidden from public view (i.e., the file drawer problem).

A recent study^{1} (referred to as NARPS hereafter) offers a salient opportunity for the neuroimaging community to reflect about common practices in statistical modeling and the communication of study findings. The study recruited 70 teams charged with the task of analyzing a particular FMRI dataset and reporting results; the teams simply were asked to follow data analyses routinely employed in their labs at the whole-brain voxel level (but note that nine specific research hypotheses were restricted to only three brain regions). NARPS found large variability in reported decisions, which were deemed to be sensitive to analysis choices ranging from preprocessing steps (e.g., spatial smoothing, head motion correction) to the specific approach used to handle multiple testing. Based on these findings, NARPS outlined potential recommendations for the field of neuroimaging research.

Despite useful lessons revealed by the NARPS investigation, the project also exemplifies the common approach in neuroimaging of generating categorical inferential conclusions as encapsulated by the “significant versus nonsignificant” maxim. In this context, we address the following questions:

Are conventional multiple testing adjustment methods informationally wasteful?

NARPS suggested that there was “substantial variability” in reported results across teams of investigators analyzing the same dataset. Is this conclusion dependent, at least in part, on the common practice of ignoring spatial hierarchy at the global level and drawing inferences binarily (i.e., “significant” vs. “nonsignificant”)?

What changes can the neuroimaging field make in modeling and result reporting to improve reproducibility?

In this context, we consider inferential procedures not strictly couched in the standard null hypothesis significance testing framework. Rather, we suggest that multilevel models, particularly when constructed within a Bayesian framework, provide powerful tools for the analysis of neuroimaging studies given the data’s inherent hierarchical structure. As our paper focuses on hierarchical modeling and dichotomous thinking in neuroimaging, we do not discuss the broader literature on Bayesian methods applied to FMRI.^{2}

**MASSIVELY UNIVARIATE ANALYSIS AND MULTIPLE TESTING**

We start with a brief refresher of the conventional statistical framework typically adopted in neuroimaging. Statistical testing begins by accepting the null hypothesis but then rejecting it in favor of the alternative hypothesis if the current data for the effect of interest (e.g., task A vs. task B) or potentially more extreme observations are unlikely to occur under the assumption that the effect is absolutely zero. Because the basic data unit is the voxel, one faces the problem of performing tens of thousands of inferences across space *simultaneously*. As the spatial units are not independent of one another, adopting an adjustment such as Bonferroni’s is unreasonably conservative. Instead, the field has gradually settled into employing a cluster-based approach: what is the size of a spatial cluster that would be unlikely to occur under the null scenario?

Accordingly, a two-step procedure is utilized: first threshold the voxelwise statistical evidence at a particular (or a range of) voxelwise *p*-value (e.g., 0.001) and then consider only contiguous clusters of evidence (Fig. 1). Several adjustment methods have been developed to address multiple testing by leveraging the spatial relatedness among neighboring voxels. The stringency of the procedures has been extensively debated over the past decades, with the overall probability of having clusters of a minimum spatial extent given a null effect estimated by two common approaches: a parametric method^{3},^{4} and a permutation-based adjustment.^{5} For the former, recent recommendations have resulted in the convention of adopting a primary threshold of voxelwise *p* = 0*.*001 followed by cluster size determination^{6},^{7}; for the latter, the threshold is based on the integration between a range of statistical evidence and the associated spatial extent.^{5}

**Problems of multiple testing adjustments**

At least five limitations are associated with multiple testing adjustments leveraged through spatial extent.^{8}

*Conceptual inconsistency*. Consider that the staples of neuroimaging research are the maps of statistical evidence and associated tables. Both typically present only the statistic (e.g.,*t*) values. However, this change of focus is inconsistent with cluster-based inference: after multiple testing adjustment, the proper unit of inference is the cluster, not the voxel. Once “significant” clusters are determined, one*should*only speak of clusters and the voxels inside each cluster*should*no longer be considered meaningful inferentially. In other words, the statistical evidence for each surviving cluster is deemed at the “significance” level of 0.05 and the voxelwise statistic values lose direct interpretability. Therefore, voxel-level statistic values in brain maps and tables in the literature should not be taken at face value.*Spatial ambiguity*. As a cluster is purely defined through statistical evidence, it is usually not aligned with any anatomical region, presenting a spatial specificity problem. To resolve the issue, the investigator typically reduces the cluster to a “peak” voxel with the highest statistical value and uses its location as evidence for the underlying region. A conceptual inconsistency results from these two transitional steps: one from a cluster to its peak voxel and then another from the voxel to an anatomical region. Furthermore, when a cluster spans over more than one anatomical region, no definite solutions are available to resolve the inferential difficulty. Although these issues of conceptual inconsistency and spatial ambiguity have been discussed in the past,^{7},^{8}it remains underappreciated, and researchers commonly do not adjust their presentations to match the cluster-level effective resolution.*Heavy penalty against small regions.*With the statistical threshold at the spatial unit level traded off with cluster extent, larger regions might be able to survive with relatively weaker statistical evidence while smaller regions would have to require much stronger statistical strength. Therefore, multiple testing adjustments always penalize small clusters. Regardless of the specific adjustment method, anatomically small regions (e.g., those in the subcortex) are intrinsically disadvantaged even if they have the same amount of statistical evidence. In other words, ideally, the evidence for a brain region should be assessed solely in light of its effect magnitude, not dependent on its anatomical size; thus, the conventional multiple testing adjustment approaches are unfair to small regions because of its heavy reliance on spatial relatedness among the contiguous neighborhood.*Sensitivity to data domain.*As the penalty for multiplicity becomes heavier when more spatial units are involved, one could explore various surviving clusters by changing the data space, resulting in some extent of arbitrariness: even though the data remain the same, a cluster may survive or fail depending on the investigator’s choice of spatial extent for the data. Because of this vulnerability, it is not easy to draw a clear line between a justifiable reduction of data and an exploratory search (e.g., “small volume correction”).*Difficulty of assigning uncertainty.*As the final results are inferred at the cluster level, there is no clear uncertainty that can be attached to the effect magnitude at the cluster level. On the one hand, a cluster either survives or not under a dichotomous decision. On the other hand, due to the interpretation difficulty of voxel-level statistical evidence, it remains challenging to have, for example, a standard error (“error bar”) associated with the average effect at the cluster level.

Multiple testing adjustments are tied to the current result reporting practice. It is worth remembering a key goal of data processing and statistical modeling: to take a massive amount of data that is not interpretable in its raw state and to extract and distill meaningful information. The preprocessing parts aim to reduce distortion effects, whereas statistical models intend to account for various effects. Overall, there is a broad trade-off along the “analysis pipeline”: we increase the digestibility of the information at the cost of reducing information. Fig. 2A illustrates these key aspects of the process of information extraction in standard FMRI analysis. The input data of time series across the brain for multiple participants are rich in information but of course not easily interpretable or “digestible.” After multiple preprocessing steps followed by massively univariate analysis, the original data are condensed into two pieces of information at each spatial unit: the point estimate and the standard error. Whereas this process entails considerable reduction of information, it produces usefully digestible results; we highlight this trade-off in Fig. 2B. Here, “information” refers broadly to the amount and content of data present in a stage (e.g., for the raw data, the number of groups, participants, time series lengths). “Digestibility” refers to the ease with which the data are presentable and understandable (e.g., two 3D volumes vs. one; a 3D volume vs. a table of values). Following common practice, many investigators discard effect magnitude information to focus on summary statistics, which are then used to make binarized inferences by taking into account multiple testing. These steps certainly aid in reporting results and summarizing potentially some notable aspects of the data. However, we argue that the overall procedure leads to information waste and that the gained digestibility is relatively small (in addition to generating problems when results are compared across studies). Whereas we focus our discussion on whole-brain voxel-based analyses, similar issues apply in other types of analysis for region-based and matrix-based data.