conninfpy.harmonize
Multi-site harmonization for connectivity data, with design-matrix diagnostics.
Implements parametric empirical-Bayes ComBat (Johnson, Li & Rabinovic
2007; Fortin et al. 2017/2018) directly in NumPy — no dependency on
neuroHarmonize or neurocombat.
ComBat model per feature \(j\):
where \(s\) indexes site, \(i\) indexes subject within site, and \(X_{s,i}\) holds covariates to preserve (age, sex, diagnosis). Estimation uses an empirical-Bayes shrinkage of per-(site, feature) location \(\gamma\) and scale \(\delta^2\) toward site-level priors, iterated to convergence.
Use this when your cohort pools subjects from multiple scanners / sites and you want to remove site-aligned variance that is orthogonal to the biological covariates you care about.
The combat_fit() / combat_apply() separation supports
cross-site machine-learning transfer (fit ComBat on training cohorts,
apply to held-out sites at test time).
The compute_vif() and design_diagnostics() helpers are
included here because confound regression (in glm_stats)
and harmonization both lean on a well-conditioned design matrix.
References
Johnson, W. E., Li, C., & Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118–127. doi:10.1093/biostatistics/kxj037.
Fortin, J.-P. et al. (2017). Harmonization of multi-site diffusion tensor imaging data. NeuroImage 161:149–170.
Fortin, J.-P. et al. (2018). Harmonization of cortical thickness measurements across scanners and sites. NeuroImage 167:104–120.
Multi-site harmonization for connectivity data.
Implements parametric empirical-Bayes ComBat (Johnson, Li & Rabinovic 2007;
Fortin et al. 2017) directly in numpy — no dependency on neuroHarmonize
or neurocombat. Plus design-matrix diagnostics (VIF, condition number).
ComBat model per feature j:
Y_ij_s = α_j + X_i β_j + γ_sj + δ_sj · ε_ij_s, ε ~ N(0, σ_j²)
where s indexes site, i indexes subject within site, and X_i holds
covariates to preserve (age, sex, diagnosis). The parameterization follows
Fortin 2017 / neuroCombat: site dummies are encoded one-hot (no reference
site dropped); α is the sample-size-weighted grand mean of per-site
intercepts; σ² uses the biased var.pooled denominator (divide by
n, not n − p). This matches the canonical neuroCombat reference
implementation to machine precision under EB and without — validated in
tests/test_combat_equivalence.py.
Estimation:
Fit OLS with site dummies + preserved covariates to obtain per-site intercepts,
β̂, and pooledσ̂_j.Standardize residuals per feature.
Estimate per-(site, feature) location
γ̂and scaleδ̂²from the standardized data.Empirical Bayes shrinkage using site-level conjugate priors, iterated to convergence.
Adjust: subtract the shrunken site effect, re-add the preserved-covariate fit and the pooled scale.
Use this when your cohort pools subjects from multiple scanners/sites and you want to remove the site-aligned variance that is orthogonal to the biological covariates you care about. Fortin 2018 discusses the limits.
Typical use
>>> from conninfpy import combat_harmonize
>>> result = combat_harmonize(Y, sites=site_labels, preserve=covariates)
>>> Y_adj = result.Y_adjusted # same shape as Y
>>> result.diagnostics # ratio of between-site variance before/after
References
Johnson WE, Li C, Rabinovic A (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8(1):118-27.
Fortin J-P, Parker D, Tunç B, et al. (2017). Harmonization of multi-site diffusion tensor imaging data. NeuroImage 161:149-170.
Fortin J-P et al. (2018). Harmonization of cortical thickness measurements across scanners and sites. NeuroImage 167:104-120.
- class conninfpy.harmonize.ComBatModel(site_labels: ~typing.List[~typing.Any], alpha: ~numpy.ndarray[tuple[~typing.Any, ...], ~numpy.dtype[~numpy.float64]] = <factory>, beta: ~numpy.ndarray[tuple[~typing.Any, ...], ~numpy.dtype[~numpy.float64]] | None = None, sigma: ~numpy.ndarray[tuple[~typing.Any, ...], ~numpy.dtype[~numpy.float64]] = <factory>, gamma_star: ~numpy.ndarray[tuple[~typing.Any, ...], ~numpy.dtype[~numpy.float64]] = <factory>, delta_star: ~numpy.ndarray[tuple[~typing.Any, ...], ~numpy.dtype[~numpy.float64]] = <factory>, eb: bool = True, preserve_n_cols: int = 0)[source]
Bases:
objectFitted parametric ComBat state, for later application to new data.
- class conninfpy.harmonize.CombatResult(Y_adjusted: ~numpy.ndarray[tuple[~typing.Any, ...], ~numpy.dtype[~numpy.float64]], model: ~conninfpy.harmonize.ComBatModel, diagnostics: ~typing.Dict[str, ~typing.Any] = <factory>)[source]
Bases:
objectReturn value of
combat_harmonize().- model: ComBatModel
- conninfpy.harmonize.combat_harmonize(Y: ndarray[tuple[Any, ...], dtype[float64]], sites: Sequence[Any], preserve: ndarray[tuple[Any, ...], dtype[float64]] | None = None, eb: bool = True, return_diagnostics: bool = True) CombatResult[source]
Fit + transform: parametric empirical-Bayes ComBat on the full cohort.
See module docstring for the model and references. Handles either
(n, N, N)connectivity matrices or(n, p)pre-flattened features.- Parameters:
Y (ndarray of shape (n, N, N) or (n, p)) – Input data.
sites (sequence of length n) – Site label per subject.
preserve (ndarray of shape (n,) or (n, k), optional) – Covariates whose variance should be preserved through ComBat. Pass the GLM nuisance design (age, sex, motion, …) here and omit the variable that downstream inference will test — that avoids the Nygaard 2016 label leak in which the harmonization fit absorbs label-aligned variance the permutation null cannot recover. See
analyze(), which setspreserve = confoundsautomatically under Strategy D, and [[paper_combat_resolution_strategies]] for the full derivation.eb (bool, default=True) – Apply empirical-Bayes shrinkage on site effects (recommended).
return_diagnostics (bool, default=True) – If True, compute a between-site / within-site variance ratio before and after correction, plus the explicit
between_site_variance_ratio_after_over_beforekey used byconninfpy.analyze()for residual-site-variance flags.
- Returns:
Y_adjusted(same shape as input),model, anddiagnostics(between-site variance ratio before/after, after/before ratio, ratio reduction, and per-site sample sizes).- Return type:
- conninfpy.harmonize.combat_fit(Y: ndarray[tuple[Any, ...], dtype[float64]], sites: Sequence[Any], preserve: ndarray[tuple[Any, ...], dtype[float64]] | None = None, eb: bool = True) ComBatModel[source]
Fit parametric ComBat on a training cohort.
- Parameters:
Y (ndarray of shape (n, N, N) or (n, p)) – Connectivity matrices (3D) or pre-flattened features (2D).
sites (sequence of length n) – Site label per subject. Any hashable (strings, ints).
preserve (ndarray of shape (n,) or (n, k), optional) – Covariates whose effect on Y should be preserved (e.g. diagnosis, age, sex). Their fitted coefficients survive harmonization.
eb (bool, default=True) – If True (standard ComBat), apply empirical-Bayes shrinkage on the per-(site, feature) effects. If False, use the raw method-of-moments estimates — faster but noisier for small per-site sample sizes.
- Returns:
Fitted state, usable with
combat_apply().- Return type:
- conninfpy.harmonize.combat_apply(model: ComBatModel, Y: ndarray[tuple[Any, ...], dtype[float64]], sites: Sequence[Any], preserve: ndarray[tuple[Any, ...], dtype[float64]] | None = None) ndarray[tuple[Any, ...], dtype[float64]][source]
Apply a fitted ComBat model to new data. Sites must all be present in the training cohort; unseen sites raise ValueError.
- conninfpy.harmonize.compute_vif(X: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[float64]][source]
Variance inflation factor per design-matrix column.
VIF_j = 1 / (1 - R²_j)whereR²_jis from regressing column j on all other columns.VIF > 5is a common concern threshold;> 10is strong collinearity (Kutner et al. 2004).Constant columns (e.g. intercept) get
VIF = 1by convention.- Parameters:
X (ndarray of shape (n, p)) – Design matrix.
- Returns:
VIF per column.
- Return type:
ndarray of shape (p,)
- conninfpy.harmonize.design_diagnostics(X: ndarray[tuple[Any, ...], dtype[float64]], names: Sequence[str] | None = None) Dict[str, Any][source]
Diagnostic report for a GLM design matrix.
- Returns a dict with:
condition_number: condition number ofX'X(> 30 is caution, > 100 is strong concern)vif: VIF per columnvif_max,vif_max_col: most collinear columncorrelation: pairwise Pearson correlation matrix (p, p)flags: list of plain-English warnings when thresholds are crossed
- Parameters:
X (ndarray of shape (n, p)) – Design matrix (should include intercept if used in the GLM).
names (sequence of str, optional) – Column names for reporting. Defaults to
['x0', 'x1', ...].
- conninfpy.harmonize.flatten_upper(Y: ndarray[tuple[Any, ...], dtype[float64]]) Tuple[ndarray[tuple[Any, ...], dtype[float64]], Tuple[ndarray[tuple[Any, ...], dtype[_ScalarT]], ndarray[tuple[Any, ...], dtype[_ScalarT]]], int][source]
Flatten the upper triangle (k=1) of a (n, N, N) connectivity stack.
- Parameters:
Y (ndarray of shape (n_subjects, N, N)) – Symmetric connectivity matrices with zero diagonal.
- Returns:
features (ndarray of shape (n_subjects, N*(N-1)/2)) – Flattened upper triangles — one feature per edge.
triu_idx (tuple of ndarray) –
(rows, cols)index tuple usable asM[triu_idx]to invert.N (int) – Side length of the original matrices.
- conninfpy.harmonize.unflatten_upper(features: ndarray[tuple[Any, ...], dtype[float64]], triu_idx: Tuple[ndarray[tuple[Any, ...], dtype[_ScalarT]], ndarray[tuple[Any, ...], dtype[_ScalarT]]], N: int) ndarray[tuple[Any, ...], dtype[float64]][source]
Reconstruct (n, N, N) symmetric matrices (zero diagonal) from flattened upper-triangle features.
- conninfpy.harmonize.block_mass(p_full: ndarray[tuple[Any, ...], dtype[float64]], net_labels: ndarray[tuple[Any, ...], dtype[int64]], alpha: float = 0.05, *, return_upper: bool = True) ndarray[tuple[Any, ...], dtype[int64]][source]
Aggregate edge-level survival counts into a network-block matrix.
For an upper-triangular set of significant edges (
p <= alpha), count how many fall within each unordered pair of network labels(i, j). The diagonal counts within-network edges.- Parameters:
p_full (ndarray of shape (N, N)) – Edge-wise p-value matrix (typically symmetric, with diagonal=1).
net_labels (ndarray of shape (N,)) – Integer network assignment per node,
0..K-1. The Yeo-7 partition stored inabide_prepared.npz['net_labels']is the canonical use case.alpha (float, default
0.05) – Significance threshold; edges withp <= alphaare counted.return_upper (bool, default
True) – IfTrue, the returned matrix is upper-triangular (lower triangle zero-filled). IfFalse, the matrix is symmetric.
- Returns:
M – Block-mass matrix where
M[i, j]is the number of significant edges with one endpoint in networkiand the other inj.- Return type:
ndarray of shape (K, K), int
Examples
>>> import numpy as np >>> from conninfpy import block_mass >>> p = np.array([[1, 0.01, 0.5], [0.01, 1, 0.6], [0.5, 0.6, 1]]) >>> labels = np.array([0, 0, 1]) >>> block_mass(p, labels, alpha=0.05) array([[1, 0], [0, 0]])