conninfpy.harmonize

Multi-site harmonization for connectivity data, with design-matrix diagnostics.

Implements parametric empirical-Bayes ComBat (Johnson, Li & Rabinovic 2007; Fortin et al. 2017/2018) directly in NumPy — no dependency on neuroHarmonize or neurocombat.

ComBat model per feature \(j\):

\[Y_{s,i,j} = \alpha_j + X_{s,i}\,\beta_j + \gamma_{s,j} + \delta_{s,j}\,\varepsilon_{s,i,j}, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_j^2),\]

where \(s\) indexes site, \(i\) indexes subject within site, and \(X_{s,i}\) holds covariates to preserve (age, sex, diagnosis). Estimation uses an empirical-Bayes shrinkage of per-(site, feature) location \(\gamma\) and scale \(\delta^2\) toward site-level priors, iterated to convergence.

Use this when your cohort pools subjects from multiple scanners / sites and you want to remove site-aligned variance that is orthogonal to the biological covariates you care about.

The combat_fit() / combat_apply() separation supports cross-site machine-learning transfer (fit ComBat on training cohorts, apply to held-out sites at test time).

The compute_vif() and design_diagnostics() helpers are included here because confound regression (in glm_stats) and harmonization both lean on a well-conditioned design matrix.

References

Johnson, W. E., Li, C., & Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118–127. doi:10.1093/biostatistics/kxj037.

Fortin, J.-P. et al. (2017). Harmonization of multi-site diffusion tensor imaging data. NeuroImage 161:149–170.

Fortin, J.-P. et al. (2018). Harmonization of cortical thickness measurements across scanners and sites. NeuroImage 167:104–120.

Multi-site harmonization for connectivity data.

Implements parametric empirical-Bayes ComBat (Johnson, Li & Rabinovic 2007; Fortin et al. 2017) directly in numpy — no dependency on neuroHarmonize or neurocombat. Plus design-matrix diagnostics (VIF, condition number).

ComBat model per feature j:

Y_ij_s = α_j + X_i β_j + γ_sj + δ_sj · ε_ij_s, ε ~ N(0, σ_j²)

where s indexes site, i indexes subject within site, and X_i holds covariates to preserve (age, sex, diagnosis). The parameterization follows Fortin 2017 / neuroCombat: site dummies are encoded one-hot (no reference site dropped); α is the sample-size-weighted grand mean of per-site intercepts; σ² uses the biased var.pooled denominator (divide by n, not n − p). This matches the canonical neuroCombat reference implementation to machine precision under EB and without — validated in tests/test_combat_equivalence.py.

Estimation:

Fit OLS with site dummies + preserved covariates to obtain per-site intercepts, β̂, and pooled σ̂_j.
Standardize residuals per feature.
Estimate per-(site, feature) location γ̂ and scale δ̂² from the standardized data.
Empirical Bayes shrinkage using site-level conjugate priors, iterated to convergence.
Adjust: subtract the shrunken site effect, re-add the preserved-covariate fit and the pooled scale.

Use this when your cohort pools subjects from multiple scanners/sites and you want to remove the site-aligned variance that is orthogonal to the biological covariates you care about. Fortin 2018 discusses the limits.

Typical use

>>> from conninfpy import combat_harmonize
>>> result = combat_harmonize(Y, sites=site_labels, preserve=covariates)
>>> Y_adj = result.Y_adjusted   # same shape as Y
>>> result.diagnostics           # ratio of between-site variance before/after

References

Johnson WE, Li C, Rabinovic A (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8(1):118-27.

Fortin J-P, Parker D, Tunç B, et al. (2017). Harmonization of multi-site diffusion tensor imaging data. NeuroImage 161:149-170.

Fortin J-P et al. (2018). Harmonization of cortical thickness measurements across scanners and sites. NeuroImage 167:104-120.

class conninfpy.harmonize.ComBatModel(site_labels: ~typing.List[~typing.Any], alpha: ~numpy.ndarray[tuple[~typing.Any, ...], ~numpy.dtype[~numpy.float64]] = <factory>, beta: ~numpy.ndarray[tuple[~typing.Any, ...], ~numpy.dtype[~numpy.float64]] | None = None, sigma: ~numpy.ndarray[tuple[~typing.Any, ...], ~numpy.dtype[~numpy.float64]] = <factory>, gamma_star: ~numpy.ndarray[tuple[~typing.Any, ...], ~numpy.dtype[~numpy.float64]] = <factory>, delta_star: ~numpy.ndarray[tuple[~typing.Any, ...], ~numpy.dtype[~numpy.float64]] = <factory>, eb: bool = True, preserve_n_cols: int = 0)[source]

Bases: object

Fitted parametric ComBat state, for later application to new data.

site_labels: List[Any]

alpha: ndarray[tuple[Any, ...], dtype[float64]]

beta: ndarray[tuple[Any, ...], dtype[float64]] | None = None

sigma: ndarray[tuple[Any, ...], dtype[float64]]

gamma_star: ndarray[tuple[Any, ...], dtype[float64]]

delta_star: ndarray[tuple[Any, ...], dtype[float64]]

eb: bool = True

preserve_n_cols: int = 0

class conninfpy.harmonize.CombatResult(Y_adjusted: ~numpy.ndarray[tuple[~typing.Any, ...], ~numpy.dtype[~numpy.float64]], model: ~conninfpy.harmonize.ComBatModel, diagnostics: ~typing.Dict[str, ~typing.Any] = <factory>)[source]

Bases: object

Return value of combat_harmonize().

Y_adjusted: ndarray[tuple[Any, ...], dtype[float64]]

model: ComBatModel

diagnostics: Dict[str, Any]

conninfpy.harmonize.combat_harmonize(Y: ndarray[tuple[Any, ...], dtype[float64]], sites: Sequence[Any], preserve: ndarray[tuple[Any, ...], dtype[float64]] | None = None, eb: bool = True, return_diagnostics: bool = True) → CombatResult[source]

Fit + transform: parametric empirical-Bayes ComBat on the full cohort.

See module docstring for the model and references. Handles either (n, N, N) connectivity matrices or (n, p) pre-flattened features.

Parameters:

Y (ndarray of shape (n, N, N) or (n, p)) – Input data.
sites (sequence of length n) – Site label per subject.
preserve (ndarray of shape (n,) or (n, k), optional) – Covariates whose variance should be preserved through ComBat. Pass the GLM nuisance design (age, sex, motion, …) here and omit the variable that downstream inference will test — that avoids the Nygaard 2016 label leak in which the harmonization fit absorbs label-aligned variance the permutation null cannot recover. See analyze(), which sets preserve = confounds automatically under Strategy D, and [[paper_combat_resolution_strategies]] for the full derivation.
eb (bool, default=True) – Apply empirical-Bayes shrinkage on site effects (recommended).
return_diagnostics (bool, default=True) – If True, compute a between-site / within-site variance ratio before and after correction, plus the explicit between_site_variance_ratio_after_over_before key used by conninfpy.analyze() for residual-site-variance flags.

Returns:

Y_adjusted (same shape as input), model, and diagnostics (between-site variance ratio before/after, after/before ratio, ratio reduction, and per-site sample sizes).

Return type:

CombatResult

conninfpy.harmonize.combat_fit(Y: ndarray[tuple[Any, ...], dtype[float64]], sites: Sequence[Any], preserve: ndarray[tuple[Any, ...], dtype[float64]] | None = None, eb: bool = True) → ComBatModel[source]

Fit parametric ComBat on a training cohort.

Parameters:

Y (ndarray of shape (n, N, N) or (n, p)) – Connectivity matrices (3D) or pre-flattened features (2D).
sites (sequence of length n) – Site label per subject. Any hashable (strings, ints).
preserve (ndarray of shape (n,) or (n, k), optional) – Covariates whose effect on Y should be preserved (e.g. diagnosis, age, sex). Their fitted coefficients survive harmonization.
eb (bool, default=True) – If True (standard ComBat), apply empirical-Bayes shrinkage on the per-(site, feature) effects. If False, use the raw method-of-moments estimates — faster but noisier for small per-site sample sizes.

Returns:

Fitted state, usable with combat_apply().

Return type:

ComBatModel

conninfpy.harmonize.combat_apply(model: ComBatModel, Y: ndarray[tuple[Any, ...], dtype[float64]], sites: Sequence[Any], preserve: ndarray[tuple[Any, ...], dtype[float64]] | None = None) → ndarray[tuple[Any, ...], dtype[float64]][source]: Apply a fitted ComBat model to new data. Sites must all be present in the training cohort; unseen sites raise ValueError.

conninfpy.harmonize.compute_vif(X: ndarray[tuple[Any, ...], dtype[float64]]) → ndarray[tuple[Any, ...], dtype[float64]][source]

Variance inflation factor per design-matrix column.

VIF_j = 1 / (1 - R²_j) where R²_j is from regressing column j on all other columns. VIF > 5 is a common concern threshold; > 10 is strong collinearity (Kutner et al. 2004).

Constant columns (e.g. intercept) get VIF = 1 by convention.

Parameters:: X (ndarray of shape (n, p)) – Design matrix.
Returns:: VIF per column.
Return type:: ndarray of shape (p,)

conninfpy.harmonize.design_diagnostics(X: ndarray[tuple[Any, ...], dtype[float64]], names: Sequence[str] | None = None) → Dict[str, Any][source]

Diagnostic report for a GLM design matrix.

Returns a dict with:

condition_number: condition number of X'X (> 30 is caution, > 100 is strong concern)
vif: VIF per column
vif_max, vif_max_col: most collinear column
correlation: pairwise Pearson correlation matrix (p, p)
flags: list of plain-English warnings when thresholds are crossed

Parameters:

X (ndarray of shape (n, p)) – Design matrix (should include intercept if used in the GLM).
names (sequence of str, optional) – Column names for reporting. Defaults to ['x0', 'x1', ...].

conninfpy.harmonize.flatten_upper(Y: ndarray[tuple[Any, ...], dtype[float64]]) → Tuple[ndarray[tuple[Any, ...], dtype[float64]], Tuple[ndarray[tuple[Any, ...], dtype[_ScalarT]], ndarray[tuple[Any, ...], dtype[_ScalarT]]], int][source]

Flatten the upper triangle (k=1) of a (n, N, N) connectivity stack.

Parameters:

Y (ndarray of shape (n_subjects, N, N)) – Symmetric connectivity matrices with zero diagonal.

Returns:

features (ndarray of shape (n_subjects, N*(N-1)/2)) – Flattened upper triangles — one feature per edge.
triu_idx (tuple of ndarray) – (rows, cols) index tuple usable as M[triu_idx] to invert.
N (int) – Side length of the original matrices.

conninfpy.harmonize.unflatten_upper(features: ndarray[tuple[Any, ...], dtype[float64]], triu_idx: Tuple[ndarray[tuple[Any, ...], dtype[_ScalarT]], ndarray[tuple[Any, ...], dtype[_ScalarT]]], N: int) → ndarray[tuple[Any, ...], dtype[float64]][source]: Reconstruct (n, N, N) symmetric matrices (zero diagonal) from flattened upper-triangle features.

conninfpy.harmonize.block_mass(p_full: ndarray[tuple[Any, ...], dtype[float64]], net_labels: ndarray[tuple[Any, ...], dtype[int64]], alpha: float = 0.05, *, return_upper: bool = True) → ndarray[tuple[Any, ...], dtype[int64]][source]

Aggregate edge-level survival counts into a network-block matrix.

For an upper-triangular set of significant edges (p <= alpha), count how many fall within each unordered pair of network labels (i, j). The diagonal counts within-network edges.

Parameters:

p_full (ndarray of shape (N, N)) – Edge-wise p-value matrix (typically symmetric, with diagonal=1).
net_labels (ndarray of shape (N,)) – Integer network assignment per node, 0..K-1. The Yeo-7 partition stored in abide_prepared.npz['net_labels'] is the canonical use case.
alpha (float, default 0.05) – Significance threshold; edges with p <= alpha are counted.
return_upper (bool, default True) – If True, the returned matrix is upper-triangular (lower triangle zero-filled). If False, the matrix is symmetric.

Returns:

M – Block-mass matrix where M[i, j] is the number of significant edges with one endpoint in network i and the other in j.

Return type:

ndarray of shape (K, K), int

Examples

>>> import numpy as np
>>> from conninfpy import block_mass
>>> p = np.array([[1, 0.01, 0.5], [0.01, 1, 0.6], [0.5, 0.6, 1]])
>>> labels = np.array([0, 0, 1])
>>> block_mass(p, labels, alpha=0.05)
array([[1, 0],
       [0, 0]])