conninfpy.synth_datasets

Synthetic connectivity datasets for benchmarking and testing.

conninfpy.synth_datasets.generate_fc_matrices(N, effect_size, mask=None, n_samples_group1=50, n_samples_group2=50, repeated_measures=False, seed=None)[source]

Generate synthetic functional connectivity correlation matrices for groupwise comparisons or repeated measures.

Parameters:

N (int) – Number of ROIs (regions of interest), i.e. an N x N matrix.
effect_size (float) – Magnitude of correlation difference between groups.
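mask (np.ndarray, optional) – Mask of shape (N, N) marking the edges that receive the correlation difference. Entries may be signed: the example below uses both 1 and -1, which presumably sets the direction of the difference.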
n_samples_group1 (int) – Number of matrices in group 1 (default: 50).
n_samples_group2 (int) – Number of matrices in group 2 (default: 50).
repeated_measures (bool) – If True, generate within-subject repeated-measures (paired) data, otherwise independent groups (default: False).
seed (int, optional) – Random seed for reproducibility.

Returns:

group1 (np.ndarray) – Array of FC matrices for group 1, shape (n_samples_group1, N, N).
group2 (np.ndarray) – Array of FC matrices for group 2, shape (n_samples_group2, N, N).
(base_cov, mod_cov) (tuple of np.ndarray) – The base covariance matrix and the modified covariance matrix with effect_size applied, used to generate the group 1 and group 2 matrices respectively.

Examples

>>> import numpy as np
>>> from conninfpy.synth_datasets import generate_fc_matrices
>>> N = 6; e = 0.2; mask = np.zeros((N, N))
>>> mask[0:2, 0:2] = 1; mask[2:4, 2:4] = -1
>>> g1, g2, (c1, c2) = generate_fc_matrices(N, e, mask, 5, 10, seed=0)
>>> g1.shape == (5, 6, 6)
True
>>> g2.shape == (10, 6, 6)
True
>>> np.allclose(c1, c1.T)
True
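
The relationship between base_cov and mod_cov is not spelled out above. One plausible reading, shown here as a NumPy-only sketch and not as the library's implementation, is that the modified matrix adds effect_size times the symmetrised mask to the off-diagonal entries. The helper name apply_masked_effect and the identity base matrix are hypothetical.

```python
import numpy as np

def apply_masked_effect(base, mask, effect_size):
    """Hypothetical helper: shift masked off-diagonal entries by effect_size."""
    # Symmetrise the mask so the result stays a symmetric matrix.
    sym_mask = (mask + mask.T) / 2.0
    # Leave the diagonal (the variances) untouched.
    np.fill_diagonal(sym_mask, 0.0)
    return base + effect_size * sym_mask

N, e = 6, 0.2
mask = np.zeros((N, N))
mask[0:2, 0:2] = 1     # raise the correlation in the first block
mask[2:4, 2:4] = -1    # lower the correlation in the second block

base = np.eye(N)       # toy base matrix, an assumption for illustration
mod = apply_masked_effect(base, mask, e)

# A covariance matrix must stay positive definite after the shift.
min_eig = np.linalg.eigvalsh(mod).min()
```

A large effect_size on a dense mask can push the smallest eigenvalue below zero, so a real generator would need a check or a projection step at this point.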

conninfpy.synth_datasets.generate_multisite_glm_dataset(n_subjects: int = 30, N: int = 100, n_sites: int = 3, effect_size: float = 0.0, site_shift_sigma: float = 0.2, corr_site_interest: float = 0.0, n_signal_edges: int = 0, base_corr_sparsity: float = 0.9, seed: int | None = None) → dict[source]

Multi-site GLM connectivity dataset, sized to mimic a Schaefer-100 study.

Built for the v2.1 full-pipeline calibration tests (tests/test_full_pipeline.py) and for the matrix-level phase of the SC-prior validation work in Projects/NetworkStatistics/_wiki/pseudo_real_validation.md.

Output is in Fisher-z units already (so callers should pass fisher_z=False to analyze()). Site effects are additive on Fisher-z; signal is linear in the regressor of interest at a fixed set of signal edges.
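
To make "additive on Fisher-z" concrete, the sketch below transforms a toy correlation matrix with arctanh, adds a per-site offset, and maps back with tanh. It assumes one scalar offset per site applied to every edge; the actual generator may draw a separate offset for each edge. All variable names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_sites, site_shift_sigma = 4, 3, 0.2   # illustrative values

# A toy baseline correlation matrix (symmetric, unit diagonal).
r = np.full((N, N), 0.3)
np.fill_diagonal(r, 1.0)

# Fisher z-transform of the off-diagonal entries. The diagonal is set to
# zero because arctanh(1) is infinite.
off_diag = ~np.eye(N, dtype=bool)
z = np.zeros((N, N))
z[off_diag] = np.arctanh(r[off_diag])

# Assumption: one scalar offset per site, drawn once from N(0, sigma^2)
# and added to every off-diagonal edge.
site_offsets = rng.normal(0.0, site_shift_sigma, size=n_sites)
z_by_site = z[None, :, :] + site_offsets[:, None, None] * off_diag

# Mapping back with tanh always yields values strictly inside (-1, 1).
r_by_site = np.tanh(z_by_site)
```

Working on the Fisher-z scale is what keeps the shifted values valid: adding the same offset directly to correlations could push them outside [-1, 1].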

Parameters:

n_subjects (int, default 30) – Total subjects, evenly distributed across n_sites.
N (int, default 100) – Number of ROIs. Default 100 mimics Schaefer-100.
n_sites (int, default 3) – Number of distinct sites; each contributes a fixed symmetric Fisher-z offset on all edges.
effect_size (float, default 0.0) – Slope of the linear-in-interest perturbation at signal edges. Set to 0.0 for an H₀ dataset (no group / regressor effect).
site_shift_sigma (float, default 0.2) – Scale of the per-site additive Fisher-z offset. Each site’s offset is drawn from N(0, σ_site²) once per simulation.
corr_site_interest (float in [0, 1], default 0.0) – Population correlation between the site label (as an integer code) and the interest regressor. 0.0 = independent (the H₀-friendly regime); 0.6 = the regime where unrestricted permutation after harmonisation inflates the Type-I error rate.
n_signal_edges (int, default 0) – Number of edges carrying the planted signal. Ignored when effect_size == 0. Edges are drawn uniformly from the upper triangle (excluding diagonal).
base_corr_sparsity (float, default 0.9) – Value passed as the alpha argument of make_sparse_spd_matrix, which controls the sparsity of the baseline connectivity (higher → sparser).
seed (int, optional) – Reproducibility handle.

Returns:

dict with keys:

"Y" – (n_subjects, N, N) Fisher-z connectivity tensor, symmetric with zero diagonal.
"interest" – (n_subjects,) regressor of interest.
"sites" – (n_subjects,) integer site labels.
"signal_mask" – (N, N) boolean upper-triangle mask of true positive edges (all False when effect_size == 0 or n_signal_edges == 0).

Return type:

dict
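
As an illustration of how a signal_mask with these properties can be built, the sketch below draws edges uniformly without replacement from the strict upper triangle. The function draw_signal_mask is a hypothetical stand-in, not part of conninfpy.

```python
import numpy as np

def draw_signal_mask(N, n_signal_edges, seed=None):
    """Hypothetical helper: boolean upper-triangle mask of random edges."""
    rng = np.random.default_rng(seed)
    # Row and column indices of the strict upper triangle (k=1 skips the diagonal).
    rows, cols = np.triu_indices(N, k=1)
    # Pick distinct edges uniformly at random.
    chosen = rng.choice(rows.size, size=n_signal_edges, replace=False)
    mask = np.zeros((N, N), dtype=bool)
    mask[rows[chosen], cols[chosen]] = True
    return mask

signal_mask = draw_signal_mask(N=20, n_signal_edges=15, seed=0)
empty_mask = draw_signal_mask(N=20, n_signal_edges=0, seed=0)
```

Because only the upper triangle is marked, a caller that needs a symmetric mask can use signal_mask | signal_mask.T.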

Notes

Designed to exercise analyze(Y, interest=..., sites=...) end to end: a single call triggers the auto-preserve, auto-strata, ComBat-with-preserve, and Freedman-Lane with within-block exchangeability path. Calibration tests use effect_size=0.0 (H₀); strata-vs-no-strata sanity tests use corr_site_interest > 0 to expose the Type-I error inflation.
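
Within-block exchangeability means that values are shuffled only among subjects from the same site. The sketch below shows that restriction on a toy regressor. It is not the library's Freedman-Lane implementation, which would permute residuals from the nuisance-only model; the helper permute_within_blocks is hypothetical.

```python
import numpy as np

def permute_within_blocks(values, blocks, rng):
    """Hypothetical helper: shuffle values separately inside each block."""
    permuted = values.copy()
    for b in np.unique(blocks):
        idx = np.flatnonzero(blocks == b)
        # Values stay inside their own block; only their order changes.
        permuted[idx] = values[rng.permutation(idx)]
    return permuted

rng = np.random.default_rng(0)
sites = np.repeat([0, 1, 2], 8)                       # 24 subjects, 3 sites
interest = rng.normal(size=sites.size) + 0.5 * sites  # regressor correlated with site
shuffled = permute_within_blocks(interest, sites, rng)

# The per-site means are unchanged, so the site/interest association
# is the same in every permutation.
means_before = [interest[sites == s].mean() for s in range(3)]
means_after = [shuffled[sites == s].mean() for s in range(3)]
```

An unrestricted shuffle would break the link between site and regressor in the permuted data but not in the observed data, which is the mismatch that shows up when corr_site_interest > 0.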

Examples

>>> from conninfpy.synth_datasets import generate_multisite_glm_dataset
>>> data = generate_multisite_glm_dataset(
...     n_subjects=24, N=20, n_sites=3,
...     site_shift_sigma=0.3, corr_site_interest=0.6, seed=0,
... )
>>> data["Y"].shape
(24, 20, 20)
>>> data["sites"].shape, data["interest"].shape
((24,), (24,))

class conninfpy.synth_datasets.ModularDatasetGenerator(N: int, n_modules: int = 5, intra_corr: float = 0.6, inter_corr: float = 0.1, noise_level: float = 0.05, seed: int = None)[source]

Bases: object

A generator for synthetic functional connectivity data with a modular (block) structure.

This class simulates brain connectivity matrices where nodes are organized into distinct functional modules (e.g., Visual, DMN, Motor). It allows for the injection of specific topological effects (within-module or between-module changes) to simulate pathological conditions.

get_mask_within_module(module_idx: int) → ndarray[source]

Returns a binary mask for all edges within a specific module.

get_mask_between_modules(module_idx_A: int, module_idx_B: int) → ndarray[source]

Returns a binary mask for all edges connecting module A and module B.
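
The two kinds of mask can be pictured with a small NumPy sketch. It assumes contiguous, equal-sized modules and excludes the diagonal from the within-module mask; how the class actually assigns nodes and treats the diagonal is not stated here. The functions mask_within and mask_between are hypothetical stand-ins for the methods above.

```python
import numpy as np

N, n_modules = 12, 3
# Assumption: nodes are assigned to contiguous modules of equal size.
labels = np.arange(N) * n_modules // N   # [0 0 0 0 1 1 1 1 2 2 2 2]

def mask_within(labels, m):
    """Edges whose two endpoints both lie in module m (diagonal excluded)."""
    in_m = labels == m
    mask = np.outer(in_m, in_m)
    np.fill_diagonal(mask, False)
    return mask.astype(int)

def mask_between(labels, a, b):
    """Edges with one endpoint in module a and the other in module b."""
    in_a, in_b = labels == a, labels == b
    # Include both orientations so the mask is symmetric.
    mask = np.outer(in_a, in_b) | np.outer(in_b, in_a)
    return mask.astype(int)

within_0 = mask_within(labels, 0)
between_01 = mask_between(labels, 0, 1)
```

Either mask could then be passed as effect_mask to generate_data, optionally multiplied by -1 to reverse the direction of the effect.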

generate_data(effect_mask: ndarray, effect_size: float, n_samples_g1: int = 50, n_samples_g2: int = 50, time_points: int = 200)[source]

Generates sample correlation matrices for two groups.

Group 1 is sampled from the base modular covariance. Group 2 is sampled from a modified covariance (base + effect).

Parameters:

effect_mask (np.ndarray) – Effect mask matrix of shape (N, N).
- 0 means “no effect” for that edge.
- Non-zero values scale the effect magnitude per edge.
- The sign of the value controls effect direction (positive/negative).
The matrix is treated as undirected: it is symmetrized internally and its diagonal is set to 0.
effect_size (float) – Magnitude of the effect (Cohen’s d-like shift in correlation). Positive values increase correlation, negative values decrease it.
n_samples_g1 (int) – Number of subjects in Group 1 (Control).
n_samples_g2 (int) – Number of subjects in Group 2 (Test).
time_points (int) – Length of the simulated BOLD time-series. Higher values reduce sampling noise.

Returns:

g1_data (np.ndarray (n_samples_g1, N, N)) – Correlation matrices for Group 1.
g2_data (np.ndarray (n_samples_g2, N, N)) – Correlation matrices for Group 2.
labels (np.ndarray (N,)) – Vector of module assignments for nodes.
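
The role of time_points can be illustrated by sampling Gaussian time series from a modular covariance and computing their sample correlations. This is a sketch under assumptions: the block covariance uses the default intra_corr and inter_corr values, noise_level is ignored, and sample_corr_matrices is a hypothetical helper rather than the class's own code.

```python
import numpy as np

def sample_corr_matrices(cov, n_samples, time_points, rng):
    """Hypothetical helper: one sample correlation matrix per simulated subject."""
    N = cov.shape[0]
    out = np.empty((n_samples, N, N))
    for i in range(n_samples):
        # Simulated time series: time_points rows, N columns.
        ts = rng.multivariate_normal(np.zeros(N), cov, size=time_points)
        out[i] = np.corrcoef(ts, rowvar=False)
    return out

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 4)          # 3 modules of 4 nodes each
same = labels[:, None] == labels[None, :]
base = np.where(same, 0.6, 0.1)              # intra_corr / inter_corr defaults
np.fill_diagonal(base, 1.0)

short = sample_corr_matrices(base, n_samples=20, time_points=50, rng=rng)
long = sample_corr_matrices(base, n_samples=20, time_points=800, rng=rng)

# Mean absolute deviation from the population correlations.
err_short = np.abs(short - base).mean()
err_long = np.abs(long - base).mean()
```

In this sketch err_long comes out smaller than err_short, which matches the statement that higher time_points values reduce sampling noise.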