gatac.tl.marker_peaks

gatac.tl.marker_peaks(adata, groupby, groups=None, reference='rest', max_cells=500, min_pct=0.05, min_log2_fc=1.0, use_raw=False, key_added='marker_peaks', seed=42)

GPU-accelerated marker peak detection using a binomial test.

Identifies differentially accessible peaks between cell groups by applying the binomial test to binarized accessibility data. This approach follows ArchR's getMarkerFeatures with testMethod="binomial".

For each group, the function compares the proportion of cells in which a peak is accessible in the foreground group against the same proportion in the background (all other cells or a specific reference group).

Parameters:
adata AnnData

Annotated data matrix with cells × peaks.

groupby str

Column name in adata.obs containing group labels.

groups str or list of str, optional

Groups to test. If None, test all groups in groupby.

reference str

Reference group for comparison:
- "rest": compare each group against all other cells (default)
- group name: compare each group against the specified group

max_cells int

Maximum number of cells to sample from each group (foreground and background). This controls test sensitivity: larger values detect smaller differences but may flag many biologically irrelevant peaks. Default: 500 (following ArchR).

min_pct float

Minimum fraction of cells (0-1) in which the peak must be accessible, in either the foreground or the background group, for the peak to be included in the results. Default: 0.05

min_log2_fc float

Minimum absolute log2 fold change threshold for results. Default: 1.0

use_raw bool

If True, use adata.raw.X. Default: False

key_added str

Key to store results in adata.uns. Default: "marker_peaks"

seed int

Random seed for cell subsampling reproducibility. Default: 42

Returns:

dict[str, pd.DataFrame]

Dictionary mapping group names to DataFrames with columns:
- "feature": Peak/feature name
- "log2_fc": Log2 fold change (foreground vs background)
- "mean_fg": Mean accessibility in foreground (proportion of cells)
- "mean_bg": Mean accessibility in background
- "mean_diff": mean_fg - mean_bg
- "p_value": Raw p-value from two-sided binomial test
- "fdr": Benjamini-Hochberg adjusted p-value

Return type:

dict[str, DataFrame]
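
The "fdr" column is described as a Benjamini-Hochberg adjustment of "p_value". As a rough illustration of what that adjustment does, here is a minimal pure-Python sketch. The helper name benjamini_hochberg is hypothetical and is not part of gatac; which set of peaks the library adjusts over (all tested peaks or only those passing the filters) is not stated here, so treat this only as the textbook procedure.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Hypothetical helper for illustration; not gatac's implementation.
    """
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(fdr)
```

Each raw p-value is scaled by m / rank and then capped by the adjusted value of the next larger p-value, so the adjusted values never decrease as the raw p-values increase.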

Notes

The binomial test compares the observed count of accessible cells in the foreground to what would be expected under the null hypothesis that the foreground has the same accessibility rate as the background.
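
To make the null hypothesis concrete, the following is a minimal CPU sketch of an exact two-sided binomial test for a single peak, written with the standard library only. The function names are hypothetical, and the two-sided convention used (summing the probability of every outcome no more likely than the observed one) is one common choice; gatac's GPU implementation may differ in detail. On the CPU, scipy.stats.binomtest provides a comparable test.

```python
import math


def binom_log_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log1p(-p)


def two_sided_binomial_p(fg_accessible, n_fg, bg_rate):
    """P-value for seeing fg_accessible of n_fg cells accessible when the
    true rate equals bg_rate (the background accessibility rate)."""
    observed = binom_log_pmf(fg_accessible, n_fg, bg_rate)
    # Sum the probability of every outcome no more likely than the observed one.
    total = sum(
        math.exp(lp)
        for lp in (binom_log_pmf(k, n_fg, bg_rate) for k in range(n_fg + 1))
        if lp <= observed + 1e-9
    )
    return min(1.0, total)


# Illustrative numbers: 500 foreground cells, background rate 10%.
p_same = two_sided_binomial_p(50, 500, 0.10)       # foreground rate matches background
p_enriched = two_sided_binomial_p(120, 500, 0.10)  # foreground rate is 24%
print(p_same, p_enriched)
```

When the foreground count sits at the value expected under the background rate, the p-value is close to 1; a foreground rate far above the background rate gives a very small p-value.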

Important: The test is highly sensitive to sample size. With many cells, even tiny differences become statistically significant. The max_cells parameter (default 500, following ArchR) subsamples both foreground and background to control sensitivity. Adjust min_log2_fc and min_pct to focus on biologically meaningful differences.
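
The subsampling step can be pictured as capping each group at max_cells cells with a seeded random draw. The sketch below is an assumption about the general idea, not gatac's actual sampling code; subsample_indices is a hypothetical helper and the library may use a different random number generator.

```python
import random


def subsample_indices(indices, max_cells=500, seed=42):
    """Return at most max_cells indices, drawn without replacement.

    Hypothetical helper for illustration; not part of gatac.
    """
    indices = list(indices)
    if len(indices) <= max_cells:
        # Small groups are used in full.
        return indices
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    return sorted(rng.sample(indices, max_cells))


# A group of 2000 cells is reduced to 500; a group of 10 is left unchanged.
picked = subsample_indices(range(2000), max_cells=500, seed=42)
print(len(picked), len(subsample_indices(range(10))))
```

Capping both groups at the same size keeps the statistical power of the test roughly comparable across clusters of very different sizes.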

A two-sided test is performed (following ArchR):
- If fg_rate >= bg_rate: tests for enrichment
- If fg_rate < bg_rate: tests for depletion

Results are sorted by FDR (ascending) then by absolute log2_fc (descending).
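
The same ordering can be reproduced on any table with "fdr" and "log2_fc" columns, for example after concatenating results from several groups. This is a small pandas sketch on made-up values; the toy DataFrame and the temporary abs_log2_fc column are illustrative only and do not come from gatac.

```python
import pandas as pd

# Made-up values that mimic a few columns of a result table.
toy = pd.DataFrame(
    {
        "feature": ["peak_a", "peak_b", "peak_c", "peak_d"],
        "log2_fc": [1.2, -2.5, 3.0, 1.1],
        "fdr": [0.01, 0.01, 0.20, 0.001],
    }
)

ordered = (
    toy.assign(abs_log2_fc=toy["log2_fc"].abs())  # temporary sort key
    .sort_values(["fdr", "abs_log2_fc"], ascending=[True, False])
    .drop(columns="abs_log2_fc")
    .reset_index(drop=True)
)
print(ordered["feature"].tolist())
```

Ties in FDR (peak_a and peak_b here) are broken by the larger absolute fold change, so strongly depleted peaks rank alongside strongly enriched ones.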

Examples

>>> import gatac as ga
>>> # Find marker peaks for all clusters
>>> results = ga.tl.marker_peaks(adata, groupby="cluster")
>>>
>>> # Get top markers for a specific cluster
>>> cd8_markers = results["CD8_T"].query("fdr < 0.05").head(100)
>>>
>>> # For more sensitive detection (more hits), increase max_cells and lower min_log2_fc:
>>> results = ga.tl.marker_peaks(adata, groupby="cluster", max_cells=1000, min_log2_fc=0.5)