gatac.tl.gsea_motif_enrichment

gatac.tl.gsea_motif_enrichment#

gatac.tl.gsea_motif_enrichment(adata, rankings, logfc_col='log2fc', *, motif_key='motif_match', permutation_num=1000, min_size=15, max_size=2000, seed=42, threads=1, backend='gpu', gs_batch_size=4)#

Run preranked GSEA to identify enriched TF motifs from a LogFC-ranked peak list.

Unlike Fisher/hypergeometric motif enrichment, GSEA does not require a hard significance threshold to define “marker peaks”. Instead it ranks all peaks by log2 fold change and asks whether motif-containing peaks cluster at the top (or bottom) of that ranking. This provides statistical power even in shallow ATAC-seq data where no individual peak may reach significance.

Motif gene sets are built from the binary peak×motif matrix stored at adata.varm[motif_key] (populated by gatac.tl.scan_motifs()).

Parameters:
adata AnnData

Peak-level AnnData object with varm[motif_key] (sparse bool matrix of shape n_peaks × n_motifs) and uns["motif_name"] (list of motif names). Populated by ga.tl.scan_motifs.

rankings pd.DataFrame or dict[str, pd.DataFrame]

Per-peak ranking table(s).

  • Single DataFrame – index must be peak names matching adata.var_names; logfc_col specifies the log2FC column. Returns a single pandas.DataFrame.

  • Dict of DataFrames – keys are group labels; each value is a DataFrame as above. Returns dict[str, pandas.DataFrame].

logfc_col str, default :py:class:``”log2fc”:py:class:``

Name of the column in each DataFrame that contains log2 fold change values. Peaks are ranked descending by this column before GSEA.

motif_key str, default :py:class:``”motif_match”:py:class:``

Key in adata.varm that stores the peak×motif binary matrix.

permutation_num int, default 1000

Number of GSEA permutations. Increase for more precise FDR estimates.

min_size int, default 15

Minimum number of ranked peaks a motif gene set must contain (after intersection with the ranked list) to be tested. Smaller sets yield noisy NES estimates.

max_size int, default 2000

Maximum motif gene set size.

seed int, default 42

Random seed for permutation reproducibility.

threads int, default 1

Number of threads passed to gseapy.prerank (only used when backend="gseapy").

backend {"gpu", "gseapy"}, default "gpu"

Which backend to use for the enrichment score computation.

  • "gpu" – CuPy-based GPU implementation. Much faster for large numbers of motifs (10-50× speedup). Requires a CUDA-capable GPU.

  • "gseapy" – Delegates to gseapy.prerank (Rust backend). No GPU required.

gs_batch_size int, default 4

Number of gene sets processed simultaneously on the GPU per kernel call (GPU backend only). Larger values increase throughput at the cost of more VRAM. Reduce if you encounter out-of-memory errors.

Returns:

pandas.DataFrame or dict[str, pandas.DataFrame] GSEA results with columns:

  • motif – motif name

  • NES – normalised enrichment score (positive = enriched at the top / high-logFC end)

  • pval – nominal p-value

  • fdr – FDR q-value (Benjamini–Hochberg)

  • lead_edge_n – number of peaks in the leading edge

  • set_size – number of ranked peaks containing the motif

  • lead_edge_fraclead_edge_n / set_size; fraction of the motif set in the leading edge (0–1)

Sorted descending by NES. Returns a single DataFrame when rankings is a single DataFrame, or a dict when it is a dict.

Raises:
  • ImportError – If gseapy is not installed (when backend="gseapy").

  • KeyError – If motif_key is not found in adata.varm, or logfc_col is not found in a rankings DataFrame.

Return type:

DataFrame | dict[str, DataFrame]

Examples

>>> import gatac as ga
>>> import pandas as pd
>>> # rankings is a DataFrame with peaks as index and a 'log2fc' column
>>> ranked = pd.DataFrame({"log2fc": logfc_values}, index=peak_names)
>>> result = ga.tl.gsea_motif_enrichment(peak_adata, ranked)
>>> # Multiple groups at once
>>> group_rankings = {
...     "CD4_Memory": pd.DataFrame({"log2fc": logfc_cd4}, index=peak_names),
...     "NK":         pd.DataFrame({"log2fc": logfc_nk},  index=peak_names),
... }
>>> results = ga.tl.gsea_motif_enrichment(peak_adata, group_rankings)
>>> results["NK"].head()