gatac.tl.gsea_motif_enrichment#
- gatac.tl.gsea_motif_enrichment(adata, rankings, logfc_col='log2fc', *, motif_key='motif_match', permutation_num=1000, min_size=15, max_size=2000, seed=42, threads=1, backend='gpu', gs_batch_size=4)#
Run preranked GSEA to identify enriched TF motifs from a LogFC-ranked peak list.
Unlike Fisher/hypergeometric motif enrichment, GSEA does not require a hard significance threshold to define “marker peaks”. Instead it ranks all peaks by log2 fold change and asks whether motif-containing peaks cluster at the top (or bottom) of that ranking. This provides statistical power even in shallow ATAC-seq data where no individual peak may reach significance.
Motif gene sets are built from the binary peak×motif matrix stored at
adata.varm[motif_key](populated bygatac.tl.scan_motifs()).- Parameters:
- adata
AnnData Peak-level AnnData object with
varm[motif_key](sparse bool matrix of shape n_peaks × n_motifs) anduns["motif_name"](list of motif names). Populated byga.tl.scan_motifs.- rankings
pd.DataFrameordict[str,pd.DataFrame] Per-peak ranking table(s).
Single DataFrame – index must be peak names matching
adata.var_names;logfc_colspecifies the log2FC column. Returns a singlepandas.DataFrame.Dict of DataFrames – keys are group labels; each value is a DataFrame as above. Returns
dict[str, pandas.DataFrame].
- logfc_col
str, default :py:class:``”log2fc”:py:class:`` Name of the column in each DataFrame that contains log2 fold change values. Peaks are ranked descending by this column before GSEA.
- motif_key
str, default :py:class:``”motif_match”:py:class:`` Key in
adata.varmthat stores the peak×motif binary matrix.- permutation_num
int, default1000 Number of GSEA permutations. Increase for more precise FDR estimates.
- min_size
int, default15 Minimum number of ranked peaks a motif gene set must contain (after intersection with the ranked list) to be tested. Smaller sets yield noisy NES estimates.
- max_size
int, default2000 Maximum motif gene set size.
- seed
int, default42 Random seed for permutation reproducibility.
- threads
int, default1 Number of threads passed to
gseapy.prerank(only used whenbackend="gseapy").- backend
{"gpu", "gseapy"}, default"gpu" Which backend to use for the enrichment score computation.
"gpu"– CuPy-based GPU implementation. Much faster for large numbers of motifs (10-50× speedup). Requires a CUDA-capable GPU."gseapy"– Delegates togseapy.prerank(Rust backend). No GPU required.
- gs_batch_size
int, default4 Number of gene sets processed simultaneously on the GPU per kernel call (GPU backend only). Larger values increase throughput at the cost of more VRAM. Reduce if you encounter out-of-memory errors.
- adata
- Returns:
pandas.DataFrame or dict[str, pandas.DataFrame] GSEA results with columns:
motif– motif nameNES– normalised enrichment score (positive = enriched at the top / high-logFC end)pval– nominal p-valuefdr– FDR q-value (Benjamini–Hochberg)lead_edge_n– number of peaks in the leading edgeset_size– number of ranked peaks containing the motiflead_edge_frac–lead_edge_n / set_size; fraction of the motif set in the leading edge (0–1)
Sorted descending by NES. Returns a single DataFrame when rankings is a single DataFrame, or a dict when it is a dict.
- Raises:
ImportError – If
gseapyis not installed (whenbackend="gseapy").KeyError – If
motif_keyis not found inadata.varm, orlogfc_colis not found in a rankings DataFrame.
- Return type:
Examples
>>> import gatac as ga >>> import pandas as pd
>>> # rankings is a DataFrame with peaks as index and a 'log2fc' column >>> ranked = pd.DataFrame({"log2fc": logfc_values}, index=peak_names) >>> result = ga.tl.gsea_motif_enrichment(peak_adata, ranked)
>>> # Multiple groups at once >>> group_rankings = { ... "CD4_Memory": pd.DataFrame({"log2fc": logfc_cd4}, index=peak_names), ... "NK": pd.DataFrame({"log2fc": logfc_nk}, index=peak_names), ... } >>> results = ga.tl.gsea_motif_enrichment(peak_adata, group_rankings) >>> results["NK"].head()