Motif Enrichment#

Setup#

Download a MEME motif database file containing human transcription factor binding motifs from the Cis-BP resource, alongside the human genome in fasta format.

%%bash
[ -f data/cisBP_human.meme ] || wget -O data/cisBP_human.meme https://osf.io/download/uk6vn
[ -f data/GRCh38.fa ] || wget -O data/GRCh38.fa.gz https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh38.p13.genome.fa.gz
[ -f data/GRCh38.fa ] || gunzip -c data/GRCh38.fa.gz > data/GRCh38.fa
import gatac as ga
import scanpy as sc
import anndata as ad
import matplotlib.pyplot as plt

sc.set_figure_params(dpi=100, frameon=False)

GENOME_FASTA = "data/GRCh38.fa"
MOTIF_FILE   = "data/cisBP_human.meme"

peak_adata = ad.read_h5ad("data/pbmc_peaks.h5ad")
print(peak_adata)
AnnData object with n_obs × n_vars = 5001 × 178350
    obs: 'barcode', 'n_unique', 'tsse_score', 'duplicate_fraction', 'mito_fraction', 'leiden'

1. Load motifs#

motifs = ga.tl.read_motifs(MOTIF_FILE, unique=True)
print(f"Loaded {len(motifs)} motifs")

# Inspect a single motif
m = motifs[0]
print(f"\nExample motif:")
print(f"  ID:     {m.id}")
print(f"  Name:   {m.name}")
print(f"  Family: {m.family}")
print(f"  Length: {m.pwm.shape[0]} bp")
print(f"  Info:   {m.info_content():.2f} bits")
Loaded 1165 motifs

Example motif:
  ID:     AC023509.3+M02872_2.00
  Name:   AC023509.3
  Family: None
  Length: 14 bp
  Info:   17.31 bits
peaks = ga.tl.marker_peaks(adata=peak_adata, groupby="leiden")

markers = {}
for leiden, df in peaks.items():
    markers[leiden] = df.loc[df.log2_fc > 0][:1000].feature.values.tolist()
15:30:58 - gatac.tl.markers - INFO - Running marker peak detection for groups in 'leiden'
15:30:58 - gatac.tl.markers - INFO - Testing 13 groups, 178,350 features, 5,001 cells
15:30:58 - gatac.tl.markers - INFO - Using max_cells=500 per group for balanced comparison
15:31:00 - gatac.tl.markers - INFO - Marker peak detection complete. Results stored in adata.uns['marker_peaks']

2. Run motif enrichment#

import time
import numpy as np

t0 = time.perf_counter()
enrichment_results = ga.tl.motif_enrichment(
    motifs,
    regions=markers,
    genome_fasta=GENOME_FASTA,
)
print(f"Motif enrichment in {time.perf_counter() - t0:.1f}s")
15:31:06 - gatac.tl.motif - INFO - Fetching 9823 sequences...
15:31:09 - gatac.tl.motif - INFO - Encoding sequences for GPU...
15:31:09 - gatac.tl.motif - INFO - Precomputing score thresholds for all motifs...
15:31:14 - gatac.tl.motif - INFO - Scanning 1165 motifs across 9823 regions...
15:31:29 - gatac.tl.motif - INFO - Motif enrichment analysis complete
Motif enrichment in 22.5s
enrichment_results["0"].sort_values("adjusted p-value").head()
id name family log2(fold change) p-value adjusted p-value
488 NFE2L2+M08789_2.00 NFE2L2 None 1.190888 0.000020 0.015983
60 CEBPA+M08076_2.00 CEBPA None 1.336208 0.000027 0.015983
712 SPIB+M03008_2.00 SPIB None 0.695308 0.000051 0.019878
31 BACH1+M09488_2.00 BACH1 None 1.058600 0.000084 0.024537
713 SPIC+M03005_2.00 SPIC None 0.707267 0.000262 0.060981

3. Visualize top enriched motifs#

# Show top 20 enriched motifs for a chosen cluster
cluster_id = "1"
df = enrichment_results[cluster_id]

top20 = df.nsmallest(20, "adjusted p-value")

fig, ax = plt.subplots(figsize=(6, 6))
sc = ax.scatter(
    -np.log10(top20["adjusted p-value"] + 1e-300),
    range(len(top20)),
    c=top20["log2(fold change)"],
    cmap="YlOrRd",
    s=60,
)
plt.colorbar(sc, ax=ax, label="log2(fold change)")
ax.set_yticks(range(len(top20)))
ax.set_yticklabels(top20["name"])
ax.set_xlabel("-log10(adjusted p-value)")
ax.set_title(f"Top 20 enriched TF motifs — Cluster {cluster_id}")
plt.tight_layout()
plt.show()

Next steps#

  • 04 – chromVAR: per-cell TF activity scores

  • 05 – GSEA: preranked GSEA on motif gene sets