gatac.tl.motif_enrichment

gatac.tl.motif_enrichment(motifs, regions, genome_fasta, background=None, method=None, pvalue=1e-05, check_rc=True, bg_probs=(0.25, 0.25, 0.25, 0.25), motif_batch_size=16)

Identify enriched transcription factor motifs using GPU acceleration.

This function scans genomic regions for motif matches and performs statistical enrichment testing against a background set.

Parameters:
motifs list[DNAMotif]

List of transcription factor motifs to test

regions dict[str, list[str]]

Groups of genomic regions to test. Keys are group names, values are lists of region strings in “chr:start-end” format. Each group is tested independently.

genome_fasta str or Path

Path to genome FASTA file for sequence extraction

background list[str] or dict[str, list[str]], optional

Background regions. Pass a single list to use one shared background for all groups, or a dict keyed like regions to use per-group matched backgrounds. If None, the union of all tested regions is used.

method {"binomial", "hypergeometric"}, optional

Statistical test used for enrichment. If None, defaults to "hypergeometric" when background is None (each group is tested as a subset of the union of all tested regions) and to "binomial" otherwise.

pvalue float, default 1e-5

P-value threshold used to call a motif match. Smaller values require a stronger match.

check_rc bool, default True

Whether to check both strands (forward and reverse complement)

bg_probs {"auto", "subject", "even"} or tuple, default (0.25, 0.25, 0.25, 0.25)

Background nucleotide probabilities (A, C, G, T) used when converting motifs to log-odds scores and computing match thresholds. Use "auto" to estimate base frequencies from all scanned sequences (foreground plus any provided background), "subject" as an alias for the same behavior, "even" for a uniform background, or pass a custom 4-tuple.

motif_batch_size int, default 16

Number of motifs of the same length to process together on GPU. Higher values increase GPU parallelism but use more memory.
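
The bg_probs options above can be illustrated with a small standalone sketch. It estimates (A, C, G, T) frequencies from a set of sequences, roughly what "auto" is described as doing, and converts a position probability matrix to log-odds scores against that background. The helper names base_frequencies and log_odds, the pseudocount, and the min_prob clamp are assumptions made for illustration; they are not part of gatac and may differ from its internal implementation.

```python
from collections import Counter
from math import log2


def base_frequencies(sequences, pseudocount=1.0):
    """Estimate (A, C, G, T) frequencies; other symbols such as N are ignored."""
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in "ACGT")
    # The pseudocount (an illustrative choice) avoids zero frequencies.
    total = sum(counts[base] + pseudocount for base in "ACGT")
    return tuple((counts[base] + pseudocount) / total for base in "ACGT")


def log_odds(pwm, bg_probs=(0.25, 0.25, 0.25, 0.25), min_prob=1e-6):
    """Convert rows of (A, C, G, T) probabilities to log2 odds scores.

    min_prob clamps zero entries so that log2 is always defined; the
    value is an assumption for this sketch.
    """
    return [
        [log2(max(p, min_prob) / q) for p, q in zip(position, bg_probs)]
        for position in pwm
    ]
```

The tuple returned by base_frequencies uses the same (A, C, G, T) order that bg_probs expects, so it could be passed as a custom 4-tuple.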

Returns:

dict[str, pd.DataFrame] Dictionary mapping group names to DataFrames with columns:

- id: Motif ID
- name: Motif name
- family: Motif family
- log2(fold change): Log2 fold enrichment
- p-value: Raw p-value
- adjusted p-value: BH-corrected p-value

Return type:

dict[str, DataFrame]
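
The adjusted p-value column is described as BH-corrected. For readers unfamiliar with the procedure, the following is a minimal pure-Python sketch of the standard Benjamini-Hochberg adjustment. The function name benjamini_hochberg is hypothetical, and the sketch is not taken from gatac, whose implementation may differ in detail.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw p-values increase.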

Examples

>>> import gatac
>>> motifs = gatac.tl.read_motifs("motifs.meme")
>>> regions = {
...     "cluster1": ["chr1:1000-1500", "chr1:5000-5500"],
...     "cluster2": ["chr2:2000-2500"],
... }
>>> results = gatac.tl.motif_enrichment(
...     motifs, regions, "genome.fa"
... )
>>> results["cluster1"]  # DataFrame with enrichment results
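>>> # GC-matched background; all_peaks is assumed to be a list of
>>> # "chr:start-end" strings defined elsewhere (not shown here)
>>> matched_bg = gatac.tl.sample_gc_matched_background(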
...     regions,
...     genome_fasta="genome.fa",
...     background_pool=all_peaks,
... )
>>> results = gatac.tl.motif_enrichment(
...     motifs,
...     regions,
...     "genome.fa",
...     background=matched_bg,
...     bg_probs="auto",
... )
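
The two values accepted by method correspond to standard upper-tail tests. The sketch below shows what each one computes for a single motif in a single group, using only the standard library. The function names and the exact parameterization (counting regions with at least one match) are assumptions for illustration, and gatac may define its counts differently.

```python
from math import comb


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n regions are drawn from N, of which K contain the motif.

    Matches the subset setting: the group is part of the population.
    """
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for n independent regions that each match with probability p.

    Here p would be the match rate observed in a separate background set.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

The hypergeometric form suits the default case, where the background is the union of all tested regions and each group is therefore a subset of it. The binomial form suits an independent background, whose match rate serves as the per-region success probability.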