gatac.tl.motif_enrichment
- gatac.tl.motif_enrichment(motifs, regions, genome_fasta, background=None, method=None, pvalue=1e-05, check_rc=True, bg_probs=(0.25, 0.25, 0.25, 0.25), motif_batch_size=16)
Identify enriched transcription factor motifs using GPU acceleration.
This function scans genomic regions for motif matches and performs statistical enrichment testing against a background set.
- Parameters:
- motifs
list[DNAMotif] List of transcription factor motifs to test
- regions
dict[str,list[str]] Groups of genomic regions to test. Keys are group names, values are lists of region strings in “chr:start-end” format. Each group is tested independently.
- genome_fasta
str or Path Path to genome FASTA file for sequence extraction
- background
list[str] or dict[str, list[str]], optional Background regions. Pass a single list to use one shared background for all groups, or a dict keyed like regions to use per-group matched backgrounds. If None, the union of all tested regions is used.
- method
{"binomial", "hypergeometric"}, optional Statistical test method. If None, uses “hypergeometric” when background is None (subset testing), else “binomial”.
- pvalue
float, default 1e-5 P-value threshold for motif matching
- check_rc
bool, default True Whether to check both strands (forward and reverse complement)
- bg_probs
{"auto", "subject", "even"} or tuple, default (0.25, 0.25, 0.25, 0.25) Background nucleotide probabilities (A, C, G, T) used when converting motifs to log-odds scores and computing match thresholds. Use "auto" to estimate base frequencies from all scanned sequences (foreground plus any provided background), "subject" as an alias for the same behavior, "even" for a uniform background, or pass a custom 4-tuple.
- motif_batch_size
int, default 16 Number of motifs of the same length to process together on GPU. Higher values increase GPU parallelism but use more memory.
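As a rough illustration of what bg_probs controls, the sketch below converts a position probability matrix to log2-odds scores against a chosen background, and estimates base frequencies from a set of sequences in the spirit of bg_probs="auto". The helper names (base_frequencies, log_odds), the pseudocount, and the eps floor are assumptions made for this example. They are not part of gatac, whose internal implementation may differ.

```python
import math
from collections import Counter


def base_frequencies(sequences, pseudocount=1.0):
    """Estimate (A, C, G, T) frequencies from sequences.

    Symbols other than A/C/G/T (for example N) are ignored.
    Hypothetical helper, not a gatac function.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    totals = [counts[base] + pseudocount for base in "ACGT"]
    grand_total = sum(totals)
    return tuple(t / grand_total for t in totals)


def log_odds(pwm, bg_probs=(0.25, 0.25, 0.25, 0.25), eps=1e-4):
    """Convert a probability matrix (rows = positions, columns = A, C, G, T)
    to log2-odds scores relative to bg_probs.

    eps floors zero probabilities so the logarithm stays finite (an assumption).
    """
    return [
        [math.log2(max(p, eps) / b) for p, b in zip(row, bg_probs)]
        for row in pwm
    ]


# Toy two-position motif: strong A, then strong G.
pwm = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]

even_scores = log_odds(pwm)  # uniform background
gc_rich_bg = base_frequencies(["GGCCGCGC", "GCGCATGC"])
gc_rich_scores = log_odds(pwm, gc_rich_bg)  # background estimated from sequences
```

With a GC-rich background, a G at the second position is less surprising, so its log-odds score is lower than under the uniform background. This is why the background choice can shift which sites pass the match threshold.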
- Returns:
Dictionary mapping group names to DataFrames with columns:
- id: Motif ID
- name: Motif name
- family: Motif family
- log2(fold change): Log2 fold enrichment
- p-value: Raw p-value
- adjusted p-value: BH-corrected p-value
- Return type:
dict[str, pd.DataFrame]
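To make the returned statistics more concrete, here is a minimal standard-library sketch of the two kinds of test named by method (hypergeometric and binomial upper-tail probabilities), a log2 fold change, and Benjamini-Hochberg adjustment. All function names, the pseudocount, and the example counts are assumptions for illustration. How gatac counts matches and computes these values internally is not specified here and may differ.

```python
import math


def hypergeom_sf(k, M, n, N):
    """P(X >= k): draw N regions without replacement from M regions,
    n of which contain the motif."""
    upper = min(n, N)
    total = sum(
        math.comb(n, i) * math.comb(M - n, N - i) for i in range(k, upper + 1)
    )
    return total / math.comb(M, N)


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with p the background hit rate."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )


def log2_fold_change(fg_hits, fg_total, bg_hits, bg_total, pseudocount=0.5):
    """Log2 ratio of foreground to background hit rates (pseudocount assumed)."""
    fg_rate = (fg_hits + pseudocount) / (fg_total + pseudocount)
    bg_rate = (bg_hits + pseudocount) / (bg_total + pseudocount)
    return math.log2(fg_rate / bg_rate)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up counts: 8 of 20 group regions contain the motif,
# versus 30 of 200 regions in the universe or background.
p_subset = hypergeom_sf(8, 200, 30, 20)    # group is a subset of the universe
p_separate = binom_sf(8, 20, 30 / 200)     # separate background gives the rate
lfc = log2_fold_change(8, 20, 30, 200)
adjusted = benjamini_hochberg([p_subset, 0.2, 0.5])
```

The hypergeometric form fits the case where each group is drawn from the union of tested regions (the default when background is None), while the binomial form treats a separate background as the source of an expected hit rate.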
Examples
>>> import gatac
>>> motifs = gatac.tl.read_motifs("motifs.meme")
>>> regions = {
...     "cluster1": ["chr1:1000-1500", "chr1:5000-5500"],
...     "cluster2": ["chr2:2000-2500"],
... }
>>> results = gatac.tl.motif_enrichment(
...     motifs, regions, "genome.fa"
... )
>>> results["cluster1"]  # DataFrame with enrichment results
>>> # all_peaks is not defined in this snippet; it stands for a user-supplied pool of candidate background regions
>>> matched_bg = gatac.tl.sample_gc_matched_background(
...     regions,
...     genome_fasta="genome.fa",
...     background_pool=all_peaks,
... )
>>> results = gatac.tl.motif_enrichment(
...     motifs,
...     regions,
...     "genome.fa",
...     background=matched_bg,
...     bg_probs="auto",
... )
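Because regions are passed as "chr:start-end" strings, it can help to validate them before starting a long GPU run. The sketch below is a hypothetical pre-check, not a gatac function. It assumes plain integer coordinates with end greater than start and chromosome names without colons, and it makes no claim about whether gatac treats coordinates as 0-based or 1-based.

```python
import re

# chrom:start-end with non-negative integer coordinates (assumed format)
_REGION_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region):
    """Split a 'chr:start-end' string into (chrom, start, end)."""
    match = _REGION_RE.match(region)
    if match is None:
        raise ValueError(f"Malformed region string: {region!r}")
    start, end = int(match["start"]), int(match["end"])
    if end <= start:
        raise ValueError(f"Region end must exceed start: {region!r}")
    return match["chrom"], start, end


def check_regions(regions):
    """Validate every region in a {group: [region, ...]} dict.

    Returns a dict with the same keys and parsed (chrom, start, end) tuples.
    """
    return {
        group: [parse_region(r) for r in region_list]
        for group, region_list in regions.items()
    }


regions = {
    "cluster1": ["chr1:1000-1500", "chr1:5000-5500"],
    "cluster2": ["chr2:2000-2500"],
}
parsed = check_regions(regions)
```

Running a check like this first surfaces typos in region strings as an immediate ValueError rather than a failure partway through sequence extraction.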