gatac.tl.scan_motifs
- gatac.tl.scan_motifs(adata, motifs, genome_fasta, *, pvalue=5e-05, check_rc=True, bg='subject', mode='gatac', key_added='motif_match', peak_batch_size=50000, coordinate_system='0-based')
Scan peaks for motif matches and create a sparse motif match matrix.
This function wraps the existing GPU motif scanning infrastructure from gatac.tl.motif and produces a boolean matrix indicating which peaks contain each motif.
- Parameters:
- adata
AnnData AnnData object with peak matrix. Peak names in adata.var_names should be in “chr:start-end” format.
- motifs
list[DNAMotif] List of motifs to scan (from read_motifs or parse_meme)
- genome_fasta
str or Path Path to genome FASTA file (supports .fa, .fasta, .fa.gz, .fasta.gz)
- pvalue
float, default 5e-5 P-value threshold for motif matching (matches the R/motifmatchr default)
- check_rc
bool, default True Whether to check both strands (forward and reverse complement)
- bg
str or tuple, default "subject" Background nucleotide probabilities (A, C, G, T). Use "subject" to compute them from extracted peak sequences, which matches motifmatchr’s default. Use "even" for a uniform background (0.25, 0.25, 0.25, 0.25), or pass a custom 4-tuple.
- key_added
str, default "motif_match" Key under which the motif match matrix is stored in adata.varm
- peak_batch_size
int, default 50000 Number of peaks to process at once on the GPU. Reduce if running out of GPU memory.
- coordinate_system
{"0-based", "1-based"}, default "0-based" Coordinate system of peak names in adata.var_names. "0-based" is BED-style half-open indexing [start, end) and is what GATAC peak callers produce, so sequences are extracted as genome[start:end]. "1-based" is the closed interval format used by R/GenomicRanges and chromVAR, so sequences are extracted as genome[start-1:end].
- mode
{"gatac", "motifmatchr"}, default "gatac" Motif scoring mode. "gatac" uses the standard natural-log odds ln(p / bg). "motifmatchr" reproduces motifmatchr/scPrinter scoring with log2(p / 0.25) - (log2(0.25) - log2(bg)).
- Returns:
None Adds to adata.varm[key_added] a sparse boolean matrix of shape (n_peaks, n_motifs) and stores motif names in adata.uns["motif_name"].
- Return type:
None
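The bg and mode parameters can be illustrated with a small standalone sketch. It does not call gatac and is not the library's implementation: the helper names (subject_background, score_gatac, score_motifmatchr) are hypothetical, skipping non-ACGT symbols such as N is an assumption, and the two scoring functions only transcribe the formulas as written in the parameter descriptions above.

```python
import math
from collections import Counter

BASES = "ACGT"


def subject_background(sequences):
    """Estimate (A, C, G, T) frequencies from the given sequences.

    Assumption: symbols other than A/C/G/T (for example N) are ignored.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in BASES)
    total = sum(counts[base] for base in BASES)
    if total == 0:
        raise ValueError("no A/C/G/T bases found in the input sequences")
    return tuple(counts[base] / total for base in BASES)


def score_gatac(p, bg):
    """Natural-log odds ln(p / bg), as described for mode="gatac"."""
    return math.log(p / bg)


def score_motifmatchr(p, bg):
    """log2(p / 0.25) - (log2(0.25) - log2(bg)), as described for mode="motifmatchr"."""
    return math.log2(p / 0.25) - (math.log2(0.25) - math.log2(bg))


# Toy "peak sequences": three of each base once the N symbols are skipped.
bg = subject_background(["AACCGGTT", "ACGTNNNN"])

# Score one matrix entry (probability 0.7 for base A) under both modes.
p_a = 0.7
gatac_score = score_gatac(p_a, bg[0])
motifmatchr_score = score_motifmatchr(p_a, bg[0])
```

With a uniform background the correction term in the motifmatchr formula is zero, so the two scores differ only by the change of logarithm base (a factor of ln 2).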
Examples
>>> import gatac as ga
>>> motifs = ga.tl.read_motifs("motifs.meme")
>>>
>>> # For GATAC-generated peaks (0-based BED format)
>>> ga.tl.scan_motifs(peak_adata, motifs, "genome.fa")
>>>
>>> # For R/chromVAR peaks (1-based GenomicRanges format)
>>> ga.tl.scan_motifs(
...     peak_adata, motifs, "genome.fa",
...     coordinate_system="1-based"
... )
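To make the coordinate_system distinction concrete, the following standalone sketch converts a "chr:start-end" peak name into Python slice bounds under both conventions. It is an illustration only and does not use gatac: peak_to_slice and the toy genome dictionary are hypothetical, and the regular expression assumes chromosome names contain no colon.

```python
import re

# Assumption: peak names look like "chr1:100-200" and the chromosome
# name itself contains no ":" character.
_PEAK_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def peak_to_slice(peak_name, coordinate_system="0-based"):
    """Return (chrom, slice_start, slice_end) for a Python string slice."""
    match = _PEAK_RE.match(peak_name)
    if match is None:
        raise ValueError(f"peak name not in 'chr:start-end' format: {peak_name!r}")
    chrom = match["chrom"]
    start = int(match["start"])
    end = int(match["end"])
    if coordinate_system == "0-based":
        # BED-style half-open [start, end): slice as genome[start:end]
        return chrom, start, end
    if coordinate_system == "1-based":
        # Closed interval [start, end]: slice as genome[start-1:end]
        return chrom, start - 1, end
    raise ValueError(f"unknown coordinate_system: {coordinate_system!r}")


# Toy genome held in memory; a real run would read sequences from a FASTA file.
genome = {"chr1": "ACGTACGTAC"}

chrom, lo, hi = peak_to_slice("chr1:2-6", "0-based")
seq_0based = genome[chrom][lo:hi]

chrom, lo, hi = peak_to_slice("chr1:2-6", "1-based")
seq_1based = genome[chrom][lo:hi]
```

The same peak name yields a 4-base sequence under "0-based" and a 5-base sequence under "1-based", so passing the wrong setting shifts every extracted sequence by one base at its start.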