gatac.tl.scan_motifs

gatac.tl.scan_motifs(adata, motifs, genome_fasta, *, pvalue=5e-05, check_rc=True, bg='subject', mode='gatac', key_added='motif_match', peak_batch_size=50000, coordinate_system='0-based')

Scan peaks for motif matches and create a sparse motif match matrix.

This function wraps the existing GPU motif scanning infrastructure from gatac.tl.motif and produces a boolean matrix indicating which peaks contain each motif.

Parameters:
adata AnnData

AnnData object with peak matrix. Peak names in adata.var_names should be in "chr:start-end" format.

motifs list[DNAMotif]

List of motifs to scan (from read_motifs or parse_meme)

genome_fasta str or Path

Path to genome FASTA file (supports .fa, .fasta, .fa.gz, .fasta.gz)

pvalue float, default 5e-5

P-value threshold for motif matching (matches R/motifmatchr default)

check_rc bool, default True

Whether to check both strands (forward and reverse complement)

bg str or tuple, default "subject"

Background nucleotide probabilities (A, C, G, T). Use "subject" to compute them from extracted peak sequences, which matches motifmatchr’s default. Use "even" for a uniform background (0.25, 0.25, 0.25, 0.25), or pass a custom 4-tuple.

key_added str, default "motif_match"

Key to store motif match matrix in adata.varm

peak_batch_size int, default 50000

Number of peaks to process at once on GPU. Reduce if running out of GPU memory.

coordinate_system {"0-based", "1-based"}, default "0-based"

Coordinate system of peak names in adata.var_names. "0-based" is BED-style half-open indexing [start, end) and is what GATAC peak callers produce, so sequences are extracted as genome[start:end]. "1-based" is the closed interval format used by R/GenomicRanges and chromVAR, so sequences are extracted as genome[start-1:end].

mode {"gatac", "motifmatchr"}, default "gatac"

Motif scoring mode. "gatac" uses the standard natural-log odds ln(p / bg). "motifmatchr" reproduces motifmatchr/scPrinter scoring with log2(p / 0.25) - (log2(0.25) - log2(bg)).
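The bg and mode options can be illustrated with a small NumPy sketch. It is not the GATAC implementation: subject_background and score_matrix are hypothetical helper names, the motif is assumed to be a 4 x width probability matrix with rows ordered A, C, G, T, and every probability is assumed to be strictly positive (no pseudocount handling is shown).

```python
import numpy as np

def subject_background(seqs):
    """Estimate (A, C, G, T) frequencies from sequences, skipping other characters such as N."""
    counts = np.zeros(4)
    for seq in seqs:
        upper = seq.upper()
        counts += [upper.count(base) for base in "ACGT"]
    return counts / counts.sum()

def score_matrix(ppm, bg, mode="gatac"):
    """Convert a (4, width) probability matrix into per-position scores.

    Hypothetical helper that follows the two formulas quoted in the
    `mode` description; it is not GATAC's own code.
    """
    ppm = np.asarray(ppm, dtype=float)
    bg = np.asarray(bg, dtype=float)[:, None]  # one background value per base (row)
    if mode == "gatac":
        # Natural-log odds: ln(p / bg)
        return np.log(ppm / bg)
    if mode == "motifmatchr":
        # log2(p / 0.25) - (log2(0.25) - log2(bg))
        return np.log2(ppm / 0.25) - (np.log2(0.25) - np.log2(bg))
    raise ValueError(f"unknown mode: {mode!r}")

# Toy inputs: two short "peak" sequences and a width-2 motif.
bg_subject = subject_background(["AACG", "TTNN"])
bg_even = (0.25, 0.25, 0.25, 0.25)
ppm = [[0.7, 0.1],
       [0.1, 0.1],
       [0.1, 0.1],
       [0.1, 0.7]]
scores_gatac = score_matrix(ppm, bg_subject, mode="gatac")
scores_mm = score_matrix(ppm, bg_even, mode="motifmatchr")
```

With an even background the correction term log2(0.25) - log2(bg) is zero, so the two modes differ only by the base of the logarithm.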

Returns:

None. Adds a sparse boolean matrix of shape (n_peaks, n_motifs) to adata.varm[key_added] and stores the motif names in adata.uns["motif_name"].

Return type:

None
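Because the function returns None, results are read back from the AnnData object. The sketch below shows one way to query a matrix of that shape. It uses toy stand-ins instead of a real AnnData object: in practice match would come from adata.varm["motif_match"], motif_names from adata.uns["motif_name"], and peak_names from adata.var_names. The helper peaks_with_motif is hypothetical, and the exact sparse format GATAC stores is an assumption here.

```python
import numpy as np
from scipy import sparse

def peaks_with_motif(match, motif_names, peak_names, motif):
    """Return the names of peaks whose row is True in the column for `motif`."""
    j = list(motif_names).index(motif)
    column = sparse.csc_matrix(match)[:, j]  # CSC makes column slicing cheap
    rows = sorted(column.nonzero()[0])
    return [peak_names[i] for i in rows]

# Toy stand-ins for adata.varm["motif_match"], adata.uns["motif_name"]
# and adata.var_names (3 peaks x 2 motifs).
match = sparse.csr_matrix(np.array([[True, False],
                                    [False, False],
                                    [True, True]]))
motif_names = ["CTCF", "GATA1"]
peak_names = ["chr1:100-600", "chr1:900-1400", "chr2:50-550"]

ctcf_peaks = peaks_with_motif(match, motif_names, peak_names, "CTCF")

# Number of peaks matched by each motif (column sums of the boolean matrix).
peaks_per_motif = np.asarray(match.astype(np.int64).sum(axis=0)).ravel()
```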

Examples

>>> import gatac as ga
>>> motifs = ga.tl.read_motifs("motifs.meme")
>>>
>>> # For GATAC-generated peaks (0-based BED format)
>>> ga.tl.scan_motifs(peak_adata, motifs, "genome.fa")
>>>
>>> # For R/chromVAR peaks (1-based GenomicRanges format)
>>> ga.tl.scan_motifs(
...     peak_adata, motifs, "genome.fa",
...     coordinate_system="1-based"
... )
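The effect of coordinate_system on sequence extraction can be sketched in plain Python. This is an illustration of the slicing rule described for that parameter, not GATAC's code: parse_peak and extract_sequence are hypothetical helpers, and a dict of strings stands in for the genome FASTA.

```python
def parse_peak(name):
    """Split a 'chr:start-end' peak name into (chrom, start, end)."""
    chrom, span = name.rsplit(":", 1)
    start, end = span.split("-")
    return chrom, int(start), int(end)

def extract_sequence(genome, name, coordinate_system="0-based"):
    """Slice the peak sequence out of `genome` (a dict of chrom -> sequence string)."""
    chrom, start, end = parse_peak(name)
    if coordinate_system == "1-based":
        # Closed interval [start, end] becomes the half-open slice [start-1, end)
        start -= 1
    elif coordinate_system != "0-based":
        raise ValueError(f"unknown coordinate_system: {coordinate_system!r}")
    return genome[chrom][start:end]

# Toy genome standing in for the FASTA file.
genome = {"chr1": "ACGTACGTAC"}

# The same four bases, named in each convention.
bed_style = extract_sequence(genome, "chr1:2-6")             # genome[2:6]
r_style = extract_sequence(genome, "chr1:3-6", "1-based")    # genome[2:6]
```

Passing a 1-based peak name with the default "0-based" setting would drop the first base of every peak, which is why the setting should match the tool that produced the peaks.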