gatac.tl.call_peaks

Contents

gatac.tl.call_peaks#

gatac.tl.call_peaks(adata, groupby, parquet_path, *, genome='hg38', q_thresh=0.05, d_treat=200, d_ctrl=10000, max_gap=30, peak_amp=200, fe_cutoff=1.0, key_added='gmacs', inplace=True, verbose=True, batch_size=10, filter_chromosomes=True)#

GPU-accelerated peak calling per cluster using gmacs algorithm.

Calls peaks for each group of cells defined by groupby, using the fragment data from parquet files. Similar to SnapATAC2’s macs3 function but using GPU-accelerated gmacs.

Parameters:
adata AnnData

Annotated data matrix (typically a tiled matrix). If source_file is in obs, it’s used to map cells to parquet files.

groupby str

Key in adata.obs containing cluster/group assignments.

parquet_path str or Path

Path to parquet file or directory containing parquet files. If directory, uses source_file obs to find specific files.

genome str or dict

Genome name (e.g., ‘hg38’, ‘mm10’) or dict of chromosome sizes. Used to compute genome_length for significance calculations and for chromosome filtering.

q_thresh float

Q-value threshold for peak calling (default: 0.05)

d_treat int

Treatment extension distance in bp (default: 200)

d_ctrl int

Large local control extension distance in bp (default: 10000). Corresponds to MACS3’s –llocal parameter.

max_gap int

Maximum gap for merging peaks (default: 30)

peak_amp int

Minimum peak amplitude in bp (default: 200)

fe_cutoff float

Fold enrichment cutoff (default: 1.0) Peaks with signal_value < fe_cutoff are filtered out

key_added str

Key in adata.uns where peaks will be stored (default: ‘gmacs’)

inplace bool

If True, store results in adata.uns[key_added] (default: True)

verbose bool

If True, show tqdm progress bars (default: True)

batch_size int

Number of parquet row groups to load at once (default: 10). Increase for faster processing, decrease to reduce memory usage.

filter_chromosomes bool

If True, only keep fragments from chromosomes recognized by the provided genome (keys if genome is a dict, or standard chromosomes for the named genome) (default: True).

Returns:

dict[str, pd.DataFrame] or None If inplace=False, returns dict mapping group names to peak DataFrames. Each DataFrame has columns: chrom, start, end, name, score, strand, signal_value, p_value, q_value, peak.

Return type:

Optional[dict]

Examples

>>> import gatac
>>> # Load tiled AnnData with cluster assignments
>>> adata = sc.read_h5ad("tiled_matrix.h5ad")
>>> # Call peaks per cluster (default settings for ATAC-seq)
>>> gatac.tl.call_peaks(
...     adata,
...     groupby="leiden",
...     parquet_path="/path/to/fragments.parquet",
... )
>>> # Access peaks for cluster "0"
>>> peaks_cluster0 = adata.uns["gmacs"]["0"]