gatac.pp.compute_metrics

gatac.pp.compute_metrics(parquet_path, tss_source, window_size=2000, smooth_window=11, min_unique_frags=100, chrom_sizes=None, exclude_chroms=['chrM', 'M'], row_groups_per_batch=64)

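Compute per-barcode TSS enrichment scores, together with unique-fragment counts, duplicate fraction, and mitochondrial fraction, using GPU-accelerated streaming over parquet row groups.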

This function streams fragment data in batches of row groups to balance speed and GPU memory usage.
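The batching idea can be sketched without any GPU code. The helper below is hypothetical and not part of gatac: it only groups row-group indices into batches of `row_groups_per_batch`. How gatac reads each batch internally is not documented here; reading a batch with pyarrow's `ParquetFile.read_row_groups` is one plausible approach and is stated as an assumption.

```python
def batch_row_groups(n_row_groups, row_groups_per_batch=64):
    """Group row-group indices 0..n_row_groups-1 into consecutive batches.

    Hypothetical helper for illustration; not a gatac function.
    """
    if row_groups_per_batch < 1:
        raise ValueError("row_groups_per_batch must be >= 1")
    return [
        list(range(start, min(start + row_groups_per_batch, n_row_groups)))
        for start in range(0, n_row_groups, row_groups_per_batch)
    ]


# A file with 150 row groups yields two full batches and one partial batch.
batches = batch_row_groups(150, row_groups_per_batch=64)
print([len(b) for b in batches])  # prints [64, 64, 22]

# Assumption: each batch could then be loaded with pyarrow, e.g.
#   pq.ParquetFile(path).read_row_groups(batch)
# before being handed to the GPU.
```

A larger `row_groups_per_batch` means fewer, bigger batches (less per-batch overhead, more GPU memory per batch); a smaller value trades speed for a lower memory peak.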

Parameters:
parquet_path str or Path

Path to the parquet file containing ATAC fragments.

tss_source str, Path, or pl.DataFrame

Either a path to a GTF file (the TSS table is loaded internally) or a pre-built polars.DataFrame of TSS positions (columns: chrom, tss, strand). When a pre-built DataFrame is passed, the underlying GTF file is not re-read.

window_size int

Distance around each TSS to consider, in base pairs (default: 2000)

smooth_window int

Window size for smoothing the TSS signal (default: 11)

min_unique_frags int

Minimum unique fragments per cell to include in output (default: 100)

chrom_sizes dict[str, int] | None

Mapping of chromosome name to chromosome length. If provided, only fragments on the chromosomes named in this mapping are included (default: None).

exclude_chroms list[str] | None

Chromosomes to exclude from TSS enrichment calculation (default: ["chrM", "M"])

row_groups_per_batch int

Number of parquet row groups to process in each GPU batch (default: 64)
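A pre-built table for `tss_source` needs the columns chrom, tss, and strand. The following is a minimal sketch of deriving those columns from GTF text. The function name is hypothetical, and several choices are assumptions rather than gatac's behaviour: it uses `gene` records only, keeps the 1-based GTF coordinate as the TSS, and skips records without a `+` or `-` strand. The resulting dict can be wrapped with `polars.DataFrame(tss_columns)`.

```python
def tss_columns_from_gtf_lines(lines):
    """Collect chrom/tss/strand columns from GTF 'gene' records.

    Hypothetical helper for illustration; not a gatac function.
    """
    columns = {"chrom": [], "tss": [], "strand": []}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GTF header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # assumption: one TSS per gene record
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        if strand not in ("+", "-"):
            continue  # unstranded records have no defined TSS
        # GTF intervals are 1-based and inclusive: the TSS is the start
        # coordinate on the '+' strand and the end coordinate on '-'.
        columns["chrom"].append(chrom)
        columns["tss"].append(start if strand == "+" else end)
        columns["strand"].append(strand)
    return columns


example_lines = [
    "#!genome-build GRCh38",
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1";',
    'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "G1";',
    'chr1\tHAVANA\tgene\t14404\t29570\t.\t-\t.\tgene_id "G2";',
]
tss_columns = tss_columns_from_gtf_lines(example_lines)
print(tss_columns)
# prints {'chrom': ['chr1', 'chr1'], 'tss': [11869, 29570], 'strand': ['+', '-']}
```

Building the table once and passing it to several `compute_metrics` calls avoids re-reading the GTF file for each sample.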

Returns:

results cudf.DataFrame

DataFrame with columns: ['barcode', 'tsse_score', 'n_unique', 'duplicate_fraction', 'mito_fraction']

Return type:

cudf.DataFrame
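The exact formula behind `tsse_score` is not given on this page. As orientation only, the sketch below shows one common convention for a TSS enrichment score: aggregate fragment coverage across all TSS windows, divide by the mean coverage in the outer flanks, smooth, and take the peak. The function names, the centered moving average, and the 100 bp flank are assumptions for illustration, not gatac's implementation.

```python
def moving_average(values, window):
    """Centered moving average; edge positions average the part that fits."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def tss_enrichment(profile, smooth_window=11, flank=100):
    """Peak of the smoothed, background-normalized profile.

    `profile` is aggregate coverage at each offset in the TSS window.
    The background is the mean coverage of the outermost `flank`
    positions on both sides (an assumed convention).
    """
    background = (sum(profile[:flank]) + sum(profile[-flank:])) / (2 * flank)
    if background == 0:
        return 0.0
    normalized = [v / background for v in profile]
    return max(moving_average(normalized, smooth_window))


# Toy profile for window_size=2000: flat coverage of 1.0 over 4001 offsets,
# with an 11 bp plateau of 5.0 centered on the TSS (index 2000).
profile = [1.0] * 4001
for i in range(1995, 2006):
    profile[i] = 5.0
print(tss_enrichment(profile))  # prints 5.0
```

Under this convention a flat profile scores 1.0, and larger values indicate stronger signal concentration at transcription start sites.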

Examples

>>> import gatac as ga
>>> # Pass a GTF path directly
>>> metrics = ga.pp.compute_metrics(
...     "pbmc.parquet",
...     "GRCh38.gtf.gz",
...     min_unique_frags=100,
...     exclude_chroms=["chrM", "M"],
... )
>>> metrics.columns.tolist()
['barcode', 'tsse_score', 'n_unique', 'duplicate_fraction', 'mito_fraction']