gatac.pp.compute_metrics

gatac.pp.compute_metrics(parquet_path, tss_source, window_size=2000, smooth_window=11, min_unique_frags=100, chrom_sizes=None, exclude_chroms=['chrM', 'M'], row_groups_per_batch=64)

Compute per-cell TSS enrichment scores and related QC metrics using GPU-accelerated streaming over parquet row groups.

This function streams fragment data in batches of row groups to balance speed and GPU memory usage.
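The batching idea can be sketched in plain Python. This is a minimal illustration, not gatac's internal reader: the helper name `iter_row_group_batches` is hypothetical, and the row-group count is passed in by hand. In practice it could come from something like `pyarrow.parquet.ParquetFile(path).num_row_groups`, with each batch of indices handed to `ParquetFile.read_row_groups`.

```python
from typing import Iterator


def iter_row_group_batches(
    n_row_groups: int, row_groups_per_batch: int = 64
) -> Iterator[list[int]]:
    """Yield consecutive lists of row-group indices, each at most
    `row_groups_per_batch` long (hypothetical helper, for illustration)."""
    if row_groups_per_batch < 1:
        raise ValueError("row_groups_per_batch must be >= 1")
    for start in range(0, n_row_groups, row_groups_per_batch):
        stop = min(start + row_groups_per_batch, n_row_groups)
        yield list(range(start, stop))


# A file with 150 row groups and the default batch size yields three batches.
batches = list(iter_row_group_batches(150, 64))
print([len(b) for b in batches])  # [64, 64, 22]
```

A smaller `row_groups_per_batch` means more, smaller batches, which is the speed versus GPU memory trade-off described above.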

Parameters:

parquet_path str or Path: Path to the parquet file containing ATAC fragments.
tss_source str, Path, or pl.DataFrame: Either a path to a GTF file (the TSS table is loaded internally) or a pre-built polars.DataFrame of TSS positions (columns: chrom, tss, strand). When a pre-built DataFrame is passed, the underlying GTF file is not re-read.
window_size int: Distance around TSS to consider (default: 2000)
smooth_window int: Window size for smoothing the TSS signal (default: 11)
min_unique_frags int: Minimum unique fragments per cell to include in output (default: 100)
chrom_sizes dict[str, int] | None: If provided, only include fragments on these chromosomes (default: None)
exclude_chroms list[str] | None: Chromosomes to exclude from TSS enrichment calculation (default: ["chrM", "M"])
row_groups_per_batch int: Number of parquet row groups to process in each GPU batch (default: 64)
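To make window_size and smooth_window more concrete, here is a minimal pure-Python sketch of one common way a TSS enrichment score is defined: coverage is aggregated by offset from the TSS over the window, smoothed with a moving average, and the peak is divided by the mean coverage in the outer flanks. This is an illustration under assumed conventions, not necessarily the exact computation gatac performs. The function names, the 100-position flank, and the edge handling of the moving average are all assumptions.

```python
def moving_average(values: list[float], width: int) -> list[float]:
    """Centered moving average; the window shrinks at the edges (assumed)."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def tsse_from_profile(
    profile: list[float], smooth_window: int = 11, flank: int = 100
) -> float:
    """Peak of the smoothed profile divided by mean flank coverage.

    profile[i] is the aggregate coverage at offset (i - window_size) from
    the TSS, so len(profile) == 2 * window_size + 1. The flank width is an
    assumption made for this sketch.
    """
    flanks = profile[:flank] + profile[-flank:]
    background = sum(flanks) / len(flanks)
    if background == 0:
        return 0.0
    smoothed = moving_average(profile, smooth_window)
    return max(smoothed) / background


# Toy profile: flat background of 1.0 with a plateau of 10.0 within
# 50 positions of the TSS.
window_size = 2000
profile = [
    10.0 if abs(i - window_size) <= 50 else 1.0
    for i in range(2 * window_size + 1)
]
score = tsse_from_profile(profile)
print(score)  # 10.0
```

In this toy case the signal near the TSS is ten times the flank background, so the score is 10. A wider smooth_window damps narrow spikes before the maximum is taken.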

Returns:

results: cudf.DataFrame with columns ['barcode', 'tsse_score', 'n_unique', 'duplicate_fraction', 'mito_fraction']

Return type:

cudf.DataFrame

Examples

>>> import gatac as ga
>>> # Pass a GTF path directly
>>> metrics = ga.pp.compute_metrics(
...     "pbmc.parquet",
...     "GRCh38.gtf.gz",
...     min_unique_frags=100,
...     exclude_chroms=["chrM", "M"],
... )
>>> metrics.columns.tolist()
['barcode', 'tsse_score', 'n_unique', 'duplicate_fraction', 'mito_fraction']
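The pre-built tss_source option expects columns chrom, tss, and strand. The sketch below shows one way such a table could be assembled from GTF gene records using only the standard library. The helper `tss_records` and the inline GTF text are hypothetical, and the coordinates are left 1-based as in GTF; whether gatac expects 0-based or 1-based TSS positions is not stated here, so check before relying on this.

```python
import csv
import io

# Three made-up GTF rows: two genes and one exon (for illustration only).
GTF_TEXT = (
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1";\n'
    'chr1\tHAVANA\tgene\t14404\t29570\t.\t-\t.\tgene_id "G2";\n'
    'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "G1";\n'
)


def tss_records(gtf_handle, feature="gene"):
    """Collect chrom/tss/strand columns from GTF rows of one feature type."""
    table = {"chrom": [], "tss": [], "strand": []}
    reader = csv.reader(gtf_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, comment lines, malformed rows, other features.
        if not row or row[0].startswith("#") or len(row) < 9:
            continue
        if row[2] != feature:
            continue
        chrom, start, end, strand = row[0], int(row[3]), int(row[4]), row[6]
        # The TSS is the 5' end: `start` on the + strand, `end` on the - strand.
        table["chrom"].append(chrom)
        table["tss"].append(start if strand == "+" else end)
        table["strand"].append(strand)
    return table


tss_table = tss_records(io.StringIO(GTF_TEXT))
print(tss_table)
# {'chrom': ['chr1', 'chr1'], 'tss': [11869, 29570], 'strand': ['+', '-']}

# With polars and gatac installed, the table could then be passed directly:
#   import polars as pl
#   metrics = ga.pp.compute_metrics("pbmc.parquet", pl.DataFrame(tss_table))
```

Building the table once and reusing it across calls avoids re-reading the GTF file for every sample.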
