gatac.pp.compute_metrics
- gatac.pp.compute_metrics(parquet_path, tss_source, window_size=2000, smooth_window=11, min_unique_frags=100, chrom_sizes=None, exclude_chroms=['chrM', 'M'], row_groups_per_batch=64)
Compute TSS enrichment scores using GPU-accelerated streaming via row groups.
This function streams fragment data in batches of row groups to balance speed and GPU memory usage.
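The batching described above can be pictured with a small, library-independent sketch. It is not gatac's implementation: the helper name `row_group_batches` is hypothetical, and the assumption is simply that consecutive row-group indices are grouped into chunks of `row_groups_per_batch`, with a shorter final chunk when the count does not divide evenly.

```python
def row_group_batches(num_row_groups, row_groups_per_batch=64):
    """Yield lists of consecutive row-group indices, one list per batch.

    Hypothetical helper for illustration only; gatac's internal batching
    may differ.
    """
    if row_groups_per_batch < 1:
        raise ValueError("row_groups_per_batch must be >= 1")
    for start in range(0, num_row_groups, row_groups_per_batch):
        # The last batch is truncated at the end of the file.
        stop = min(start + row_groups_per_batch, num_row_groups)
        yield list(range(start, stop))


# A file with 150 row groups, processed 64 row groups at a time.
batches = list(row_group_batches(150, 64))
print([len(b) for b in batches])  # [64, 64, 22]
```

Under this reading, a larger `row_groups_per_batch` means fewer, bigger batches (more GPU memory per batch), and a smaller value means more, smaller batches.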
- Parameters:
- parquet_path
str or Path Path to the Parquet file containing ATAC fragments.
- tss_source
str, Path, or pl.DataFrame Either a path to a GTF file (the TSS table is loaded internally) or a pre-built
polars.DataFrame of TSS positions (columns: chrom, tss, strand). When a pre-built DataFrame is passed, the underlying GTF file is not re-read.
- window_size
int Distance around TSS to consider (default: 2000)
- smooth_window
int Window size for smoothing the TSS signal (default: 11)
- min_unique_frags
int Minimum unique fragments per cell to include in output (default: 100)
- chrom_sizes
dict[str,int] | None If provided, only include fragments on these chromosomes.
- exclude_chroms
list[str] | None Chromosomes to exclude from TSS enrichment calculation (default: ["chrM", "M"])
- row_groups_per_batch
int Number of parquet row groups to process in each GPU batch (default: 64)
- Returns:
results : cudf.DataFrame DataFrame with columns: ['barcode', 'tsse_score', 'n_unique', 'duplicate_fraction', 'mito_fraction']
- Return type:
cudf.DataFrame
Examples
>>> import gatac as ga
>>> # Pass a GTF path directly
>>> metrics = ga.pp.compute_metrics(
...     "pbmc.parquet",
...     "GRCh38.gtf.gz",
...     min_unique_frags=100,
...     exclude_chroms=["chrM", "M"],
... )
>>> metrics.columns.tolist()
['barcode', 'tsse_score', 'n_unique', 'duplicate_fraction', 'mito_fraction']
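For the pre-built `tss_source` form, the sketch below shows one way the `chrom`, `tss`, `strand` rows could be derived from GTF text in plain Python. It is an illustration under stated assumptions, not how gatac loads its TSS table: the helper `tss_rows_from_gtf_lines` is hypothetical, only `gene` records are used, and coordinates are kept as the 1-based GTF values (start for `+` genes, end for `-` genes). Which feature type and coordinate convention gatac expects is not stated here.

```python
def tss_rows_from_gtf_lines(lines):
    """Collect one {chrom, tss, strand} row per GTF 'gene' record.

    Hypothetical helper for illustration; not part of gatac.
    """
    rows = []
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # GTF has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        if strand not in ("+", "-"):
            continue
        # The TSS is the 5' end: start on '+', end on '-'.
        tss = start if strand == "+" else end
        rows.append({"chrom": chrom, "tss": tss, "strand": strand})
    return rows


example_gtf = [
    "#!genome-build GRCh38\n",
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1";\n',
    'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "G1";\n',
    'chr1\tHAVANA\tgene\t14404\t29570\t.\t-\t.\tgene_id "G2";\n',
]
rows = tss_rows_from_gtf_lines(example_gtf)
print(rows)
```

With polars installed, `pl.DataFrame(rows)` turns these rows into a DataFrame with the three documented columns, which could then be passed as `tss_source` in place of the GTF path. The column dtypes gatac expects are an assumption to verify.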