gatac metrics#

Compute per-barcode quality metrics from a fragment Parquet file. Computation runs on the GPU and streams the file in batches of Parquet row groups, so files larger than GPU VRAM can still be processed.


Synopsis#

gatac metrics <input.parquet> -g <annotations.gtf>
              [-o OUTPUT] [--min-frags N]
              [--batch-size N] [--memory-resource RESOURCE]

Arguments#

Positional#

| Argument | Description |
| --- | --- |
| input.parquet | Path to the fragment Parquet file |

Options#

| Flag | Default | Description |
| --- | --- | --- |
| -g, --gtf | required | GTF annotation file (used to extract TSS positions) |
| -o, --output | <input>_metrics.csv | Output CSV path |
| --min-frags | 100 | Minimum unique fragments; barcodes below this are excluded |
| --batch-size | 64 | Parquet row-groups processed per GPU batch |
| --memory-resource | cuda-async | RAPIDS memory resource (cuda-async, managed, managed-pool, cuda) |


Computed metrics#

| Column | Description |
| --- | --- |
| barcode | Cell barcode |
| tsse_score | TSS enrichment score (signal/background ratio) |
| n_unique | Number of unique (deduplicated) fragments |
| duplicate_fraction | Fraction of total reads that are duplicates |
| mito_fraction | Fraction of fragments mapping to mitochondrial contigs |
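
To make the simpler columns concrete, here is a minimal pure-Python sketch of how n_unique, duplicate_fraction, and mito_fraction could be derived from a handful of fragment records. It is not gatac's implementation. The record layout (chrom, start, end, barcode, read_count), the mitochondrial contig names, and the exact formulas are assumptions chosen to match the column descriptions. The function name toy_metrics is hypothetical.

```python
from collections import defaultdict

# Assumed toy fragment records: (chrom, start, end, barcode, read_count).
# read_count is taken to be the number of sequenced reads that collapsed
# into that one unique fragment.
FRAGMENTS = [
    ("chr1", 100, 250, "AAAC", 2),
    ("chr1", 400, 610, "AAAC", 1),
    ("chrM", 50, 180, "AAAC", 1),
    ("chr2", 900, 1100, "TTTG", 3),
    ("chrM", 10, 150, "TTTG", 1),
]

MITO_CONTIGS = {"chrM", "MT"}  # assumption: common mitochondrial contig names


def toy_metrics(fragments, mito_contigs=MITO_CONTIGS):
    """Return {barcode: metrics dict} under the assumed definitions."""
    unique = defaultdict(int)  # unique fragments per barcode
    total = defaultdict(int)   # total reads per barcode
    mito = defaultdict(int)    # unique mitochondrial fragments per barcode
    for chrom, _start, _end, barcode, read_count in fragments:
        unique[barcode] += 1
        total[barcode] += read_count
        if chrom in mito_contigs:
            mito[barcode] += 1
    return {
        bc: {
            "n_unique": unique[bc],
            # Assumed definition: share of reads beyond the first per fragment.
            "duplicate_fraction": 1 - unique[bc] / total[bc],
            # Assumed definition: mitochondrial share of unique fragments.
            "mito_fraction": mito[bc] / unique[bc],
        }
        for bc in unique
    }


metrics = toy_metrics(FRAGMENTS)
print(metrics["AAAC"])
```

In this toy input, barcode AAAC has 3 unique fragments backed by 4 reads, so its duplicate fraction under the assumed formula is 0.25.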


Examples#

Basic usage#

gatac metrics pbmc.parquet -g GRCh38.gtf.gz -o pbmc_metrics.csv

Adjust batch size for limited VRAM#

gatac metrics pbmc.parquet -g GRCh38.gtf.gz --batch-size 32
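
The effect of --batch-size can be pictured as grouping row-group indices into fixed-size chunks. The sketch below is an illustration only, not gatac's internals. The helper row_group_batches and the 150-row-group file are hypothetical. It shows that halving the batch size means fewer row groups are handled at once, at the cost of more batches.

```python
def row_group_batches(num_row_groups, batch_size=64):
    """Yield lists of row-group indices, at most batch_size per list."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, num_row_groups, batch_size):
        yield list(range(start, min(start + batch_size, num_row_groups)))


# A hypothetical file with 150 row groups.
batches_64 = list(row_group_batches(150, batch_size=64))
batches_32 = list(row_group_batches(150, batch_size=32))
print(len(batches_64), len(batches_32))  # 3 5
```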

Use managed memory (helpful for very large files)#

gatac metrics pbmc.parquet -g GRCh38.gtf.gz \
    --memory-resource managed-pool

Python equivalent#

import gatac as ga

metrics = ga.pp.compute_metrics(
    "pbmc.parquet",
    "GRCh38.gtf.gz",
    min_unique_frags=100,
    row_groups_per_batch=64,
)
metrics.to_pandas().to_csv("pbmc_metrics.csv", index=False)

Downstream use#

The output CSV is accepted by gatac filter, gatac tile, and gatac gene via the --metrics flag for on-the-fly quality filtering:

gatac filter pbmc.parquet \
    --metrics pbmc_metrics.csv \
    --filter "tsse_score > 5 and n_unique > 1000" \
    -o pbmc_filtered.parquet
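
To preview which barcodes such a filter would keep, the metrics CSV can be inspected with the Python standard library alone. The sketch below applies the same thresholds as the expression above to a few made-up rows. The sample values, the column order, and the helper passing_barcodes are assumptions for illustration. It does not reproduce how gatac filter parses or evaluates the --filter expression.

```python
import csv
import io

# Hypothetical metrics rows using the documented column names.
CSV_TEXT = """barcode,tsse_score,n_unique,duplicate_fraction,mito_fraction
AAAC,7.2,2500,0.31,0.02
CCGT,3.1,4000,0.25,0.01
TTTG,9.8,800,0.40,0.05
"""


def passing_barcodes(csv_file, min_tsse=5.0, min_unique=1000):
    """Return barcodes with tsse_score > min_tsse and n_unique > min_unique."""
    reader = csv.DictReader(csv_file)
    return [
        row["barcode"]
        for row in reader
        if float(row["tsse_score"]) > min_tsse
        and int(row["n_unique"]) > min_unique
    ]


kept = passing_barcodes(io.StringIO(CSV_TEXT))
print(kept)  # ['AAAC']
```

To run this against a real file, pass an open file handle instead of the io.StringIO object, for example open("pbmc_metrics.csv", newline="").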