gatac genescore

`gatac genescore`#

Build a cell × gene score matrix from an ATAC-seq fragment Parquet file. This is a faithful GPU port of ArchR’s addGeneScoreMatrix: tile insertion counts are weighted by their signed distance to each gene and an inverse gene-width factor, then summed and per-cell normalised.

For the simpler SnapATAC2-style paired-insertion count, see gatac gene instead.

Synopsis#

gatac genescore <input.parquet> -g <annotations.gtf>
                [-o OUTPUT] [--gene-model EXPR] [--tile-size N]
                [--extend-upstream MIN MAX] [--extend-downstream MIN MAX]
                [--gene-upstream N] [--gene-downstream N]
                [--no-gene-boundaries] [--use-tss]
                [--ceiling N] [--gene-scale-factor F] [--scale-to F]
                [-m MIN_FRAGS] [-e CHROMS ...]
                [--metrics METRICS] [--filter QUERY]
                [--barcode-prefix PREFIX] [--low-memory]
                [--cell-batch-size N]

Arguments#

Positional#

Argument	Description
`input.parquet`	Path to the (filtered) fragment Parquet file

Options#

Flag	Default	Description
`-g`, `--gtf`	required	GTF/GFF annotation, or CSV with columns `symbol,seqnames,start,end,strand`
`-o`, `--output`	`<input>_gene_score_matrix.h5ad`	Output h5ad path
`--gene-model`	`exp(-abs(x)/5000) + exp(-1)`	ArchR `geneModel` expression in `x` (signed distance to TSS)
`--tile-size`	`500`	Tile size (bp)
`--extend-upstream`	`1000 100000`	Min/max bp upstream extension of the regulatory window
`--extend-downstream`	`1000 100000`	Min/max bp downstream extension of the regulatory window
`--gene-upstream`	`5000`	bp the gene body is grown upstream before the model is applied
`--gene-downstream`	`0`	bp the gene body is grown downstream before the model is applied
`--no-gene-boundaries`	off	Disable neighbouring-gene boundary clipping
`--use-tss`	off	Build the model on the 1bp TSS instead of the gene body
`--ceiling`	`4`	Max insertions counted per tile (limits pileup bias)
`--gene-scale-factor`	`5.0`	Inverse-gene-width weighting scale factor
`--scale-to`	`10000.0`	Per-cell normalisation target
`-m`, `--min-fragments`	`100`	Minimum unique fragments per barcode
`-e`, `--exclude-chroms`	`chrY chrM`	Chromosomes to exclude
`--metrics`	—	Metrics CSV for quality-based filtering
`--filter`	—	Polars query string applied to metrics
`--barcode-prefix`	—	String prepended to barcodes
`--low-memory`	off	Process one Parquet row-group at a time
`--cell-batch-size`	—	Process cells in column batches (lower GPU memory)

Scoring strategy#

The gene score is a distance-weighted activity score, computed per chromosome:

Both Tn5 insertion ends of every fragment are binned into --tile-size tiles and accumulated per cell, capped at --ceiling.
For each gene, an extended regulatory window is built (gene body grown by --gene-upstream/--gene-downstream, then out to --extend-upstream/--extend-downstream), optionally clipped at neighbouring genes unless --no-gene-boundaries is set.
Each (gene, tile) pair is weighted by --gene-model evaluated on the signed distance to the TSS, times a per-gene inverse-width weight.
Gene scores are the weighted sum of tile counts, then each cell is normalised to --scale-to.

This matches ArchR’s addGeneScoreMatrix defaults and output exactly.

Examples#

Basic usage#

gatac genescore pbmc.parquet -g GRCh38.gtf.gz

With quality filtering#

gatac genescore pbmc.parquet -g GRCh38.gtf.gz \
    --metrics pbmc_metrics.csv \
    --filter "tsse_score > 5" \
    -o pbmc_gene_score.h5ad

TSS-centred model#

gatac genescore pbmc.parquet -g GRCh38.gtf.gz \
    --use-tss --gene-model "exp(-abs(x)/5000) + exp(-1)"

Lower GPU memory#

gatac genescore pbmc.parquet -g GRCh38.gtf.gz \
    --low-memory --cell-batch-size 5000

Python equivalent#

import gatac as ga

adata_score = ga.pp.make_gene_score_matrix(
    "pbmc.parquet",
    gene_anno="GRCh38.gtf.gz",
    gene_model="exp(-abs(x)/5000) + exp(-1)",
    tile_size=500,
    extend_upstream=(1000, 100000),
    extend_downstream=(1000, 100000),
    metrics="pbmc_metrics.csv",
    filter_query="tsse_score > 5",
)
adata_score.write_h5ad("pbmc_gene_score.h5ad")

Output AnnData structure#

Slot	Content
`adata.X`	Sparse cell × gene normalised score matrix
`adata.obs`	Barcode metadata
`adata.var`	Gene metadata: `name`, `chrom`, `start`, `end`, `strand`, `gene_idx` (indexed by gene `name`)