gatac genescore
Build a cell × gene score matrix from an ATAC-seq fragment Parquet file.
This is a faithful GPU port of ArchR’s addGeneScoreMatrix: tile insertion
counts are weighted by their signed distance to each gene and an inverse
gene-width factor, then summed and per-cell normalised.
For the simpler SnapATAC2-style paired-insertion count, see
gatac gene instead.
Synopsis
gatac genescore <input.parquet> -g <annotations.gtf>
    [-o OUTPUT] [--gene-model EXPR] [--tile-size N]
    [--extend-upstream MIN MAX] [--extend-downstream MIN MAX]
    [--gene-upstream N] [--gene-downstream N]
    [--no-gene-boundaries] [--use-tss]
    [--ceiling N] [--gene-scale-factor F] [--scale-to F]
    [-m MIN_FRAGS] [-e CHROMS ...]
    [--metrics METRICS] [--filter QUERY]
    [--barcode-prefix PREFIX] [--low-memory]
    [--cell-batch-size N]
Arguments
Positional
| Argument | Description |
|---|---|
| <input.parquet> | Path to the (filtered) fragment Parquet file |
Options
| Flag | Default | Description |
|---|---|---|
| -g | required | GTF/GFF annotation, or CSV with columns |
| -o | | Output h5ad path |
| --gene-model | | ArchR |
| --tile-size | | Tile size (bp) |
| --extend-upstream | | Min/max bp upstream extension of the regulatory window |
| --extend-downstream | | Min/max bp downstream extension of the regulatory window |
| --gene-upstream | | bp the gene body is grown upstream before the model is applied |
| --gene-downstream | | bp the gene body is grown downstream before the model is applied |
| --no-gene-boundaries | off | Disable neighbouring-gene boundary clipping |
| --use-tss | off | Build the model on the 1bp TSS instead of the gene body |
| --ceiling | | Max insertions counted per tile (limits pileup bias) |
| --gene-scale-factor | | Inverse-gene-width weighting scale factor |
| --scale-to | | Per-cell normalisation target |
| -m | | Minimum unique fragments per barcode |
| -e | | Chromosomes to exclude |
| --metrics | — | Metrics CSV for quality-based filtering |
| --filter | — | Polars query string applied to metrics |
| --barcode-prefix | — | String prepended to barcodes |
| --low-memory | off | Process one Parquet row-group at a time |
| --cell-batch-size | — | Process cells in column batches (lower GPU memory) |
Scoring strategy
The gene score is a distance-weighted activity score, computed per chromosome:
1. Both Tn5 insertion ends of every fragment are binned into --tile-size tiles and accumulated per cell, with each tile count capped at --ceiling.
2. For each gene, an extended regulatory window is built (gene body grown by --gene-upstream/--gene-downstream, then out to --extend-upstream/--extend-downstream) and clipped at neighbouring genes unless --no-gene-boundaries is set.
3. Each (gene, tile) pair is weighted by --gene-model evaluated on the signed distance to the TSS, times a per-gene inverse-width weight.
4. Gene scores are the weighted sum of tile counts, then each cell is normalised to --scale-to.
This matches ArchR’s addGeneScoreMatrix defaults and output exactly.
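The steps above can be illustrated with a small pure-Python sketch. It is a toy model under stated assumptions, not gatac's implementation: the function name `toy_gene_scores`, the input layout, and the parameter values are hypothetical, strand is ignored when computing the distance to the TSS, each tile is represented by its centre, and the regulatory window and inverse-width weight of each gene are supplied precomputed rather than derived from an annotation.

```python
import math
from collections import defaultdict


def gene_model(x):
    """Distance weight; the expression used in the --gene-model example."""
    return math.exp(-abs(x) / 5000) + math.exp(-1)


def toy_gene_scores(insertions, genes, tile_size, ceiling, scale_to):
    """Toy gene scoring on one chromosome (hypothetical helper).

    insertions: list of (cell, position) Tn5 insertion sites
    genes: dict name -> (window_start, window_end, tss, width_weight)
    """
    # Step 1: bin insertions into tiles per cell, then cap each tile count.
    tiles = defaultdict(int)
    for cell, pos in insertions:
        tiles[(cell, pos // tile_size)] += 1
    tiles = {key: min(count, ceiling) for key, count in tiles.items()}

    # Steps 2-3: for every tile inside a gene's (precomputed) window, weight
    # the count by the distance model and the per-gene width weight.
    scores = defaultdict(float)
    for (cell, tile), count in tiles.items():
        centre = tile * tile_size + tile_size // 2
        for name, (start, end, tss, width_weight) in genes.items():
            if start <= centre < end:
                weight = gene_model(centre - tss) * width_weight
                scores[(cell, name)] += count * weight

    # Step 4: rescale each cell so that its scores sum to scale_to.
    totals = defaultdict(float)
    for (cell, _name), score in scores.items():
        totals[cell] += score
    return {key: score * scale_to / totals[key[0]] for key, score in scores.items()}


# Toy input: cell c1 has three insertions in one tile (capped to 2) and one
# insertion in a second tile; cell c2 has a single insertion.
insertions = [("c1", 100), ("c1", 120), ("c1", 130), ("c1", 6100), ("c2", 110)]
genes = {"A": (0, 1000, 250, 1.0), "B": (6000, 7000, 6250, 1.0)}
result = toy_gene_scores(insertions, genes, tile_size=500, ceiling=2, scale_to=100.0)
for key in sorted(result):
    print(key, round(result[key], 2))
```

In this toy input both tiles sit exactly on their gene's TSS, so the capped counts (2 and 1) alone determine how c1's total of 100 is split between genes A and B.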
Examples
Basic usage
gatac genescore pbmc.parquet -g GRCh38.gtf.gz
With quality filtering
gatac genescore pbmc.parquet -g GRCh38.gtf.gz \
--metrics pbmc_metrics.csv \
--filter "tsse_score > 5" \
-o pbmc_gene_score.h5ad
TSS-centred model
gatac genescore pbmc.parquet -g GRCh38.gtf.gz \
--use-tss --gene-model "exp(-abs(x)/5000) + exp(-1)"
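To get a feel for the weighting expression passed to --gene-model above, it can be evaluated directly in Python. This sketch only reproduces the arithmetic of the expression; how gatac parses and evaluates the string internally is not shown here, and the helper name `gene_model` is hypothetical.

```python
import math


def gene_model(x: float) -> float:
    """Evaluate exp(-abs(x)/5000) + exp(-1) for a signed distance x in bp."""
    return math.exp(-abs(x) / 5000) + math.exp(-1)


# The weight decays with distance from the TSS and flattens out at exp(-1),
# so distant tiles still contribute a small, non-zero amount.
for distance in (0, 1000, 5000, -5000, 100000):
    print(f"{distance:>7} bp -> weight {gene_model(distance):.4f}")
```

The weight is symmetric in the sign of the distance, peaks at 1 + exp(-1) (about 1.37) on the TSS, and approaches exp(-1) (about 0.37) far away from it.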
Lower GPU memory
gatac genescore pbmc.parquet -g GRCh38.gtf.gz \
--low-memory --cell-batch-size 5000
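The idea behind batching cells can be sketched as splitting the cell axis into consecutive index ranges and processing one range at a time. This is an illustration of the general pattern only; the helper `cell_batches` is hypothetical and gatac's actual batching logic may differ.

```python
def cell_batches(n_cells, batch_size):
    """Yield (start, stop) index ranges that cover n_cells in order."""
    for start in range(0, n_cells, batch_size):
        yield start, min(start + batch_size, n_cells)


# 12,000 cells with a batch size of 5,000 give two full batches and one
# partial batch; only one batch of columns needs to be resident at a time.
batches = list(cell_batches(12000, 5000))
print(batches)
```

Smaller batches lower peak memory at the cost of more passes over the data.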
Python equivalent
import gatac as ga

adata_score = ga.pp.make_gene_score_matrix(
    "pbmc.parquet",
    gene_anno="GRCh38.gtf.gz",
    gene_model="exp(-abs(x)/5000) + exp(-1)",
    tile_size=500,
    extend_upstream=(1000, 100000),
    extend_downstream=(1000, 100000),
    metrics="pbmc_metrics.csv",
    filter_query="tsse_score > 5",
)
adata_score.write_h5ad("pbmc_gene_score.h5ad")
Output AnnData structure
| Slot | Content |
|---|---|
| X | Sparse cell × gene normalised score matrix |
| obs | Barcode metadata |
| var | Gene metadata: |