`gatac gene`

Build a cell × gene activity matrix from an ATAC-seq fragment Parquet file. Gene activity is scored by counting both insertion sites of each fragment over the promoter and, optionally, the gene body of every gene defined in a GTF (or GFF3) annotation.

Synopsis

gatac gene <input.parquet> -g <annotations.gtf>
           [-o OUTPUT] [--id-type TYPE]
           [--upstream N] [--downstream N]
           [--include-gene-body | --no-gene-body]
           [-m MIN_FRAGS] [-e CHROMS ...]
           [--metrics METRICS] [--filter QUERY]
           [--barcode-prefix PREFIX] [--low-memory]

Arguments

Positional

Argument	Description
`input.parquet`	Path to the (filtered) fragment Parquet file

Options

Flag	Default	Description
`-g`, `--gtf`	required	GTF (or GFF3) gene annotation file
`-o`, `--output`	`<input>_gene.h5ad`	Output h5ad path
`--id-type`	`gene`	Aggregation level: `gene` or `transcript`
`--upstream`	`2000`	Promoter extension upstream of TSS (bp)
`--downstream`	`0`	Extension downstream of TSS (bp)
`--include-gene-body`	on	Include gene body in scoring region
`--no-gene-body`	—	Score promoter region only
`-m`, `--min-fragments`	`100`	Minimum unique fragments per barcode
`-e`, `--exclude-chroms`	`chrM M`	Chromosomes to exclude
`--metrics`	—	Metrics CSV for quality-based filtering
`--filter`	—	Polars query string applied to metrics
`--barcode-prefix`	—	String prepended to barcodes
`--low-memory`	off	Process one Parquet row-group at a time
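
The interaction between `--upstream`, `--downstream`, and the gene-body switch can be pictured with a small sketch. This is an illustration only, not gatac's actual code: the helper name `scoring_region` is hypothetical, and it assumes GTF-style 1-based inclusive coordinates and a strand-aware promoter window, neither of which is confirmed by this page.

```python
from typing import Tuple


def scoring_region(start: int, end: int, strand: str,
                   upstream: int = 2000, downstream: int = 0,
                   include_gene_body: bool = True) -> Tuple[int, int]:
    """Return a 1-based inclusive (start, end) scoring interval for one gene.

    Hypothetical helper: assumes GTF-style coordinates (start <= end) and
    that "upstream" is interpreted relative to the gene's strand.
    """
    if strand == "-":
        tss = end
        # On the minus strand, upstream means higher coordinates.
        prom_start, prom_end = tss - downstream, tss + upstream
    else:
        tss = start
        prom_start, prom_end = tss - upstream, tss + downstream
    if include_gene_body:
        # Take the union of the promoter window and the annotated gene body.
        prom_start = min(prom_start, start)
        prom_end = max(prom_end, end)
    # Clamp to the first base of the chromosome.
    return max(1, prom_start), prom_end


print(scoring_region(10_000, 15_000, "+"))                    # (8000, 15000)
print(scoring_region(10_000, 15_000, "-"))                    # (10000, 17000)
print(scoring_region(10_000, 15_000, "+", 2000, 200, False))  # (8000, 10200)
```

Under these assumptions, the defaults extend each gene 2 kb upstream of its TSS, while the promoter-only variant keeps just the window around the TSS.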

Counting strategy

Gene activity is scored with the paired-insertion counting method:

For each fragment, both insertion sites (start + 1 and end) are considered.
Each insertion that falls within a gene's scoring region (the promoter, plus the gene body unless `--no-gene-body` is given) is counted for that gene and barcode.

This is conceptually equivalent to the SnapATAC2 gene-activity approach.
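
The two steps above can be sketched in plain Python. This is a minimal illustration rather than gatac's implementation: the function name `count_insertions` and the tuple layouts are hypothetical, fragments are assumed to use 0-based half-open coordinates (so `start + 1` and `end` are the 1-based insertion positions), and regions are assumed to be 1-based inclusive.

```python
from collections import Counter
from typing import Iterable, Tuple

Fragment = Tuple[str, str, int, int]  # (barcode, chrom, start, end), assumed 0-based half-open
Region = Tuple[str, str, int, int]    # (gene, chrom, start, end), assumed 1-based inclusive


def count_insertions(fragments: Iterable[Fragment],
                     regions: Iterable[Region]) -> Counter:
    """Count insertion sites per (barcode, gene) pair."""
    regions = list(regions)
    counts: Counter = Counter()
    for barcode, chrom, start, end in fragments:
        # Each fragment contributes two insertion sites.
        for site in (start + 1, end):
            for gene, r_chrom, r_start, r_end in regions:
                if chrom == r_chrom and r_start <= site <= r_end:
                    counts[(barcode, gene)] += 1
    return counts


regions = [("GENE_A", "chr1", 8_000, 15_000)]
fragments = [
    ("AAAC", "chr1", 7_900, 8_100),    # only the right end falls inside
    ("AAAC", "chr1", 9_000, 9_200),    # both ends fall inside
    ("TTTG", "chr1", 20_000, 20_150),  # neither end falls inside
]
counts = count_insertions(fragments, regions)
print(counts[("AAAC", "GENE_A")])  # 3
```

The nested loop is kept for readability; a real implementation would more likely use an interval index or sorted sweep so that each insertion is not compared against every region.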

Examples

Basic usage

gatac gene pbmc.parquet -g GRCh38.gtf.gz

With quality filtering

gatac gene pbmc.parquet -g GRCh38.gtf.gz \
    --metrics pbmc_metrics.csv \
    --filter "tsse_score > 5" \
    -o pbmc_gene.h5ad
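
To make the effect of `--metrics` and `--filter` concrete, the sketch below selects the barcodes that would survive a `tsse_score > 5` threshold. It uses only the standard library rather than Polars, and both the helper `passing_barcodes` and the metrics column layout (including the `barcode` column name) are assumptions for illustration; how gatac actually evaluates the query string is not described here.

```python
import csv
import io


def passing_barcodes(metrics_csv, column, threshold, barcode_column="barcode"):
    """Return the set of barcodes whose `column` value exceeds `threshold`."""
    reader = csv.DictReader(metrics_csv)
    return {row[barcode_column] for row in reader
            if float(row[column]) > threshold}


# Hypothetical metrics table; the real column layout may differ.
metrics = io.StringIO(
    "barcode,n_fragments,tsse_score\n"
    "AAAC,1200,7.5\n"
    "TTTG,300,3.1\n"
    "GGGA,950,5.0\n"
)
keep = passing_barcodes(metrics, "tsse_score", 5)
print(sorted(keep))  # ['AAAC']
```

Note that the comparison is strict, so a barcode with a score of exactly 5 is dropped in this sketch.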

Promoter-only scoring

gatac gene pbmc.parquet -g GRCh38.gtf.gz \
    --upstream 2000 --downstream 200 \
    --no-gene-body \
    -o pbmc_promoter.h5ad

Transcript-level aggregation

gatac gene pbmc.parquet -g GRCh38.gtf.gz --id-type transcript

Python equivalent

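import gatac as ga

# Build the cell × gene activity matrix with the same settings as the
# quality-filtering CLI example.
adata_gene = ga.pp.make_gene_matrix(
    "pbmc.parquet",
    gene_anno="GRCh38.gtf.gz",
    id_type="gene",              # aggregate per gene rather than per transcript
    upstream=2000,               # bp upstream of the TSS
    downstream=0,                # bp downstream of the TSS
    include_gene_body=True,      # score promoter plus gene body
    metrics="pbmc_metrics.csv",
    filter_query="tsse_score > 5",
)
# Write the result to disk, as the CLI does with -o.
adata_gene.write_h5ad("pbmc_gene.h5ad")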

Output AnnData structure

Slot	Content
`adata.X`	Sparse cell × gene count matrix
`adata.obs`	Barcode metadata
`adata.var`	Gene metadata: `gene_name`, `gene_id`, `chrom`, `start`, `end`