gatac gene#
Build a cell × gene activity matrix from an ATAC-seq fragment Parquet file. Accessibility is scored by counting paired fragment insertions over promoter and gene-body regions defined by a GTF annotation.
Synopsis#
gatac gene <input.parquet> -g <annotations.gtf>
           [-o OUTPUT] [--id-type TYPE]
           [--upstream N] [--downstream N]
           [--include-gene-body | --no-gene-body]
           [-m MIN_FRAGS] [-e CHROMS ...]
           [--metrics METRICS] [--filter QUERY]
           [--barcode-prefix PREFIX] [--low-memory]
Arguments#
Positional#
| Argument | Description |
|---|---|
| `<input.parquet>` | Path to the (filtered) fragment Parquet file |
Options#
| Flag | Default | Description |
|---|---|---|
| `-g` | required | GTF (or GFF3) gene annotation file |
| `-o` | | Output h5ad path |
| `--id-type` | | Aggregation level: |
| `--upstream` | | Promoter extension upstream of TSS (bp) |
| `--downstream` | | Extension downstream of TSS (bp) |
| `--include-gene-body` | on | Include gene body in scoring region |
| `--no-gene-body` | — | Score promoter region only |
| `-m` | | Minimum unique fragments per barcode |
| `-e` | | Chromosomes to exclude |
| `--metrics` | — | Metrics CSV for quality-based filtering |
| `--filter` | — | Polars query string applied to metrics |
| `--barcode-prefix` | — | String prepended to barcodes |
| `--low-memory` | off | Process one Parquet row-group at a time |
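The `--upstream`, `--downstream`, and gene-body options together define one scoring region per gene. The following is a minimal sketch of how such a region could be derived, not gatac's actual implementation. It assumes strand-aware extension around the TSS, 1-based inclusive GTF coordinates, and clipping at position 1; the `Gene` type and `scoring_region` function are hypothetical names, and the default values are illustrative only.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record with GTF-style coordinates."""
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def scoring_region(gene, upstream=2000, downstream=0, include_gene_body=True):
    """Return (chrom, start, end) of the scoring region, 1-based inclusive.

    Assumption: the TSS is the gene start on "+" and the gene end on "-",
    and "upstream" always points away from the gene body.
    """
    if gene.strand == "-":
        tss = gene.end
        lo, hi = tss - downstream, tss + upstream
    else:
        tss = gene.start
        lo, hi = tss - upstream, tss + downstream
    if include_gene_body:
        # Extend the promoter window so that it also covers the gene body.
        lo, hi = min(lo, gene.start), max(hi, gene.end)
    # Clip at the chromosome start (assumed behavior).
    return gene.chrom, max(lo, 1), hi


plus_gene = Gene("chr1", 10000, 15000, "+")
minus_gene = Gene("chr1", 10000, 15000, "-")
print(scoring_region(plus_gene, 2000, 200, include_gene_body=False))
print(scoring_region(minus_gene, 2000, 200, include_gene_body=True))
```

Under these assumptions, the promoter-only case corresponds to `--no-gene-body`, and the second call shows how the upstream extension lands past the gene end for a minus-strand gene.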
Counting strategy#
Gene activity is scored using the paired-insertion counting method:
For each fragment, both insertion sites (start + 1 and end) are considered.
Insertions that fall within a gene's scoring region (the promoter, plus the gene body unless --no-gene-body is given) are counted per gene and per barcode.
This is conceptually equivalent to the SnapATAC2 gene-activity approach.
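The counting steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the tool's code: fragments are taken to be BED-style (0-based, half-open), so `start + 1` and `end` are the 1-based insertion positions; regions are 1-based inclusive; and the hypothetical `collapse_pairs` flag controls whether a fragment with both insertions in the same region contributes one count or two, which the text above does not specify. The function name `count_paired_insertions` is also hypothetical, and the linear scan over regions stands in for the interval index a real implementation would likely use.

```python
from collections import Counter


def count_paired_insertions(fragments, regions, collapse_pairs=True):
    """Count insertions per (barcode, gene).

    fragments: iterable of (barcode, chrom, start, end), 0-based half-open.
    regions:   dict mapping gene_id -> (chrom, start, end), 1-based inclusive.
    """
    counts = Counter()
    for barcode, chrom, start, end in fragments:
        # Both insertion sites of the fragment, as 1-based positions.
        sites = (start + 1, end)
        for gene_id, (r_chrom, r_start, r_end) in regions.items():
            if r_chrom != chrom:
                continue
            hits = sum(r_start <= pos <= r_end for pos in sites)
            if hits == 0:
                continue
            # Assumption: optionally count a fragment once per region even
            # when both of its insertions fall inside that region.
            counts[(barcode, gene_id)] += 1 if collapse_pairs else hits
    return counts


regions = {
    "GENE_A": ("chr1", 8000, 15000),
    "GENE_B": ("chr1", 20000, 25000),
}
fragments = [
    ("AAAC", "chr1", 7990, 8100),    # only the end site falls in GENE_A
    ("AAAC", "chr1", 9000, 9200),    # both sites fall in GENE_A
    ("TTTG", "chr1", 14990, 20010),  # one site in GENE_A, one in GENE_B
    ("TTTG", "chr2", 9000, 9200),    # different chromosome, ignored
]
counts = count_paired_insertions(fragments, regions)
for key in sorted(counts):
    print(key, counts[key])
```

The resulting `(barcode, gene)` counts are the entries that would populate a sparse cell × gene matrix.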
Examples#
Basic usage#
gatac gene pbmc.parquet -g GRCh38.gtf.gz
With quality filtering#
gatac gene pbmc.parquet -g GRCh38.gtf.gz \
--metrics pbmc_metrics.csv \
--filter "tsse_score > 5" \
-o pbmc_gene.h5ad
Promoter-only scoring#
gatac gene pbmc.parquet -g GRCh38.gtf.gz \
--upstream 2000 --downstream 200 \
--no-gene-body \
-o pbmc_promoter.h5ad
Transcript-level aggregation#
gatac gene pbmc.parquet -g GRCh38.gtf.gz --id-type transcript
Python equivalent#
import gatac as ga

# Build the cell × gene activity matrix, keeping only barcodes that pass the metrics filter.
adata_gene = ga.pp.make_gene_matrix(
    "pbmc.parquet",
    gene_anno="GRCh38.gtf.gz",
    id_type="gene",
    upstream=2000,
    downstream=0,
    include_gene_body=True,
    metrics="pbmc_metrics.csv",
    filter_query="tsse_score > 5",
)

# Write the result to an h5ad file.
adata_gene.write_h5ad("pbmc_gene.h5ad")
Output AnnData structure#
| Slot | Content |
|---|---|
| `X` | Sparse cell × gene count matrix |
| `obs` | Barcode metadata |
| `var` | Gene metadata: |