gatac.pp.make_gene_matrix

gatac.pp.make_gene_matrix(input_parquet, gene_anno, output_path=None, id_type='gene', upstream=2000, downstream=0, include_gene_body=True, min_fragments_per_cell=100, exclude_chroms=None, metrics=None, filter_query=None, barcode_prefix=None, low_memory=False, cell_batch_size=500, gene_name_key='gene_name', gene_id_key='gene_id', transcript_name_key='transcript_name', transcript_id_key='transcript_id')

Process an ATAC fragments Parquet file and generate a gene activity matrix.

Uses a paired-insertion counting strategy matching SnapATAC2: each fragment contributes two insertions, one at its start position and one at its end position. If both insertions fall within the same gene’s regulatory domain, that gene’s count increases by 1. If they fall in different genes, each gene’s count increases by 1.
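The counting rule above can be illustrated with a small pure-Python sketch. It is not the library's GPU implementation: the helper names `genes_at` and `count_fragment`, the toy `domains` dictionary, and the half-open BED-style fragment coordinates are assumptions made for illustration only.

```python
from collections import Counter

def genes_at(pos, domains):
    """Return the genes whose half-open [start, end) domain contains pos."""
    return {gene for gene, (start, end) in domains.items() if start <= pos < end}

def count_fragment(frag_start, frag_end, domains, counts):
    """Add one fragment's contribution to counts using the paired-insertion rule."""
    # Assumption: fragments are half-open, so the two insertion sites are
    # the first base (frag_start) and the last base (frag_end - 1).
    hits = genes_at(frag_start, domains) | genes_at(frag_end - 1, domains)
    # The set union means a gene hit by both insertions is counted once,
    # while two different genes each receive +1.
    for gene in hits:
        counts[gene] += 1

# Hypothetical regulatory domains on a single chromosome.
domains = {"geneA": (1000, 5000), "geneB": (8000, 12000)}
counts = Counter()

count_fragment(1200, 1400, domains, counts)  # both insertions in geneA
count_fragment(4900, 8100, domains, counts)  # one in geneA, one in geneB
count_fragment(6000, 6200, domains, counts)  # neither insertion in a domain

print(dict(counts))
```

Taking the union of the genes hit by the two insertions is what keeps a fragment from being counted twice for the same gene.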

Parameters:
input_parquet str or Path

Path to the input Parquet file containing ATAC fragments.

gene_anno str or Path

Path to GTF/GFF gene annotation file.

output_path str or Path, optional

Path for the output .h5ad file. If None, the output path is derived from the input filename.

id_type str

Feature type to use: “gene” or “transcript” (default: “gene”).

upstream int

Base pairs upstream of TSS to include (default: 2000).

downstream int

Base pairs to include downstream of the regulatory domain (default: 0).

include_gene_body bool

Whether to include the gene body in the regulatory domain (default: True).

min_fragments_per_cell int

Minimum number of fragments required per barcode (default: 100).

exclude_chroms list, optional

List of chromosomes to exclude (default: None).

metrics str, Path, or cudf.DataFrame, optional

Path to a CSV file or a cuDF DataFrame containing cell metrics for filtering.

filter_query str, optional

Query string for filtering cells based on metrics (e.g. “tsse_score > 5”).

barcode_prefix str, optional

Prefix to add to barcodes (default: None).

low_memory bool

Use low-memory mode for Parquet reading (default: False).

cell_batch_size int

Number of cells to process per batch (default: 500). Lower values reduce GPU memory usage but may be slower.

gene_name_key str

Key for gene name in GTF attributes (default: “gene_name”).

gene_id_key str

Key for gene ID in GTF attributes (default: “gene_id”).

transcript_name_key str

Key for transcript name in GTF attributes (default: “transcript_name”).

transcript_id_key str

Key for transcript ID in GTF attributes (default: “transcript_id”).
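The four `*_key` parameters name attributes in the ninth column of a GTF record, where attributes are written as `key "value";` pairs. The sketch below shows how such keys could be looked up with the standard library. The helper names `parse_attributes` and `feature_ids` and the sample record are hypothetical, and the library's own parser may behave differently.

```python
import re

# Matches GTF-style attributes of the form: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')

def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))

def feature_ids(field, id_key="gene_id", name_key="gene_name"):
    """Return (id, name) for one record; a missing key yields None."""
    attrs = parse_attributes(field)
    return attrs.get(id_key), attrs.get(name_key)

# Hypothetical attribute column from a GENCODE-style annotation.
field = 'gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_type "lncRNA";'
ids = feature_ids(field)

# An annotation that stores the symbol under another attribute name
# (here a hypothetical "gene") would need a different name_key.
alt_field = 'gene_id "DDX11L1"; gene "DDX11L1";'
alt_ids = feature_ids(alt_field, name_key="gene")

print(ids, alt_ids)
```

Changing a key is only needed when the annotation file does not use the default attribute names.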

Returns:

adata : AnnData

AnnData object containing the gene activity matrix.

Return type:

sc.AnnData

Examples

>>> import gatac as ga
>>> adata_gene = ga.pp.make_gene_matrix(
...     "pbmc_filtered.parquet",
...     gene_anno="GRCh38.gtf.gz",
...     id_type="gene",
...     upstream=2000,
...     downstream=0,
...     include_gene_body=True,
... )
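The call above sets `upstream=2000`, `downstream=0`, and `include_gene_body=True`. As a rough sketch of how these three settings might combine into a regulatory domain, the function below extends a gene interval in a strand-aware way. This is one plausible reading of the parameter descriptions, not the library's code: the function name `regulatory_domain`, the half-open coordinates, and the choice to apply `downstream` past the gene end (or past the TSS when the gene body is excluded) are all assumptions.

```python
def regulatory_domain(start, end, strand, upstream=2000, downstream=0,
                      include_gene_body=True):
    """Return a half-open (domain_start, domain_end) for one gene.

    Assumes the gene occupies the half-open interval [start, end).
    """
    if strand == "+":
        # TSS is the first base of the gene on the forward strand.
        body_start, body_end = (start, end) if include_gene_body else (start, start + 1)
        lo, hi = body_start - upstream, body_end + downstream
    else:
        # TSS is the last base of the gene on the reverse strand,
        # so "upstream" extends toward larger coordinates.
        body_start, body_end = (start, end) if include_gene_body else (end - 1, end)
        lo, hi = body_start - downstream, body_end + upstream
    # Clamp so the domain never starts before the chromosome does.
    return max(0, lo), hi

plus = regulatory_domain(10_000, 15_000, "+")
minus = regulatory_domain(10_000, 15_000, "-")
promoter_only = regulatory_domain(10_000, 15_000, "+", include_gene_body=False)

print(plus, minus, promoter_only)
```

Under these assumptions, the default settings cover the gene body plus a 2 kb promoter-proximal window on the appropriate side of the gene.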