gatac.pp.make_gene_matrix

gatac.pp.make_gene_matrix(input_parquet, gene_anno, output_path=None, id_type='gene', upstream=2000, downstream=0, include_gene_body=True, min_fragments_per_cell=100, exclude_chroms=None, metrics=None, filter_query=None, barcode_prefix=None, low_memory=False, cell_batch_size=500, gene_name_key='gene_name', gene_id_key='gene_id', transcript_name_key='transcript_name', transcript_id_key='transcript_id')

Process an ATAC fragments Parquet file and generate a gene activity matrix.

Uses a paired-insertion counting strategy that matches SnapATAC2: each fragment contributes two insertions, one at its start position and one at its end position. If both insertions fall within the same gene’s regulatory domain, that gene’s count increases by 1. If they fall within different genes, each of those genes’ counts increases by 1.
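
The counting rule above can be illustrated with a minimal pure-Python sketch. It is not the library's GPU implementation: the `DOMAINS` table, the helper names, and the choice of half-open coordinates with insertions at `start` and `end - 1` are assumptions made for illustration, and counts are shown for a single cell.

```python
from collections import Counter

# Hypothetical regulatory domains: gene -> (chrom, start, end), half-open.
DOMAINS = {
    "GENE_A": ("chr1", 1000, 5000),
    "GENE_B": ("chr1", 8000, 12000),
}


def genes_at(chrom, pos, domains):
    """Return the set of genes whose domain contains the given position."""
    return {
        gene
        for gene, (g_chrom, g_start, g_end) in domains.items()
        if g_chrom == chrom and g_start <= pos < g_end
    }


def count_paired_insertions(fragments, domains):
    """Count every fragment at most once per gene, using both of its ends."""
    counts = Counter()
    for chrom, start, end in fragments:
        # Assumption: fragments are half-open, so the two insertion sites
        # are taken to be `start` and `end - 1`.
        hit = genes_at(chrom, start, domains) | genes_at(chrom, end - 1, domains)
        for gene in hit:
            counts[gene] += 1
    return counts


fragments = [
    ("chr1", 1200, 1400),  # both insertions in GENE_A: GENE_A +1
    ("chr1", 4900, 8100),  # one in GENE_A, one in GENE_B: each +1
    ("chr1", 6000, 6200),  # neither insertion in a domain: no count
]
counts = count_paired_insertions(fragments, DOMAINS)
print(dict(counts))
```

Taking the union of the genes hit by the two ends is what keeps a fragment from being counted twice for the same gene while still crediting both genes when the ends land in different domains.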

Parameters:

input_parquet str or Path: Path to the input Parquet file containing ATAC fragments.
gene_anno str or Path: Path to GTF/GFF gene annotation file.
output_path str or Path, optional: Path for output .h5ad file. If None, uses input filename.
id_type str: Feature type to use, either “gene” or “transcript” (default: “gene”).
upstream int: Base pairs upstream of TSS to include (default: 2000).
downstream int: Base pairs downstream of regulatory domain (default: 0).
include_gene_body bool: Whether to include the gene body in the regulatory domain (default: True).
min_fragments_per_cell int: Minimum fragments required per barcode (default: 100).
exclude_chroms list, optional: List of chromosomes to exclude (default: None).
metrics str, Path, or cudf.DataFrame, optional: Path to a CSV file or a cuDF DataFrame containing cell metrics for filtering.
filter_query str, optional: Query string for filtering cells based on metrics (e.g. “tsse_score > 5”).
barcode_prefix str, optional: Prefix to add to barcodes.
low_memory bool: Use low memory mode for Parquet reading (default: False).
cell_batch_size int: Number of cells to process per batch (default: 500). Lower values reduce GPU memory usage but may be slower.
gene_name_key str: Key for gene name in GTF attributes (default: “gene_name”).
gene_id_key str: Key for gene ID in GTF attributes (default: “gene_id”).
transcript_name_key str: Key for transcript name in GTF attributes (default: “transcript_name”).
transcript_id_key str: Key for transcript ID in GTF attributes (default: “transcript_id”).
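
As a rough illustration of how upstream, downstream, and include_gene_body could combine into a regulatory domain, the sketch below shows one plausible strand-aware reading. The function `regulatory_domain`, the 0-based half-open coordinates, and the strand handling are assumptions for illustration only; the library's actual interval arithmetic is not documented here and may differ.

```python
def regulatory_domain(gene_start, gene_end, strand,
                      upstream=2000, downstream=0, include_gene_body=True):
    """Return a half-open (start, end) domain for a 0-based half-open gene.

    Hypothetical helper: assumes the TSS is the gene start on '+' and the
    last base of the gene on '-', and that the domain is extended
    `upstream` bp before the TSS and `downstream` bp past its far end.
    """
    if strand == "+":
        tss = gene_start
        # Without the gene body, the domain core is just the TSS base.
        core_end = gene_end if include_gene_body else tss + 1
        start = tss - upstream
        end = core_end + downstream
    else:
        tss = gene_end - 1
        core_start = gene_start if include_gene_body else tss
        start = core_start - downstream
        end = tss + 1 + upstream
    # Clamp so the domain never runs off the start of the chromosome.
    return max(start, 0), end


plus_domain = regulatory_domain(10_000, 15_000, "+")
minus_domain = regulatory_domain(10_000, 15_000, "-")
promoter_only = regulatory_domain(10_000, 15_000, "+", include_gene_body=False)
print(plus_domain, minus_domain, promoter_only)
```

Under these assumptions, the defaults (upstream=2000, downstream=0, include_gene_body=True) cover the gene body plus a 2 kb promoter window on the strand-appropriate side of the TSS.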

Returns:

adata: AnnData object containing the gene activity matrix.

Return type:

sc.AnnData

Examples

>>> import gatac as ga
>>> adata_gene = ga.pp.make_gene_matrix(
...     "pbmc_filtered.parquet",
...     gene_anno="GRCh38.gtf.gz",
...     id_type="gene",
...     upstream=2000,
...     downstream=0,
...     include_gene_body=True,
... )
