gatac.pp.make_gene_matrix
- gatac.pp.make_gene_matrix(input_parquet, gene_anno, output_path=None, id_type='gene', upstream=2000, downstream=0, include_gene_body=True, min_fragments_per_cell=100, exclude_chroms=None, metrics=None, filter_query=None, barcode_prefix=None, low_memory=False, cell_batch_size=500, gene_name_key='gene_name', gene_id_key='gene_id', transcript_name_key='transcript_name', transcript_id_key='transcript_id')
Process an ATAC fragments Parquet file and generate a gene activity matrix.
Uses a paired-insertion counting strategy matching SnapATAC2: each fragment contributes one insertion at its start position and one at its end position. If both insertions fall within the same gene’s regulatory domain, that gene’s count increases by 1. If they fall in different genes, each gene’s count increases by 1.
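The counting rule above can be sketched in plain Python. This is an illustrative sketch only, not gatac's GPU implementation: the domain table, the helper names `genes_at` and `count_fragments`, the half-open interval convention, and placing the end insertion at `end - 1` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical regulatory domains: gene -> (chrom, start, end), half-open [start, end).
DOMAINS = {
    "GENE_A": ("chr1", 1000, 5000),
    "GENE_B": ("chr1", 6000, 9000),
}

def genes_at(chrom, pos):
    """Return the set of genes whose regulatory domain contains `pos`."""
    return {
        gene
        for gene, (c, start, end) in DOMAINS.items()
        if c == chrom and start <= pos < end
    }

def count_fragments(fragments):
    """Count fragments per gene with a paired-insertion rule.

    Each fragment (chrom, start, end) yields two insertions. Taking the
    union of the genes hit by either insertion means a gene is counted
    once per fragment, even when both insertions land in its domain.
    Assumption: a fragment with only one insertion inside a domain also
    adds 1 to that gene.
    """
    counts = Counter()
    for chrom, start, end in fragments:
        hit = genes_at(chrom, start) | genes_at(chrom, end - 1)
        for gene in hit:
            counts[gene] += 1
    return counts

fragments = [
    ("chr1", 1200, 1400),  # both insertions in GENE_A: GENE_A +1
    ("chr1", 4900, 6100),  # start in GENE_A, end in GENE_B: each +1
    ("chr1", 5200, 5600),  # neither insertion in a domain: no count
]
counts = count_fragments(fragments)
```

With these three fragments, `GENE_A` receives two counts and `GENE_B` one; the first fragment is counted once, not twice, even though both of its insertions hit `GENE_A`.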
- Parameters:
- input_parquet
str or Path Path to input Parquet file containing ATAC fragments.
- gene_anno
str or Path Path to GTF/GFF gene annotation file.
- output_path
str or Path, optional Path for output .h5ad file. If None, uses input filename.
- id_type
str “gene” or “transcript” - which feature type to use (default: “gene”).
- upstream
int Base pairs upstream of TSS to include (default: 2000).
- downstream
int Base pairs downstream of regulatory domain (default: 0).
- include_gene_body
bool Whether to include the gene body in the regulatory domain (default: True).
- min_fragments_per_cell
int Minimum fragments required per barcode (default: 100).
- exclude_chroms
list, optional List of chromosomes to exclude (default: None).
- metrics
str, Path, or cudf.DataFrame, optional Path to a CSV file or a cuDF DataFrame containing cell metrics for filtering.
- filter_query
str, optional Query string for filtering cells based on metrics (e.g. “tsse_score > 5”).
- barcode_prefix
str, optional Prefix to add to barcodes.
- low_memory
bool Use low-memory mode for Parquet reading (default: False).
- cell_batch_size
int Number of cells to process per batch (default: 500). Lower values reduce GPU memory usage but may be slower.
- gene_name_key
str Key for gene name in GTF attributes (default: “gene_name”).
- gene_id_key
str Key for gene ID in GTF attributes (default: “gene_id”).
- transcript_name_key
str Key for transcript name in GTF attributes (default: “transcript_name”).
- transcript_id_key
str Key for transcript ID in GTF attributes (default: “transcript_id”).
- Returns:
adata : AnnData AnnData object with gene activity matrix
- Return type:
sc.AnnData
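To make the interaction of `upstream`, `downstream`, and `include_gene_body` more concrete, here is a minimal, strand-aware sketch of how a regulatory domain could be derived from a gene's coordinates. The helper `regulatory_domain` is hypothetical and not part of gatac, and the exact way gatac applies `downstream` is not stated above; this sketch assumes it extends the domain past its 3' edge, and that coordinates are 0-based and half-open.

```python
def regulatory_domain(start, end, strand,
                      upstream=2000, downstream=0, include_gene_body=True):
    """Return (domain_start, domain_end) for a gene at [start, end).

    Hypothetical helper for illustration. The TSS is the first base on
    the '+' strand and the last base on the '-' strand. Without the gene
    body, the domain covers only the TSS plus the requested extensions.
    """
    if strand == "+":
        tss = start
        dom_start = tss - upstream
        dom_end = (end if include_gene_body else tss + 1) + downstream
    else:
        tss = end - 1
        dom_end = tss + 1 + upstream
        dom_start = (start if include_gene_body else tss) - downstream
    # Clamp so the domain never starts before the chromosome begins.
    return max(dom_start, 0), dom_end

plus_body = regulatory_domain(10_000, 15_000, "+")
plus_tss_only = regulatory_domain(10_000, 15_000, "+", include_gene_body=False)
minus_body = regulatory_domain(10_000, 15_000, "-")
```

Under these assumptions, a `+` strand gene at 10000-15000 with the defaults yields a domain starting 2000 bp before the gene start, while the same gene on the `-` strand is extended by 2000 bp past its end instead.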
Examples
>>> import gatac as ga
>>> adata_gene = ga.pp.make_gene_matrix(
...     "pbmc_filtered.parquet",
...     gene_anno="GRCh38.gtf.gz",
...     id_type="gene",
...     upstream=2000,
...     downstream=0,
...     include_gene_body=True,
... )
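The `min_fragments_per_cell` and `filter_query` parameters both select which barcodes are kept. The sketch below illustrates the idea with pandas standing in for cuDF; it does not call gatac. The column names `barcode` and `n_fragments` are hypothetical, and whether the fragment threshold is inclusive is an assumption.

```python
import pandas as pd

# Hypothetical per-cell metrics table; only "tsse_score" appears in the docs above.
metrics = pd.DataFrame(
    {
        "barcode": ["AAAC-1", "AAAG-1", "AACT-1"],
        "n_fragments": [850, 60, 1200],
        "tsse_score": [7.2, 9.1, 3.4],
    }
)

min_fragments_per_cell = 100
filter_query = "tsse_score > 5"

# Step 1: drop barcodes with too few fragments (inclusive threshold assumed).
enough_fragments = metrics[metrics["n_fragments"] >= min_fragments_per_cell]

# Step 2: apply the query string to the remaining rows.
kept = enough_fragments.query(filter_query)
kept_barcodes = kept["barcode"].tolist()
```

Here only `AAAC-1` survives: `AAAG-1` has too few fragments and `AACT-1` fails the `tsse_score` condition.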