gatac.pp.make_gene_score_matrix

gatac.pp.make_gene_score_matrix(input_parquet, gene_anno, output_path=None, gene_model='exp(-abs(x)/5000) + exp(-1)', tile_size=500, extend_upstream=(1000, 100000), extend_downstream=(1000, 100000), gene_upstream=5000, gene_downstream=0, use_gene_boundaries=True, use_tss=False, ceiling=4, gene_scale_factor=5.0, scale_to=10000.0, exclude_chroms=('chrY', 'chrM'), min_fragments_per_cell=100, metrics=None, filter_query=None, barcode_prefix=None, low_memory=False, cell_batch_size=None, gene_name_key='gene_name', gene_id_key='gene_id')

GPU-accelerated ArchR-style gene activity score matrix.

Faithful port of ArchR addGeneScoreMatrix. See module docstring for the algorithm. Parameter defaults match ArchR’s defaults.

Parameters:
input_parquet str | Path

ATAC fragments parquet (columns: chrom, start, end, barcode, count).

gene_anno str | Path

GTF/GFF gene annotation, or a CSV with columns symbol, seqnames, start, end, strand.

gene_model Union[str, Callable]

ArchR geneModel: an expression string in x (signed distance to TSS), or a Python callable f(x)->weight.

tile_size int

Tile width in bp used to bin fragment counts (ArchR tileSize).

ceiling int

Maximum count allowed per tile; larger counts are capped at this value (ArchR ceiling).

scale_to float

Total to which each cell's gene scores are normalised (ArchR scaleTo).

gene_scale_factor float

Scaling factor for the gene-width weight, which weights genes by the inverse of their length (ArchR geneScaleFactor).

extend_upstream Tuple[int, int]

(min, max) extension in bp of the regulatory search window upstream of the gene.

extend_downstream Tuple[int, int]

(min, max) extension in bp of the regulatory search window downstream of the gene.

gene_upstream int

Number of bp by which the gene body is extended upstream before the model is applied.

gene_downstream int

Number of bp by which the gene body is extended downstream before the model is applied.

use_gene_boundaries bool

Clip windows so tiles cannot contribute across a neighbouring gene.

use_tss bool

Build the model on the 1 bp TSS position rather than on the gene body.

output_path Optional[str | Path]

exclude_chroms Optional[list]

min_fragments_per_cell int

metrics Optional[str | Path | 'cudf.DataFrame']

filter_query Optional[str]

barcode_prefix Optional[str]

low_memory bool

cell_batch_size Optional[int]

gene_name_key str

gene_id_key str

Returns:

AnnData of shape (cells, genes) with normalised gene scores.

Return type:

sc.AnnData
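
The roles of gene_model, ceiling and scale_to can be illustrated with a small NumPy sketch. It is not the library's implementation: the tile counts, the distances and the helper name default_gene_model are hypothetical. The gene-width weight (gene_scale_factor), the window extension and the boundary clipping are left out. The sketch assumes the ArchR order of operations: cap the tile counts, weight the tiles by distance, then rescale each cell.

```python
import numpy as np


def default_gene_model(x):
    """Callable form of the default string 'exp(-abs(x)/5000) + exp(-1)'.

    x is the signed distance in bp; the weight decays with |x| and
    never drops below exp(-1).
    """
    x = np.asarray(x, dtype=float)
    return np.exp(-np.abs(x) / 5000.0) + np.exp(-1.0)


# Hypothetical signed distances (bp) from 5 tiles to 2 genes: shape (genes, tiles).
distances = np.array([
    [-10_000, -5_000, 0, 5_000, 10_000],
    [-20_000, -15_000, -10_000, -5_000, 0],
])
weights = default_gene_model(distances)

# Hypothetical fragment counts per tile and cell: shape (tiles, cells).
tile_counts = np.array([
    [0, 2],
    [1, 0],
    [7, 3],
    [2, 1],
    [0, 5],
], dtype=float)

ceiling = 4          # same value as the documented default
scale_to = 10_000.0  # same value as the documented default

# Cap tile counts so that a single very deep tile cannot dominate a gene.
capped = np.minimum(tile_counts, ceiling)

# Distance-weighted sum of tiles for each gene: shape (genes, cells).
raw_scores = weights @ capped

# Rescale so that every cell's scores sum to scale_to.
scores = scale_to * raw_scores / raw_scores.sum(axis=0, keepdims=True)

# The documented return shape is (cells, genes), hence the transpose.
cells_by_genes = scores.T
```

Since gene_model is documented as accepting either an expression string in x or a callable f(x)->weight, a function such as default_gene_model could presumably be passed as gene_model in place of the default string.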