gatac.pp.make_tile_matrix


gatac.pp.make_tile_matrix(input, chrom_sizes, output_path=None, tile_size=5000, min_fragments_per_cell=100, exclude_chroms=['chrM', 'chrY', 'M', 'Y'], metrics=None, filter_query=None, count_strategy='unique', barcode_prefix=None, low_memory=False, row_groups_per_batch=64)

Generate a tile matrix from fragments or interval-like feature matrices.

For fragment parquet input, row-groups are streamed in batches so the full file never needs to reside in GPU memory. Within each batch, tiles are computed for all chromosomes at once.
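The batching described above can be sketched in plain Python. This is only an illustration of splitting row-groups into fixed-size batches, not gatac's actual implementation: `iter_row_group_batches` is a hypothetical helper, and the pyarrow call mentioned in the comment is one way such a batch could be read.

```python
from typing import Iterator, List


def iter_row_group_batches(
    num_row_groups: int, row_groups_per_batch: int = 64
) -> Iterator[List[int]]:
    """Yield consecutive lists of row-group indices, at most
    `row_groups_per_batch` indices per list (hypothetical helper)."""
    if row_groups_per_batch < 1:
        raise ValueError("row_groups_per_batch must be >= 1")
    for start in range(0, num_row_groups, row_groups_per_batch):
        stop = min(start + row_groups_per_batch, num_row_groups)
        yield list(range(start, stop))


# A file with 150 row-groups is split into three batches: 64 + 64 + 22.
# With pyarrow, each batch could then be loaded with
# pyarrow.parquet.ParquetFile(path).read_row_groups(batch),
# so only one batch of fragments is held in memory at a time.
batches = list(iter_row_group_batches(150, row_groups_per_batch=64))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # [64, 64, 22]
```

A smaller `row_groups_per_batch` yields more, smaller batches, which is why lowering it reduces peak memory use at the cost of more read calls.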

For .h5ad or 10x .h5 input, features with interval-like names such as chr1:100-200 in var_names are detected and aggregated into fixed tiles by overlap.
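As a rough sketch of what detecting interval-like names and assigning them to fixed tiles could look like, the snippet below parses names such as chr1:100-200 and maps each feature to every tile it overlaps. The helper names are hypothetical, intervals are assumed to be half-open `[start, end)`, and how gatac itself treats a feature that spans two tiles is not stated in this page.

```python
import re
from collections import defaultdict

# Matches names of the form "<chrom>:<start>-<end>".
_INTERVAL = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_interval(name):
    """Return (chrom, start, end) for an interval-like name, else None."""
    m = _INTERVAL.match(name)
    if m is None:
        return None
    chrom, start, end = m.group(1), int(m.group(2)), int(m.group(3))
    return (chrom, start, end) if end > start else None


def overlapping_tiles(start, end, tile_size):
    """Indices of fixed-size tiles overlapped by [start, end) (assumed half-open)."""
    return range(start // tile_size, (end - 1) // tile_size + 1)


def map_features_to_tiles(var_names, tile_size=5000):
    """Map (chrom, tile_index) -> list of feature positions that overlap it."""
    tiles = defaultdict(list)
    for idx, name in enumerate(var_names):
        parsed = parse_interval(name)
        if parsed is None:
            continue  # not interval-like, e.g. a gene symbol
        chrom, start, end = parsed
        for tile in overlapping_tiles(start, end, tile_size):
            tiles[(chrom, tile)].append(idx)
    return dict(tiles)


var_names = ["chr1:100-200", "chr1:4900-5100", "chr2:12000-12500", "GAPDH"]
mapping = map_features_to_tiles(var_names, tile_size=5000)
print(mapping)
# {('chr1', 0): [0, 1], ('chr1', 1): [1], ('chr2', 2): [2]}
```

Here the second feature straddles the boundary at 5000 and is therefore listed under both tile 0 and tile 1 of chr1, while the non-interval name is skipped.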

Parameters:
input str, Path, or AnnData

Fragment parquet path, interval-matrix .h5ad path, 10x .h5 path, or an in-memory AnnData object containing interval-like features in var_names.

chrom_sizes dict or str

Dictionary mapping chromosome names to their sizes, or a genome name (e.g., "hg38").

output_path str or Path, optional

Path for output .h5ad file. If None, the function returns the AnnData object without writing to disk.

tile_size int

Size of genomic bins in base pairs (default: 5000)

min_fragments_per_cell int

Minimum fragments required per barcode (default: 100)

exclude_chroms list, optional

List of chromosomes to exclude (default: ["chrM", "chrY", "M", "Y"]).

metrics str, Path, or cudf.DataFrame, optional

Path to a CSV file or a cuDF DataFrame containing cell metrics for filtering.

filter_query str, optional

Query string for filtering cells based on metrics (e.g., "tsse_score > 5").

count_strategy str

Strategy for counting fragments in tiles (default: "unique"). Options:

- "unique": Count each unique fragment once (SnapATAC2 default).
- "count": Use PCR duplicate counts from the 'count' column.
- "binarize": Convert counts to binary (0/1) per tile.

barcode_prefix str, optional

Prefix to add to barcodes

low_memory bool

Use a smaller batch size when reading Parquet input (default: False).

row_groups_per_batch int

Number of Parquet row-groups to read per GPU batch (default: 64). Decrease for lower GPU memory usage.

Returns:

adata : AnnData

AnnData object containing the tile matrix.

Return type:

ad.AnnData

Examples

>>> import gatac as ga
>>> # From a GATAC parquet fragment file, using a built-in genome
>>> adata = ga.pp.make_tile_matrix(
...     "pbmc_filtered.parquet",
...     chrom_sizes="hg38",
...     tile_size=500,
...     min_fragments_per_cell=200,
...     exclude_chroms=["chrM", "chrY"],
... )
>>> # Or with a custom chromosome-sizes dict
>>> adata = ga.pp.make_tile_matrix(
...     "pbmc_filtered.parquet",
...     chrom_sizes={"chr1": 248956422, "chr2": 242193529},
... )
>>> # From a 10x .h5 (or interval-like .h5ad) input
>>> adata = ga.pp.make_tile_matrix(
...     "filtered_peak_bc_matrix.h5",
...     chrom_sizes="hg38",
...     tile_size=500,
... )
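To make the difference between the three count_strategy options concrete, here is a small pure-Python sketch. It is not gatac's implementation: `count_tiles` is a hypothetical helper, and the input is assumed to be one row per unique fragment as `(barcode, tile_index, duplicate_count)`, where `duplicate_count` stands in for the 'count' column.

```python
from collections import Counter


def count_tiles(fragments, strategy="unique"):
    """Aggregate fragments into per-(barcode, tile) values (hypothetical helper)."""
    counts = Counter()
    for barcode, tile, dup_count in fragments:
        key = (barcode, tile)
        if strategy == "unique":
            counts[key] += 1          # each unique fragment counts once
        elif strategy == "count":
            counts[key] += dup_count  # add the PCR duplicate count
        elif strategy == "binarize":
            counts[key] = 1           # presence/absence only
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return dict(counts)


# Assumed toy input: (barcode, tile_index, duplicate_count)
fragments = [
    ("AAAC", 0, 1),
    ("AAAC", 0, 3),
    ("AAAC", 1, 2),
    ("TTTG", 0, 1),
]

unique = count_tiles(fragments, "unique")
dup = count_tiles(fragments, "count")
binary = count_tiles(fragments, "binarize")

print(unique[("AAAC", 0)], dup[("AAAC", 0)], binary[("AAAC", 0)])  # 2 4 1
```

For barcode AAAC in tile 0, the two fragments give 2 under "unique", 1 + 3 = 4 under "count", and 1 under "binarize".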