gatac.pp.make_tile_matrix

gatac.pp.make_tile_matrix(input, chrom_sizes, output_path=None, tile_size=5000, min_fragments_per_cell=100, exclude_chroms=['chrM', 'chrY', 'M', 'Y'], metrics=None, filter_query=None, count_strategy='unique', barcode_prefix=None, low_memory=False, row_groups_per_batch=64)

Generate a tile matrix from fragments or interval-like feature matrices.

For fragment parquet input, row-groups are streamed in batches so the full file never needs to reside in GPU memory. Within each batch, tiles are computed for all chromosomes at once.
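The batching idea can be illustrated with a minimal sketch that only groups row-group indices. The helper name row_group_batches is hypothetical and is not part of gatac; how gatac reads each batch internally is not stated here. Index lists like these could be handed to a reader such as pyarrow's ParquetFile.read_row_groups, so that only one batch is loaded at a time.

```python
def row_group_batches(num_row_groups, row_groups_per_batch=64):
    """Yield lists of row-group indices, at most row_groups_per_batch each.

    Hypothetical helper for illustration only; not gatac's implementation.
    """
    if row_groups_per_batch < 1:
        raise ValueError("row_groups_per_batch must be at least 1")
    for start in range(0, num_row_groups, row_groups_per_batch):
        stop = min(start + row_groups_per_batch, num_row_groups)
        # Each batch is a contiguous run of row-group indices.
        yield list(range(start, stop))


# A file with 5 row-groups, read 2 at a time, gives three batches.
batches = list(row_group_batches(5, row_groups_per_batch=2))
```

A smaller row_groups_per_batch produces more, smaller batches, which is why decreasing it lowers peak memory use.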

For .h5ad or 10x .h5 input, features with interval-like names such as chr1:100-200 in var_names are detected and aggregated into fixed tiles by overlap.
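As a rough sketch of what overlap-based aggregation involves, the code below parses an interval-like name and lists the fixed tiles it touches. The helper names parse_interval and overlapping_tiles are hypothetical, and the half-open [start, end) coordinate convention is an assumption; gatac's actual detection rules and coordinate handling may differ.

```python
import re

# Matches names such as "chr1:100-200" (assumed format for this sketch).
_INTERVAL = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_interval(name):
    """Return (chrom, start, end) for an interval-like name, else None."""
    match = _INTERVAL.match(name)
    if match is None:
        return None
    return match["chrom"], int(match["start"]), int(match["end"])


def overlapping_tiles(start, end, tile_size):
    """Return indices of fixed-size tiles overlapped by [start, end).

    Assumes half-open coordinates; tile i covers
    [i * tile_size, (i + 1) * tile_size).
    """
    if end <= start:
        return []
    first = start // tile_size
    last = (end - 1) // tile_size
    return list(range(first, last + 1))


chrom, start, end = parse_interval("chr1:450-1050")
tiles = overlapping_tiles(start, end, tile_size=500)
```

In this sketch a feature spanning several tiles would contribute to each of them; how gatac distributes counts for features that straddle tile boundaries is not specified on this page.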

Parameters:

input str, Path, or AnnData: Fragment parquet path, interval-matrix .h5ad path, 10x .h5 path, or an in-memory AnnData object containing interval-like features in var_names.
chrom_sizes dict or str: Dictionary of chromosome names and their sizes, or a genome name (e.g., "hg38").
output_path str or Path, optional: Path for output .h5ad file. If None, the function returns the AnnData object without writing to disk.
tile_size int: Size of genomic bins in base pairs (default: 5000)
min_fragments_per_cell int: Minimum fragments required per barcode (default: 100)
exclude_chroms list, optional: List of chromosomes to exclude (default: ["chrM", "chrY", "M", "Y"])
metrics str, Path, or cudf.DataFrame, optional: Path to a CSV file or a cuDF DataFrame containing cell metrics for filtering.
filter_query str, optional: Query string for filtering cells based on metrics (e.g. "tsse_score > 5").
count_strategy str: Strategy for counting fragments in tiles (default: "unique"). Options:
    - "unique": Count each unique fragment once (SnapATAC2 default)
    - "count": Use PCR duplicate counts from the 'count' column
    - "binarize": Convert counts to binary (0/1) per tile
barcode_prefix str, optional: Prefix to add to barcodes
low_memory bool: Use smaller batch size for Parquet reading (default: False)
row_groups_per_batch int: Number of Parquet row-groups to read per GPU batch (default: 64). Decrease for lower GPU memory usage.
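The three count_strategy options can be compared on a toy example. This is a minimal pure-Python sketch of one reading of the descriptions above, not gatac's GPU implementation; the function count_tiles and the (barcode, tile, count) tuple layout are hypothetical.

```python
from collections import defaultdict


def count_tiles(fragments, strategy="unique"):
    """Aggregate (barcode, tile, dup_count) rows into per-cell tile counts.

    Hypothetical illustration of the three strategies; each input row is
    assumed to be one unique fragment with its PCR duplicate count.
    """
    counts = defaultdict(int)
    for barcode, tile, dup_count in fragments:
        key = (barcode, tile)
        if strategy == "unique":
            counts[key] += 1          # every unique fragment counts once
        elif strategy == "count":
            counts[key] += dup_count  # add up the duplicate counts
        elif strategy == "binarize":
            counts[key] = 1           # tile is either hit (1) or absent (0)
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return dict(counts)


# Cell "A" has two fragments in tile 0 and one in tile 1.
fragments = [("A", 0, 3), ("A", 0, 1), ("A", 1, 2)]
unique = count_tiles(fragments, "unique")
summed = count_tiles(fragments, "count")
binary = count_tiles(fragments, "binarize")
```

Here "unique" gives 2 and 1 for the two tiles, "count" gives 4 and 2, and "binarize" gives 1 and 1.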

Returns:

adata AnnData: AnnData object containing the tile matrix.

Return type:

ad.AnnData

Examples

>>> import gatac as ga
>>> # From a GATAC parquet fragment file, using a built-in genome
>>> adata = ga.pp.make_tile_matrix(
...     "pbmc_filtered.parquet",
...     chrom_sizes="hg38",
...     tile_size=500,
...     min_fragments_per_cell=200,
...     exclude_chroms=["chrM", "chrY"],
... )
>>> # Or with a custom chromosome-sizes dict
>>> adata = ga.pp.make_tile_matrix(
...     "pbmc_filtered.parquet",
...     chrom_sizes={"chr1": 248956422, "chr2": 242193529},
... )
>>> # From a 10x .h5 (or interval-like .h5ad) input
>>> adata = ga.pp.make_tile_matrix(
...     "filtered_peak_bc_matrix.h5",
...     chrom_sizes="hg38",
...     tile_size=500,
... )
