gatac.pp.make_tile_matrix
- gatac.pp.make_tile_matrix(input, chrom_sizes, output_path=None, tile_size=5000, min_fragments_per_cell=100, exclude_chroms=['chrM', 'chrY', 'M', 'Y'], metrics=None, filter_query=None, count_strategy='unique', barcode_prefix=None, low_memory=False, row_groups_per_batch=64)
Generate a tile matrix from fragments or interval-like feature matrices.
For fragment parquet input, row-groups are streamed in batches so the full file never needs to reside in GPU memory. Within each batch, tiles are computed for all chromosomes at once.
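The batching described above can be pictured with a minimal sketch. It is not GATAC's implementation: `iter_row_group_batches` is a hypothetical helper that only shows how row-group indices might be grouped into chunks of `row_groups_per_batch`, so that one chunk at a time is read and processed.

```python
from typing import Iterator, List


def iter_row_group_batches(
    num_row_groups: int, row_groups_per_batch: int = 64
) -> Iterator[List[int]]:
    """Yield consecutive lists of row-group indices, one list per batch.

    Hypothetical helper for illustration; not part of the gatac API.
    """
    if row_groups_per_batch < 1:
        raise ValueError("row_groups_per_batch must be >= 1")
    for start in range(0, num_row_groups, row_groups_per_batch):
        stop = min(start + row_groups_per_batch, num_row_groups)
        yield list(range(start, stop))


# A file with 150 row-groups is covered by three batches of 64, 64, and 22.
batches = list(iter_row_group_batches(150, row_groups_per_batch=64))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [64, 64, 22]
```

With pyarrow, each list of indices could be passed to `pyarrow.parquet.ParquetFile.read_row_groups`; whether GATAC reads its batches this way is an assumption. A smaller `row_groups_per_batch` yields more, smaller batches, which is why lowering it reduces peak memory.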
For .h5ad or 10x .h5 input, features with interval-like names such as chr1:100-200 in var_names are detected and aggregated into fixed tiles by overlap.
- Parameters:
- input
str, Path, or AnnData Fragment parquet path, interval-matrix .h5ad path, 10x .h5 path, or an in-memory AnnData object containing interval-like features in var_names.
- chrom_sizes
dict or str Dictionary of chromosome names and their sizes, or a genome name (e.g., "hg38").
- output_path
str or Path, optional Path for the output .h5ad file. If None, the function returns the AnnData object without writing to disk.
- tile_size
int Size of genomic bins in base pairs (default: 5000)
- min_fragments_per_cell
int Minimum fragments required per barcode (default: 100)
- exclude_chroms
list, optional List of chromosomes to exclude (default: ["chrM", "chrY", "M", "Y"])
- metrics
str, Path, or cudf.DataFrame, optional Path to a CSV file or a cuDF DataFrame containing cell metrics for filtering.
- filter_query
str, optional Query string for filtering cells based on metrics (e.g. "tsse_score > 5").
- count_strategy
str Strategy for counting fragments in tiles (default: "unique"). Options:
  - "unique": Count each unique fragment once (SnapATAC2 default)
  - "count": Use PCR duplicate counts from the 'count' column
  - "binarize": Convert counts to binary (0/1) per tile
- barcode_prefix
str, optional Prefix to add to barcodes
- low_memory
bool Use smaller batch size for Parquet reading (default: False)
- row_groups_per_batch
int Number of Parquet row-groups to read per GPU batch (default: 64). Decrease for lower GPU memory usage.
- Returns:
adata : AnnData AnnData object with tile matrix
- Return type:
ad.AnnData
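The three count_strategy options can be compared on a toy input. The sketch below is an illustration under stated assumptions, not GATAC's code: `tile_counts` is a hypothetical function, fragments belong to a single cell and chromosome, and each fragment is assigned to the tile containing its start coordinate (how GATAC assigns fragments to tiles is not stated here).

```python
from collections import Counter

# Toy fragments for one cell on one chromosome: (start, end, pcr_duplicate_count).
fragments = [
    (120, 300, 1),
    (4_900, 5_100, 3),
    (5_200, 5_400, 2),
    (12_000, 12_150, 1),
]
tile_size = 5000


def tile_counts(fragments, tile_size, count_strategy="unique"):
    """Return {tile_index: value} for one cell.

    Assumption: a fragment is assigned to the tile that contains its start.
    """
    counts = Counter()
    for start, _end, duplicates in fragments:
        tile = start // tile_size
        if count_strategy == "unique":
            counts[tile] += 1           # each fragment counts once
        elif count_strategy == "count":
            counts[tile] += duplicates  # add the PCR duplicate count
        elif count_strategy == "binarize":
            counts[tile] = 1            # presence/absence only
        else:
            raise ValueError(f"unknown count_strategy: {count_strategy!r}")
    return dict(counts)


unique = tile_counts(fragments, tile_size, "unique")
counted = tile_counts(fragments, tile_size, "count")
binary = tile_counts(fragments, tile_size, "binarize")
print(unique)   # {0: 2, 1: 1, 2: 1}
print(counted)  # {0: 4, 1: 2, 2: 1}
print(binary)   # {0: 1, 1: 1, 2: 1}
```

The strategies differ only in what each fragment contributes to its tile: one, its duplicate count, or a presence flag.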
Examples
>>> import gatac as ga
>>> # From a GATAC parquet fragment file, using a built-in genome
>>> adata = ga.pp.make_tile_matrix(
...     "pbmc_filtered.parquet",
...     chrom_sizes="hg38",
...     tile_size=500,
...     min_fragments_per_cell=200,
...     exclude_chroms=["chrM", "chrY"],
... )
>>> # Or with a custom chromosome-sizes dict
>>> adata = ga.pp.make_tile_matrix(
...     "pbmc_filtered.parquet",
...     chrom_sizes={"chr1": 248956422, "chr2": 242193529},
... )
>>> # From a 10x .h5 (or interval-like .h5ad) input
>>> adata = ga.pp.make_tile_matrix(
...     "filtered_peak_bc_matrix.h5",
...     chrom_sizes="hg38",
...     tile_size=500,
... )
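For interval-matrix input, the detection of interval-like feature names and their aggregation into tiles by overlap might look roughly like the sketch below. The names `parse_interval` and `overlapping_tiles` are hypothetical, the coordinates are treated as half-open, and a feature's value is added in full to every tile it overlaps. All three are assumptions made for illustration and may differ from what GATAC does.

```python
import re
from collections import defaultdict

# Matches names such as "chr1:100-200"; anything else is not interval-like.
INTERVAL_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_interval(name):
    """Return (chrom, start, end) for an interval-like name, else None."""
    match = INTERVAL_RE.match(name)
    if match is None:
        return None
    return match["chrom"], int(match["start"]), int(match["end"])


def overlapping_tiles(start, end, tile_size):
    """Tile indices overlapped by the half-open interval [start, end)."""
    return range(start // tile_size, (end - 1) // tile_size + 1)


# Toy per-feature values for one cell; "GeneA" is not interval-like.
features = {"chr1:100-200": 3, "chr1:450-620": 2, "GeneA": 7}
tile_size = 500

tile_totals = defaultdict(int)
for name, value in features.items():
    parsed = parse_interval(name)
    if parsed is None:
        continue  # skip features that are not genomic intervals
    chrom, start, end = parsed
    for tile in overlapping_tiles(start, end, tile_size):
        tile_totals[(chrom, tile)] += value

tile_totals = dict(tile_totals)
print(tile_totals)  # {('chr1', 0): 5, ('chr1', 1): 2}
```

Here chr1:450-620 spans the boundary at position 500, so it contributes to both tile 0 and tile 1.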