gatac.tl.make_peak_matrix

gatac.tl.make_peak_matrix(adata, parquet_path, *, use_rep='peaks', peak_file=None, genome='hg38', counting_strategy='paired-insertion', inplace=False, batch_size=50, verbose=True, filter_chromosomes=True)

Generate a cell-by-peak count matrix.

This function counts fragment insertions that overlap peak regions to create a cell × peak count matrix. It can process multiple parquet files and reads their row groups in batches to keep memory usage low.
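The batching idea can be sketched in a few lines. This is only an illustration of how row-group indices might be chunked according to batch_size; the helper name row_group_batches is hypothetical, and how gatac actually schedules its reads is not documented here.

```python
from typing import Iterator, List


def row_group_batches(num_row_groups: int, batch_size: int = 50) -> Iterator[List[int]]:
    """Yield consecutive lists of row-group indices, at most batch_size each.

    Hypothetical helper for illustration; not part of gatac.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, num_row_groups, batch_size):
        yield list(range(start, min(start + batch_size, num_row_groups)))


# A file with 120 row groups is covered by three batches.
batches = list(row_group_batches(num_row_groups=120, batch_size=50))

# Each batch could be handed to a reader such as
# pyarrow.parquet.ParquetFile.read_row_groups(batch), so that only one batch
# is held in memory at a time. Whether gatac reads files this way is an assumption.
print([len(b) for b in batches])  # [50, 50, 20]
```

Under this scheme a larger batch_size means fewer, larger reads, which is consistent with the speed and memory trade-off described for the batch_size parameter.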

Parameters:
adata AnnData

The annotated data matrix (typically a tiled matrix). If a source_file column is present in .obs, it is used to map cells to parquet files.

parquet_path Union[str, Path, list[Union[str, Path]]]

Path to a parquet file, a directory containing parquet files, or a list of parquet file paths.

- If a single file: all cells are read from this file.
- If a directory: the source_file column in .obs is used to find the specific files.
- If a list of files: all files are processed, matching barcodes to cells.

use_rep str

Key in .uns containing peak information. The peaks can be a DataFrame with ‘chrom’, ‘start’, ‘end’ columns.

peak_file Optional[Path]

BED file containing peaks. If provided, peak information will be read from this file instead of use_rep.

genome Union[str, dict]

Genome name (e.g., ‘hg38’, ‘mm10’) or dict of chromosome sizes. Used for validation and chromosome filtering.

counting_strategy str

Counting strategy for peak matrix generation. "paired-insertion" counts Tn5 insertions at fragment ends and counts a fragment once if both ends land in the same peak; this matches SnapATAC2’s default. "insertion" counts each insertion separately, so a fragment with both ends in a peak contributes 2. "fragment" is the legacy mode that counts any fragment overlapping the peak.

inplace bool

Whether to store the peak matrix in the input AnnData object (not recommended, because this replaces .X). If False, returns a new AnnData object.

batch_size int

Number of parquet row groups to load at once (default: 50). Larger values are faster but use more GPU memory.

verbose Union[bool, str]

Controls progress reporting. False disables progress output. True or "tqdm" shows a tqdm progress bar. "log" prints one line per file with ETA for non-interactive environments.

filter_chromosomes bool

If True (default), keep only fragments from chromosomes recognized by the provided genome: the keys if genome is a dict, or the standard chromosomes for a named genome.
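As a rough illustration of the dict case, the sketch below keeps only fragments whose chromosome is a key of a chromosome-size dict. The function filter_fragments and the tuple layout (chrom, start, end) are assumptions made for this example, not gatac internals, and "chrUn_scaffold" is a made-up contig name.

```python
from typing import Dict, List, Tuple

Fragment = Tuple[str, int, int]  # (chrom, start, end); layout assumed for illustration


def filter_fragments(fragments: List[Fragment], genome: Dict[str, int]) -> List[Fragment]:
    """Keep fragments whose chromosome appears as a key in `genome`.

    Toy stand-in for the behaviour described for filter_chromosomes=True
    when `genome` is a dict of chromosome sizes.
    """
    allowed = set(genome)
    return [frag for frag in fragments if frag[0] in allowed]


# Chromosome sizes for two hg38 chromosomes.
chrom_sizes = {"chr1": 248_956_422, "chr2": 242_193_529}

fragments = [
    ("chr1", 100, 250),
    ("chrUn_scaffold", 10, 90),  # hypothetical unplaced contig, dropped
    ("chr2", 500, 640),
    ("chrM", 5, 120),            # not in chrom_sizes, dropped
]

kept = filter_fragments(fragments, chrom_sizes)
print(kept)  # [('chr1', 100, 250), ('chr2', 500, 640)]
```

A dict of this shape can be passed as genome instead of a name such as "hg38" when only a subset of chromosomes should contribute counts.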

Returns:

AnnData or None

If inplace=False, returns a new AnnData object with the peak matrix. Otherwise returns None and modifies adata in place.

Return type:

Optional[‘AnnData’]

Examples

>>> import gatac
>>> # After calling peaks and merging
>>> gatac.tl.merge_peaks(adata, use_rep="gmacs", key_added="peaks")
>>> # Create peak matrix from a single file
>>> peak_mat = gatac.tl.make_peak_matrix(
...     adata,
...     parquet_path="/path/to/fragments.parquet",
...     use_rep="peaks",
...     inplace=False,
... )
>>> # Create peak matrix from multiple files
>>> peak_mat = gatac.tl.make_peak_matrix(
...     adata,
...     parquet_path=["/path/to/sample1.parquet", "/path/to/sample2.parquet"],
...     use_rep="peaks",
...     inplace=False,
... )
>>> # Create peak matrix from directory (uses source_file obs)
>>> peak_mat = gatac.tl.make_peak_matrix(
...     adata,
...     parquet_path="/path/to/parquet_dir/",
...     use_rep="peaks",
...     inplace=False,
... )
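
The three counting_strategy values can be compared on a toy example. The sketch below is not gatac's implementation: count_in_peak is a hypothetical helper, and it assumes half-open fragment intervals with insertion sites at start and end - 1, which may differ from the coordinate convention gatac uses.

```python
from typing import List, Tuple


def count_in_peak(
    fragments: List[Tuple[int, int]],
    peak: Tuple[int, int],
    strategy: str = "paired-insertion",
) -> int:
    """Count fragments or insertions for one peak on one chromosome (toy version)."""
    p_start, p_end = peak
    total = 0
    for f_start, f_end in fragments:
        # Assumed convention: half-open intervals, insertions at both fragment ends.
        left_in = p_start <= f_start < p_end
        right_in = p_start <= f_end - 1 < p_end
        if strategy == "insertion":
            # Every insertion inside the peak counts separately.
            total += int(left_in) + int(right_in)
        elif strategy == "paired-insertion":
            # Both ends in the same peak still count only once.
            total += int(left_in or right_in)
        elif strategy == "fragment":
            # Any overlap between fragment and peak counts once.
            total += int(f_start < p_end and f_end > p_start)
        else:
            raise ValueError(f"unknown counting strategy: {strategy!r}")
    return total


peak = (100, 200)
fragments = [
    (120, 180),  # both ends inside the peak
    (110, 190),  # both ends inside the peak
    (50, 150),   # only the right end inside
    (50, 250),   # spans the peak, neither end inside
    (300, 400),  # no overlap
]

for strategy in ("paired-insertion", "insertion", "fragment"):
    print(strategy, count_in_peak(fragments, peak, strategy))
# paired-insertion 3
# insertion 5
# fragment 4
```

The fragment that spans the peak without either end landing in it is counted only by "fragment", while fragments with both ends in the peak are what separate "insertion" from "paired-insertion".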