gatac.tl.make_peak_matrix


gatac.tl.make_peak_matrix(adata, parquet_path, *, use_rep='peaks', peak_file=None, genome='hg38', counting_strategy='paired-insertion', inplace=False, batch_size=50, verbose=True, filter_chromosomes=True)

Generate a cell-by-peak count matrix.

This function counts fragment insertions that overlap peak regions to create a cell × peak count matrix. It can process multiple parquet files and reads their row groups in batches to limit memory usage.
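The batched row-group reading mentioned above can be pictured with the following minimal sketch. It is not gatac's implementation: the helper names `row_group_batches` and `iter_fragment_batches` are hypothetical, the sketch assumes the fragment files can be opened with pyarrow's `ParquetFile`, and it reads into host memory only.

```python
from typing import Iterator, List


def row_group_batches(n_row_groups: int, batch_size: int = 50) -> Iterator[List[int]]:
    """Yield consecutive lists of row-group indices, at most batch_size per list."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, n_row_groups, batch_size):
        yield list(range(start, min(start + batch_size, n_row_groups)))


def iter_fragment_batches(path: str, batch_size: int = 50):
    """Yield one pyarrow Table per batch of row groups (hypothetical helper)."""
    import pyarrow.parquet as pq  # imported lazily; only needed when reading files

    parquet_file = pq.ParquetFile(path)
    for indices in row_group_batches(parquet_file.num_row_groups, batch_size):
        # Only the row groups in this batch are materialized at once.
        yield parquet_file.read_row_groups(indices)
```

Under these assumptions, a file with 120 row groups and a batch size of 50 would be read in three batches of 50, 50, and 20 row groups.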

Parameters:

adata AnnData: The annotated data matrix (typically a tiled matrix). If .obs contains a source_file column, it is used to map cells to parquet files.
parquet_path Union[str, Path, list[Union[str, Path]]]: Path to parquet file, directory containing parquet files, or a list of parquet file paths.
    - If a single file: all cells are read from this file.
    - If a directory: uses source_file obs to find specific files.
    - If a list of files: processes all files, matching barcodes to cells.
use_rep str: Key in .uns containing peak information. The peaks can be a DataFrame with ‘chrom’, ‘start’, ‘end’ columns.
peak_file Optional[Path]: BED file containing peaks. If provided, peak information will be read from this file instead of use_rep.
genome Union[str, dict]: Genome name (e.g., ‘hg38’, ‘mm10’) or dict of chromosome sizes. Used for validation and chromosome filtering.
counting_strategy str: Counting strategy for peak matrix generation. "paired-insertion" counts Tn5 insertions at fragment ends and counts a fragment once if both ends land in the same peak; this matches SnapATAC2’s default. "insertion" counts each insertion separately, so a fragment with both ends in a peak contributes 2. "fragment" is the legacy mode that counts any fragment overlapping the peak.
inplace bool: Whether to add the peak matrix to the AnnData object. Setting this to True is not recommended because it replaces .X. If False, returns a new AnnData object.
batch_size int: Number of parquet row groups to load at once (default: 50). Larger values are faster but use more GPU memory.
verbose Union[bool, str]: Controls progress reporting. False disables progress output. True or "tqdm" shows a tqdm progress bar. "log" prints one line per file with ETA for non-interactive environments.
filter_chromosomes bool: If True, only keep fragments from chromosomes recognized by the provided genome: the dict keys if genome is a dict, or the standard chromosomes for a named genome (default: True).
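The three counting_strategy values can be compared on a single fragment and a single peak. The sketch below is an illustration of the behavior described for that parameter, not gatac's code. The function name `count_fragment` is hypothetical, and it assumes half-open BED-style intervals with the two insertion sites at `frag_start` and `frag_end - 1`.

```python
def count_fragment(frag_start, frag_end, peak_start, peak_end, strategy="paired-insertion"):
    """Return the count one fragment contributes to one peak (illustrative only)."""
    # Assumption: insertion sites are the first and last base of the fragment.
    left, right = frag_start, frag_end - 1
    left_in = peak_start <= left < peak_end
    right_in = peak_start <= right < peak_end

    if strategy == "insertion":
        # Each insertion counts separately, so both ends in the peak gives 2.
        return int(left_in) + int(right_in)
    if strategy == "paired-insertion":
        # Both ends in the same peak still count only once.
        return 1 if (left_in or right_in) else 0
    if strategy == "fragment":
        # Legacy mode: any overlap between fragment and peak counts once.
        return int(frag_start < peak_end and frag_end > peak_start)
    raise ValueError(f"unknown counting strategy: {strategy!r}")


peak = (100, 200)
inside = count_fragment(120, 180, *peak, strategy="insertion")    # both ends in the peak
spanning = count_fragment(50, 300, *peak, strategy="fragment")    # covers the peak, ends outside
```

A fragment that spans the whole peak without either end inside it is the case where the strategies differ most: under these assumptions it contributes 1 in "fragment" mode and 0 in both insertion-based modes.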

Returns:

AnnData or None: If inplace=False, returns a new AnnData object with the peak matrix. Otherwise returns None and modifies adata in place.

Return type:

Optional[‘AnnData’]

Examples

>>> import gatac
>>> # After calling peaks and merging
>>> gatac.tl.merge_peaks(adata, use_rep="gmacs", key_added="peaks")
>>> # Create peak matrix from a single file
>>> peak_mat = gatac.tl.make_peak_matrix(
...     adata,
...     parquet_path="/path/to/fragments.parquet",
...     use_rep="peaks",
...     inplace=False,
... )
>>> # Create peak matrix from multiple files
>>> peak_mat = gatac.tl.make_peak_matrix(
...     adata,
...     parquet_path=["/path/to/sample1.parquet", "/path/to/sample2.parquet"],
...     use_rep="peaks",
...     inplace=False,
... )
>>> # Create peak matrix from directory (uses source_file obs)
>>> peak_mat = gatac.tl.make_peak_matrix(
...     adata,
...     parquet_path="/path/to/parquet_dir/",
...     use_rep="peaks",
...     inplace=False,
... )
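The genome parameter also accepts a dict of chromosome sizes, which is described as being used for validation and chromosome filtering. As a rough illustration of what such a check could look like, the sketch below splits peaks into those that fit a chromosome-size dict and those that do not. The function name `split_peaks_by_genome` and the toy chromosome sizes are assumptions made for this example; gatac's own validation may differ.

```python
def split_peaks_by_genome(peaks, chrom_sizes):
    """Split (chrom, start, end) peaks into kept and dropped lists (illustrative only)."""
    kept, dropped = [], []
    for chrom, start, end in peaks:
        size = chrom_sizes.get(chrom)
        # Keep a peak only if its chromosome is known and it lies within bounds.
        if size is not None and 0 <= start < end <= size:
            kept.append((chrom, start, end))
        else:
            dropped.append((chrom, start, end))
    return kept, dropped


# Toy sizes, not real genome coordinates.
chrom_sizes = {"chr1": 1000, "chr2": 800}
peaks = [("chr1", 100, 200), ("chr2", 700, 900), ("chrUn_x", 10, 50)]
kept, dropped = split_peaks_by_genome(peaks, chrom_sizes)
```

Here only the chr1 peak is kept: the chr2 peak runs past the end of its chromosome, and chrUn_x is not a key of the dict.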
