gatac.tl.make_peak_matrix
- gatac.tl.make_peak_matrix(adata, parquet_path, *, use_rep='peaks', peak_file=None, genome='hg38', counting_strategy='paired-insertion', inplace=False, batch_size=50, verbose=True, filter_chromosomes=True)
Generate a cell-by-peak count matrix.
This function counts fragment insertions that overlap peak regions to create a cell × peak count matrix. It processes multiple parquet files by reading row groups in batches, which keeps memory usage low.
- Parameters:
- adata AnnData
The annotated data matrix (typically a tiled matrix). If source_file is in obs, it’s used to map cells to parquet files.
- parquet_path Union[str, Path, list[Union[str, Path]]]
Path to a parquet file, a directory containing parquet files, or a list of parquet file paths.
  - If a single file: all cells are read from this file.
  - If a directory: uses source_file obs to find specific files.
  - If a list of files: processes all files, matching barcodes to cells.
- use_rep str
Key in .uns containing peak information. The peaks can be a DataFrame with ‘chrom’, ‘start’, ‘end’ columns.
- peak_file Optional[Path]
BED file containing peaks. If provided, peak information will be read from this file instead of use_rep.
- genome Union[str, dict]
Genome name (e.g., ‘hg38’, ‘mm10’) or dict of chromosome sizes. Used for validation and chromosome filtering.
- counting_strategy str
Counting strategy for peak matrix generation.
  - "paired-insertion" counts Tn5 insertions at fragment ends and counts a fragment once if both ends land in the same peak; this matches SnapATAC2’s default.
  - "insertion" counts each insertion separately, so a fragment with both ends in a peak contributes 2.
  - "fragment" is the legacy mode that counts any fragment overlapping the peak.
- inplace bool
Whether to add the peak matrix to the AnnData object (not recommended, will replace .X). If False, returns a new AnnData object.
- batch_size int
Number of parquet row groups to load at once (default: 50). Larger values are faster but use more GPU memory.
- verbose Union[bool, str]
Controls progress reporting.
  - False disables progress output.
  - True or "tqdm" shows a tqdm progress bar.
  - "log" prints one line per file with ETA for non-interactive environments.
- filter_chromosomes bool
If True, only keep fragments from chromosomes recognized by the provided genome (keys if genome is a dict, or standard chromosomes for the named genome) (default: True).
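The three counting_strategy values are easiest to compare on a toy example. The following is a minimal pure-Python sketch, not gatac's implementation: the helper name count_in_peak is hypothetical, and it assumes half-open (start, end) coordinates with the two Tn5 insertions located at start and end - 1.

```python
def count_in_peak(fragment, peak, strategy="paired-insertion"):
    """Contribution of one fragment to one peak (illustrative only).

    Assumes half-open intervals and insertion sites at start and end - 1;
    the real library may use a different coordinate convention.
    """
    f_start, f_end = fragment
    p_start, p_end = peak

    if strategy == "fragment":
        # Legacy mode: any overlap between fragment and peak counts once.
        return int(f_start < p_end and f_end > p_start)

    # Insertion-based modes only look at the two fragment ends.
    left_in = p_start <= f_start < p_end
    right_in = p_start <= f_end - 1 < p_end
    hits = int(left_in) + int(right_in)

    if strategy == "insertion":
        # Each insertion is counted separately (0, 1, or 2).
        return hits
    if strategy == "paired-insertion":
        # Both ends in the same peak still count only once.
        return min(hits, 1)
    raise ValueError(f"unknown counting strategy: {strategy!r}")


peak = (100, 200)
fragments = [
    (120, 180),  # both ends inside the peak
    (50, 150),   # only the right end inside the peak
    (50, 250),   # spans the peak, neither end inside
    (300, 400),  # no overlap at all
]

totals = {
    strategy: sum(count_in_peak(f, peak, strategy) for f in fragments)
    for strategy in ("paired-insertion", "insertion", "fragment")
}
print(totals)
```

In this toy case the spanning fragment only contributes under "fragment", while the fully contained fragment contributes 2 under "insertion" and 1 under the other two strategies.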
- Returns:
AnnData or None. If inplace=False, returns a new AnnData object with the peak matrix. Otherwise returns None and modifies adata in place.
- Return type:
Optional[‘AnnData’]
Examples
>>> import gatac
>>> # After calling peaks and merging
>>> gatac.tl.merge_peaks(adata, use_rep="gmacs", key_added="peaks")
>>> # Create peak matrix from a single file
>>> peak_mat = gatac.tl.make_peak_matrix(
...     adata,
...     parquet_path="/path/to/fragments.parquet",
...     use_rep="peaks",
...     inplace=False,
... )
>>> # Create peak matrix from multiple files
>>> peak_mat = gatac.tl.make_peak_matrix(
...     adata,
...     parquet_path=["/path/to/sample1.parquet", "/path/to/sample2.parquet"],
...     use_rep="peaks",
...     inplace=False,
... )
>>> # Create peak matrix from directory (uses source_file obs)
>>> peak_mat = gatac.tl.make_peak_matrix(
...     adata,
...     parquet_path="/path/to/parquet_dir/",
...     use_rep="peaks",
...     inplace=False,
... )
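The genome parameter also accepts a dict of chromosome sizes, and with filter_chromosomes=True only fragments on chromosomes present as keys are kept. The sketch below illustrates that rule in plain Python. It is an assumption-labeled illustration rather than the library's code: filter_fragments is a hypothetical helper, and the two-entry chrom_sizes dict is a toy subset, not a full genome.

```python
# Toy subset of chromosome sizes; a real dict would list every chromosome to keep.
chrom_sizes = {"chr1": 248_956_422, "chr2": 242_193_529}


def filter_fragments(fragments, genome):
    """Keep only fragments whose chromosome is a key of `genome` (hypothetical helper)."""
    return [frag for frag in fragments if frag[0] in genome]


fragments = [
    ("chr1", 1000, 1200),
    ("chrUn_KI270742v1", 50, 300),  # unplaced contig, not in the dict
    ("chr2", 5000, 5150),
    ("chrM", 10, 200),              # mitochondrial, not in the dict
]

kept = filter_fragments(fragments, chrom_sizes)
print(kept)

# With gatac installed, the same dict would be passed as (not executed here):
# peak_mat = gatac.tl.make_peak_matrix(
#     adata, parquet_path, genome=chrom_sizes, filter_chromosomes=True
# )
```

Passing a dict is a simple way to restrict counting to a chosen set of chromosomes, since anything absent from the keys is dropped before counting.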