Preprocessing — gatac.pp


The gatac.pp namespace covers the full preprocessing pipeline: reading raw fragment files, computing quality metrics, filtering barcodes, and building count matrices.


Fragment I/O

Convert raw fragment TSV.GZ files to columnar Parquet for efficient GPU streaming. Because Parquet stores data in row groups that can be read independently, GATAC can stream a file one row group at a time and process files larger than GPU memory.

make_parquet

Convert an ATAC fragment TSV.GZ file to Parquet format.

make_parquet_batch

Convert multiple ATAC fragment TSV.GZ files to Parquet in parallel.

read_fragments_parquet

Read ATAC fragments from a Parquet file, optimized for GPU memory.
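The streaming idea behind row groups can be illustrated without GATAC or Parquet. The sketch below is a pure-Python analogue, not the library's implementation: it reads a gzipped fragment TSV in fixed-size batches (standing in for Parquet row groups), so only one batch is held in memory at a time. The helper names `iter_fragment_batches`, `count_fragments_per_barcode`, and `demo`, and the five-column layout (chrom, start, end, barcode, count), are assumptions made for illustration.

```python
import gzip
import os
import tempfile
from collections import Counter
from itertools import islice


def iter_fragment_batches(path, batch_size):
    """Yield lists of (chrom, start, end, barcode, count), batch_size rows at a time."""
    with gzip.open(path, "rt") as handle:
        # Lazily split lines; skip comment lines that some fragment files carry.
        rows = (
            line.rstrip("\n").split("\t")
            for line in handle
            if not line.startswith("#")
        )
        while True:
            batch = [
                (chrom, int(start), int(end), barcode, int(count))
                for chrom, start, end, barcode, count in islice(rows, batch_size)
            ]
            if not batch:
                return
            yield batch


def count_fragments_per_barcode(path, batch_size=2):
    """Accumulate per-barcode fragment counts one batch at a time."""
    totals = Counter()
    for batch in iter_fragment_batches(path, batch_size):
        totals.update(barcode for _, _, _, barcode, _ in batch)
    return dict(totals)


def demo():
    """Write a tiny synthetic fragment file, then stream it back in batches of 2."""
    lines = [
        "# synthetic example data",
        "chr1\t100\t300\tAAAC\t1",
        "chr1\t450\t650\tAAAC\t2",
        "chr1\t520\t1010\tTTTG\t1",
        "chr2\t10\t210\tAAAC\t1",
        "chr2\t900\t1000\tTTTG\t1",
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fragments.tsv.gz")
        with gzip.open(path, "wt") as handle:
            handle.write("\n".join(lines) + "\n")
        n_batches = sum(1 for _ in iter_fragment_batches(path, 2))
        return n_batches, count_fragments_per_barcode(path, 2)
```

Calling `demo()` streams the five synthetic fragments in three batches. The same accumulate-per-batch pattern is what makes any per-barcode statistic computable on a file that does not fit in memory.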


Quality metrics & filtering

Compute TSS enrichment scores and fragment-level statistics entirely on the GPU with a streaming approach, filter barcodes by quality thresholds, and detect doublet or multiplet cells with the AMULET Poisson method. filter_fragments accepts precomputed metrics (a DataFrame or a CSV file) and a Polars query string. detect_doublets returns per-cell p-values and q-values along with an is_doublet flag.

compute_metrics

Compute TSS enrichment scores using GPU-accelerated streaming via row groups.

filter_fragments

Filter ATAC fragment Parquet file(s) based on cell quality metrics.

detect_doublets

AMULET doublet/multiplet detection from a GATAC parquet fragment file.
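To make the p-value, q-value, and is_doublet outputs concrete, here is a minimal pure-Python sketch of an AMULET-style Poisson test. It is not GATAC's implementation. It assumes the input is, for each barcode, the number of genomic regions covered by more than two fragments (a diploid cell should rarely exceed two), uses the across-cell mean as the Poisson rate, applies Benjamini-Hochberg correction, and flags cells with q below 0.01. The function names, the choice of rate, and the threshold are all assumptions for illustration.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed from the lower-tail sum."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues


def detect_doublets_sketch(overlap_counts, q_threshold=0.01):
    """overlap_counts: dict of barcode -> number of regions with >2 overlapping fragments."""
    barcodes = list(overlap_counts)
    if not barcodes:
        return {}
    # Assumption: the expected count is the mean across all cells.
    lam = sum(overlap_counts.values()) / len(barcodes)
    pvals = [poisson_sf(overlap_counts[b], lam) for b in barcodes]
    qvals = benjamini_hochberg(pvals)
    return {
        b: {"p_value": p, "q_value": q, "is_doublet": q < q_threshold}
        for b, p, q in zip(barcodes, pvals, qvals)
    }


# Hypothetical counts: nine ordinary cells and one with many >2-overlap regions.
example_counts = {f"cell{i}": c for i, c in enumerate([2, 1, 3, 2, 1, 2, 3, 1, 2, 30])}
example_result = detect_doublets_sketch(example_counts)
```

In this synthetic example only the last cell, with 30 such regions against a mean of 4.7, ends up flagged.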


Matrix processing

Build cell × feature matrices from QC-filtered fragments and post-process the resulting .h5ad files. The matrix builders cover fixed-width genomic bins (make_tile_matrix, compatible with SnapATAC2’s count strategy) and gene activity over a GTF annotation, computed either as SnapATAC2-style paired-insertion counts (make_gene_matrix) or as ArchR-style distance-weighted gene scores (make_gene_score_matrix, a port of addGeneScoreMatrix). Operations on existing .h5ad files cover combining samples (combine) and selecting the most accessible genomic features in one matrix (select_features) or across many (select_features_multi).
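As a rough illustration of what "distance-weighted" means, the sketch below applies ArchR's default gene model weight, exp(-|x| / 5000) + exp(-1), to tile counts around a single gene. It is a simplification and not GATAC's or ArchR's full procedure: it ignores strand, gene-size scaling, and neighboring-gene boundaries. The function names, the tile representation (center position, count), and the 100 kb distance cap are assumptions for illustration.

```python
import math


def archr_style_weight(distance_bp, decay_bp=5000.0):
    """Weight exp(-|d| / decay) + exp(-1), following ArchR's default gene model."""
    return math.exp(-abs(distance_bp) / decay_bp) + math.exp(-1)


def distance_to_gene(position, gene_start, gene_end):
    """Distance in bp from a position to the gene interval; 0 inside the gene."""
    if position < gene_start:
        return gene_start - position
    if position > gene_end:
        return position - gene_end
    return 0


def gene_score(tiles, gene_start, gene_end, max_distance=100_000):
    """Sum tile counts weighted by distance to the gene (simplified sketch).

    tiles: iterable of (tile_center_bp, count) on the gene's chromosome.
    max_distance: assumed cap beyond which tiles contribute nothing.
    """
    score = 0.0
    for center, count in tiles:
        d = distance_to_gene(center, gene_start, gene_end)
        if d <= max_distance:
            score += count * archr_style_weight(d)
    return score


# Hypothetical tiles around a gene spanning 10,000 to 20,000 bp.
example_tiles = [(15_000, 2), (25_000, 1), (5_000, 3), (500_000, 7)]
example_score = gene_score(example_tiles, 10_000, 20_000)
```

Tiles inside the gene get the maximum weight of 1 + exp(-1), tiles 5 kb away get 2·exp(-1), and the tile 480 kb away is dropped by the cap.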

make_tile_matrix

Generate a tile matrix from fragments or interval-like feature matrices.

make_gene_matrix

Process an ATAC fragment Parquet file and generate a gene activity matrix.

make_gene_score_matrix

GPU-accelerated ArchR-style gene activity score matrix.

select_features

GPU-accelerated feature selection for ATAC-seq tile matrices.

select_features_multi

Streaming feature selection across multiple h5ad files.

combine

Merge multiple h5ad files into a single file with efficient streaming.
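The tile-matrix and feature-selection steps can be sketched in plain Python. This is an illustration under stated assumptions, not GATAC's GPU code: it assumes paired-insertion counting means each fragment contributes one count per end, except that a fragment whose two ends fall in the same bin is counted once, and it assumes "most accessible" means the highest total count across cells. The function names, the 500 bp bin size, and the 0-based half-open coordinates are likewise assumptions.

```python
from collections import defaultdict


def tile_counts_paired_insertion(fragments, bin_size=500):
    """Count insertions per (barcode, chrom, bin_start).

    fragments: iterable of (chrom, start, end, barcode), 0-based half-open.
    """
    counts = defaultdict(int)
    for chrom, start, end, barcode in fragments:
        left_bin = (start // bin_size) * bin_size
        # The right insertion site is the last covered base, end - 1.
        right_bin = ((end - 1) // bin_size) * bin_size
        counts[(barcode, chrom, left_bin)] += 1
        if right_bin != left_bin:
            # Count the second end only when it lands in a different bin.
            counts[(barcode, chrom, right_bin)] += 1
    return dict(counts)


def top_features(counts, n_features):
    """Return the n_features bins with the highest total count across cells."""
    totals = defaultdict(int)
    for (_, chrom, bin_start), value in counts.items():
        totals[(chrom, bin_start)] += value
    # Sort by descending total, breaking ties by genomic position.
    ranked = sorted(totals, key=lambda feature: (-totals[feature], feature))
    return ranked[:n_features]


# Hypothetical fragments for two barcodes on chr1.
example_fragments = [
    ("chr1", 100, 300, "AAAC"),   # both ends in bin 0: counted once
    ("chr1", 450, 650, "AAAC"),   # ends in bins 0 and 500
    ("chr1", 520, 1010, "TTTG"),  # ends in bins 500 and 1000
    ("chr1", 900, 1000, "TTTG"),  # both ends in bin 500: counted once
]
example_counts = tile_counts_paired_insertion(example_fragments)
example_top = top_features(example_counts, 2)
```

A sparse dictionary keyed by (barcode, chrom, bin) is used here because most cell and bin combinations are empty in ATAC data; a real pipeline would write the same information as a sparse matrix.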