GATAC: GPU-Accelerated scATACseq Analysis
GATAC is a GPU-accelerated toolkit for end-to-end single-cell ATAC-seq (scATAC-seq) data processing. Starting from raw fragment files, it produces analysis-ready sparse matrices (tile, peak, and gene activity) that integrate directly with the scverse ecosystem (AnnData, Scanpy, and SnapATAC2). Computations are offloaded to the GPU via RAPIDS cuDF, CuPy, and cuML, delivering 10–50× speedups over CPU-only workflows on typical single-cell datasets.
GATAC uses Apache Parquet as its staging format. The columnar layout and built-in compression let RAPIDS cuDF stream data directly into GPU memory with zero CPU round-trips, which enables aggregation over datasets that far exceed available GPU RAM.
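The streaming idea can be illustrated without a GPU. The sketch below is not GATAC's API: it uses only the Python standard library as a stand-in for Parquet and cuDF, and the names `iter_chunks` and `fragments_per_barcode` are hypothetical. It assumes the common five-column fragment layout (chromosome, start, end, barcode, read support) and shows how a per-barcode aggregate can be built from fixed-size chunks so that only one chunk is held in memory at a time.

```python
import csv
import gzip
import io
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most `chunk_size` rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def fragments_per_barcode(handle, chunk_size=2):
    """Count fragments per barcode by merging per-chunk partial counts.

    Assumes tab-separated columns: chrom, start, end, barcode, read support.
    """
    reader = csv.reader(handle, delimiter="\t")
    # Skip blank lines and '#' comment/header lines.
    rows = (r for r in reader if r and not r[0].startswith("#"))
    totals = Counter()
    for chunk in iter_chunks(rows, chunk_size):
        partial = Counter(row[3] for row in chunk)  # aggregate one chunk
        totals.update(partial)                      # merge into running total
    return totals


# Tiny in-memory TSV.GZ "file" standing in for a real fragment file.
raw = (
    "chr1\t100\t350\tAAAC\t1\n"
    "chr1\t450\t620\tAAAC\t2\n"
    "chr2\t0\t900\tTTTG\t1\n"
)
buf = io.BytesIO(gzip.compress(raw.encode()))
with gzip.open(buf, "rt", newline="") as handle:
    totals = fragments_per_barcode(handle, chunk_size=2)
```

Because each chunk's partial counts are merged and then discarded, peak memory depends on the chunk size rather than on the file size. A Parquet-based pipeline can apply the same pattern per row group instead of per block of text lines.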
GATAC reproduces the core operations of established tools like SnapATAC2, ArchR, MACS3, and chromVAR within a unified framework. It produces standard AnnData objects fully compatible with the scverse ecosystem.
What can GATAC do?
Convert raw TSV.GZ fragment files to columnar Parquet for fast GPU streaming, with optional barcode prefixing for multi-sample projects.
Compute QC metrics on the GPU (TSS enrichment, unique fragments, and duplicate and mitochondrial fractions), then filter barcodes by threshold via a Polars query engine.
Flag doublet / multiplet cells via the AMULET Poisson overlap test.
Bin the genome into fixed-size tiles and produce a sparse cell × bin count matrix compatible with SnapATAC2.
Score gene activity from paired insertion counts over promoter + gene body regions, or compute ArchR-style distance-weighted gene scores — all from a GTF annotation.
Select the most accessible genomic features across one or many h5ad files on the GPU using streaming aggregation.
Run spectral decomposition of the cell × feature matrix for dimensionality reduction, UMAP, and clustering.
Call peaks per cell-type group, merge overlapping peaks across groups, and build a cell × peak count matrix.
Scan peaks for TF binding motifs (MEME format), run motif enrichment tests, and compute chromVAR deviation scores.
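As an illustration of the tile-binning step listed above, here is a minimal CPU sketch in plain Python. It is not GATAC's implementation or API: the function name `tile_matrix`, the 500 bp default bin size, and the choice to count both fragment ends as insertions are assumptions made for this example. The result is a sparse cell × bin matrix held as `(row, col) -> count` entries.

```python
from collections import Counter


def tile_matrix(fragments, chrom_sizes, bin_size=500):
    """Build a sparse cell x bin count matrix from fragments.

    fragments:   iterable of (chrom, start, end, barcode), 0-based half-open.
    chrom_sizes: dict of chromosome name -> length; sets the bin order.
    Returns (barcodes, n_bins, counts) where counts maps (row, col) -> count.
    """
    # Give each chromosome a contiguous block of global bin indices.
    offsets = {}
    n_bins = 0
    for chrom, size in chrom_sizes.items():
        offsets[chrom] = n_bins
        n_bins += -(-size // bin_size)  # ceiling division

    barcodes = {}       # barcode -> row index, in order of first appearance
    counts = Counter()  # sparse storage: only non-zero entries are kept
    for chrom, start, end, barcode in fragments:
        if chrom not in offsets:
            continue  # skip chromosomes absent from chrom_sizes
        row = barcodes.setdefault(barcode, len(barcodes))
        # Assumption: count both fragment ends as insertion sites.
        for pos in (start, end - 1):
            counts[(row, offsets[chrom] + pos // bin_size)] += 1
    return list(barcodes), n_bins, dict(counts)


chrom_sizes = {"chr1": 2000, "chr2": 1200}
fragments = [
    ("chr1", 100, 350, "AAAC"),
    ("chr1", 450, 620, "AAAC"),
    ("chr2", 0, 900, "TTTG"),
    ("chrM", 10, 200, "TTTG"),  # not in chrom_sizes, so ignored
]
barcodes, n_bins, counts = tile_matrix(fragments, chrom_sizes)
```

The `(row, col) -> count` mapping corresponds to the coordinate (COO) form of a sparse matrix, so it could be converted to a SciPy sparse matrix and wrapped in an AnnData object with `barcodes` as the observation names.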