GATAC: GPU-Accelerated scATACseq Analysis


GATAC is a GPU-accelerated toolkit for end-to-end single-cell ATAC-seq (scATAC-seq) data processing. Starting from raw fragment files, it produces analysis-ready sparse matrices (tile, peak, and gene activity) that integrate directly with the scverse ecosystem, including AnnData, Scanpy, and SnapATAC2. Computations are offloaded to the GPU via RAPIDS cuDF, CuPy, and cuML, delivering 10–50× speedups over CPU-only workflows on typical single-cell datasets.


Parquet-Native Pipeline

GATAC uses Apache Parquet as its staging format. Columnar layout and built-in compression let RAPIDS cuDF stream data directly into GPU memory with zero CPU round-trips — enabling aggregation over datasets that far exceed available GPU RAM.

Ecosystem Integration

GATAC reproduces the core operations of established tools like SnapATAC2, ArchR, MACS3, and chromVAR within a unified framework. It produces standard AnnData objects fully compatible with the scverse ecosystem.


What can GATAC do?

Fragment I/O

Convert raw TSV.GZ fragment files to columnar Parquet for fast GPU streaming, with optional barcode prefixing for multi-sample projects.

→ CLI: convert · → API: pp
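
As a rough illustration of the conversion step, the sketch below parses a gzipped fragment file into columns and applies an optional barcode prefix. It assumes the common five-column fragment layout (chrom, start, end, barcode, count); the function name `fragments_to_columns`, the `sample_prefix` parameter, and the underscore separator are hypothetical and are not GATAC's API.

```python
import gzip
import io

def fragments_to_columns(lines, sample_prefix=None):
    """Parse fragment TSV lines into a column-oriented dict (hypothetical helper)."""
    cols = {"chrom": [], "start": [], "end": [], "barcode": [], "count": []}
    for line in lines:
        # Skip blank lines and header comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end, barcode = fields[:4]
        # Assumption: a missing count column means a single read pair.
        count = int(fields[4]) if len(fields) > 4 else 1
        if sample_prefix is not None:
            # Assumption: prefix and barcode are joined with "_".
            barcode = f"{sample_prefix}_{barcode}"
        cols["chrom"].append(chrom)
        cols["start"].append(int(start))
        cols["end"].append(int(end))
        cols["barcode"].append(barcode)
        cols["count"].append(count)
    return cols

# Build a tiny gzipped fragment file in memory and parse it.
raw = "# header comment\nchr1\t100\t250\tAAAC-1\t2\nchr1\t300\t420\tTTTG-1\t1\n"
buf = io.BytesIO(gzip.compress(raw.encode()))
with gzip.open(buf, "rt") as handle:
    columns = fragments_to_columns(handle, sample_prefix="s1")
```

Writing the resulting columns to Parquet would need a Parquet library, which is left out here to keep the sketch dependency-free.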

Quality & Filtering

GPU-computed QC metrics (TSS enrichment, unique fragments, duplicate and mitochondrial fractions) plus threshold-based barcode filtering via a Polars query engine.

→ CLI: metrics · → CLI: filter · → API: pp
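
To make the per-barcode metrics concrete, here is a minimal CPU sketch of three of them followed by a threshold filter. The metric definitions are assumptions chosen for illustration (duplicate fraction as one minus unique fragments over total read pairs, mitochondrial fraction as the share of unique fragments on `chrM`), and `barcode_qc` and the thresholds are hypothetical; TSS enrichment is omitted because it needs an annotation.

```python
from collections import defaultdict

def barcode_qc(fragments, mito_chrom="chrM"):
    """Compute simple per-barcode QC metrics from (chrom, start, end, barcode, count) rows."""
    stats = defaultdict(lambda: {"unique": 0, "total": 0, "mito": 0})
    for chrom, _start, _end, barcode, count in fragments:
        s = stats[barcode]
        s["unique"] += 1        # each row is one deduplicated fragment
        s["total"] += count     # count = read pairs supporting that fragment
        if chrom == mito_chrom:
            s["mito"] += 1
    return {
        bc: {
            "n_unique": s["unique"],
            "dup_frac": 1 - s["unique"] / s["total"],
            "mito_frac": s["mito"] / s["unique"],
        }
        for bc, s in stats.items()
    }

fragments = [
    ("chr1", 100, 200, "A", 2),
    ("chr1", 300, 400, "A", 1),
    ("chrM", 10, 90, "A", 1),
    ("chr1", 100, 200, "B", 1),
]
qc = barcode_qc(fragments)

# Threshold-based filtering (illustrative cutoffs, not recommended defaults).
passing = sorted(
    bc for bc, m in qc.items()
    if m["n_unique"] >= 2 and m["mito_frac"] <= 0.5
)
```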

Doublet Detection

Flag doublet / multiplet cells via the AMULET Poisson overlap test.

→ CLI: doublets · → API: pp
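
The idea behind an overlap-based doublet test can be sketched as follows: a diploid nucleus should rarely show more than two fragments covering the same locus, so cells with many such loci are suspicious. The code below counts regions of depth three or more per cell and scores each count with a Poisson upper tail. It is a simplified illustration, not AMULET itself: using the mean count across cells as the Poisson rate is an assumption, and region filtering and multiple-testing correction are left out.

```python
import math
from collections import defaultdict

def count_overlap_regions(intervals, min_depth=3):
    """Count runs where at least min_depth half-open intervals overlap."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # Ends are processed before starts at the same coordinate,
    # so touching half-open intervals do not count as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    depth, regions, in_region = 0, 0, False
    for _pos, delta in events:
        depth += delta
        if depth >= min_depth and not in_region:
            regions += 1
            in_region = True
        elif depth < min_depth and in_region:
            in_region = False
    return regions

def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def doublet_pvalues(fragments):
    """Per-cell overlap counts and Poisson p-values from (chrom, start, end, barcode) rows."""
    per_cell = defaultdict(lambda: defaultdict(list))
    for chrom, start, end, barcode in fragments:
        per_cell[barcode][chrom].append((start, end))
    counts = {
        bc: sum(count_overlap_regions(iv) for iv in chroms.values())
        for bc, chroms in per_cell.items()
    }
    lam = sum(counts.values()) / len(counts)  # assumption: rate = mean count
    return counts, {bc: poisson_sf(k, lam) for bc, k in counts.items()}

fragments = [
    ("chr1", 100, 200, "A"), ("chr1", 150, 250, "A"), ("chr1", 180, 300, "A"),
    ("chr1", 100, 200, "B"), ("chr1", 300, 400, "B"),
]
overlap_counts, pvalues = doublet_pvalues(fragments)
```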

Tile Matrix

Bin the genome into fixed-size tiles and produce a sparse cell × bin count matrix compatible with SnapATAC2.

→ CLI: tile · → API: pp
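
A tile matrix can be pictured as a sparse table keyed by (cell, bin). The sketch below bins both ends of each fragment into fixed-size tiles laid out chromosome after chromosome. Counting both fragment ends as insertions, the 500 bp tile size, and the function name `tile_matrix` are assumptions made for the example rather than a description of GATAC's internals.

```python
from collections import Counter

def tile_matrix(fragments, chrom_sizes, tile_size=500):
    """Build a sparse cell x bin count table from (chrom, start, end, barcode) rows."""
    # Assign each chromosome a starting column; bins are laid out end to end.
    offsets = {}
    n_bins = 0
    for chrom, size in chrom_sizes.items():
        offsets[chrom] = n_bins
        n_bins += -(-size // tile_size)  # ceiling division
    barcodes = {}        # barcode -> row index, in order of first appearance
    counts = Counter()   # (row, column) -> insertion count
    for chrom, start, end, barcode in fragments:
        if chrom not in offsets:
            continue  # skip contigs without a known size
        row = barcodes.setdefault(barcode, len(barcodes))
        # Assumption: both ends of a half-open fragment are insertion sites.
        for pos in (start, end - 1):
            counts[(row, offsets[chrom] + pos // tile_size)] += 1
    return list(barcodes), n_bins, counts

chrom_sizes = {"chr1": 1200, "chr2": 600}
fragments = [
    ("chr1", 100, 700, "A"),
    ("chr2", 10, 90, "A"),
    ("chr1", 1000, 1100, "B"),
]
barcodes, n_bins, counts = tile_matrix(fragments, chrom_sizes)
```

The `counts` mapping holds the same information as a COO sparse matrix (row, column, value), which is the form a sparse AnnData `X` would be built from.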

Gene Processing

Score gene activity from paired insertion counts over promoter + gene body regions, or compute ArchR-style distance-weighted gene scores — all from a GTF annotation.

→ CLI: gene · → CLI: genescore · → API: pp
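
Distance-weighted gene scoring can be illustrated with a single gene and a handful of tiles. The weight used below, exp(-|d| / 5000) + exp(-1), follows the default gene model described in the ArchR documentation as best I recall it; treat it, the 100 kb window, and the `gene_score` helper as assumptions. Gene-body handling, strand, neighbouring-gene boundaries, and gene-length scaling are all left out.

```python
import math

def gene_score(tile_counts, tss, tile_size=500, window=100_000, decay=5_000):
    """Weighted sum of tile counts around a TSS (simplified, single gene)."""
    score = 0.0
    for tile_index, count in tile_counts.items():
        center = tile_index * tile_size + tile_size // 2
        distance = abs(center - tss)
        if distance <= window:
            # Nearby tiles weigh more; the exp(-1) term keeps a floor weight.
            weight = math.exp(-distance / decay) + math.exp(-1)
            score += count * weight
    return score

# One cell's tile counts on a single chromosome: {tile index: insertions}.
tile_counts = {20: 4, 400: 10}
score = gene_score(tile_counts, tss=10_250)
```

In this toy input the tile at index 20 is centred exactly on the TSS, while the tile at index 400 lies outside the window and contributes nothing.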

Feature Selection

GPU-accelerated selection of the most accessible genomic features across one or many h5ad files using streaming aggregation.

→ CLI: features · → API: pp
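
Streaming aggregation for feature selection boils down to keeping one running total per feature while the input is read chunk by chunk, so that no full matrix is ever held in memory. A minimal CPU sketch, with each chunk standing in for one file's sparse entries and `select_top_features` as a hypothetical name:

```python
import heapq

def select_top_features(chunks, n_features, n_top):
    """Return indices of the n_top features with the largest total counts."""
    totals = [0] * n_features
    for chunk in chunks:
        # Each chunk is an iterable of (row, column, value) sparse entries.
        for _row, col, value in chunk:
            totals[col] += value
    top = heapq.nlargest(n_top, range(n_features), key=lambda j: totals[j])
    return sorted(top), totals

file_1 = [(0, 0, 1), (0, 2, 5)]
file_2 = [(0, 1, 2), (1, 2, 1), (1, 3, 4)]
selected, totals = select_top_features([file_1, file_2], n_features=4, n_top=2)
```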

Spectral Embedding

Spectral decomposition of the cell × feature matrix for dimensionality reduction, UMAP, and clustering.

→ API: tl

Peak Calling

Call peaks per cell-type group, merge overlapping peaks across groups, and build a cell × peak count matrix.

→ API: tl
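
The merging step can be sketched as a plain interval union over the per-group peak sets. This is only one possible strategy and is an assumption here: some tools instead resolve overlaps by keeping the most significant peak, and this page does not say which approach GATAC takes. The `merge_peaks` helper is hypothetical.

```python
def merge_peaks(peaks_by_group):
    """Union overlapping half-open (chrom, start, end) peaks across groups."""
    all_peaks = sorted(p for peaks in peaks_by_group.values() for p in peaks)
    merged = []
    for chrom, start, end in all_peaks:
        last = merged[-1] if merged else None
        # Strict overlap only: peaks that merely touch stay separate.
        if last and last[0] == chrom and start < last[2]:
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(p) for p in merged]

peaks_by_group = {
    "T cells": [("chr1", 100, 300), ("chr2", 50, 80)],
    "B cells": [("chr1", 250, 400), ("chr1", 400, 500)],
}
consensus = merge_peaks(peaks_by_group)
```

The merged intervals then serve as the columns of the cell × peak matrix, counted the same way as fixed-size tiles.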

Motif Analysis

Scan peaks for TF binding motifs (MEME format), run motif enrichment tests, and compute chromVAR deviation scores.

→ API: tl
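
For intuition about deviation scores, the sketch below computes a raw, uncorrected deviation per cell: observed counts in motif-containing peaks compared with the counts expected from the cell's sequencing depth alone. chromVAR additionally normalises against GC- and accessibility-matched background peak sets to obtain z-scores; that step is omitted, and `raw_deviations` is a hypothetical helper.

```python
def raw_deviations(counts, motif_peaks):
    """Raw motif deviations for a dense cell x peak count table (list of rows)."""
    cell_totals = [sum(row) for row in counts]
    grand_total = sum(cell_totals)
    # Total accessibility, across all cells, of the peaks containing the motif.
    motif_total = sum(row[j] for row in counts for j in motif_peaks)
    deviations = []
    for row, cell_total in zip(counts, cell_totals):
        observed = sum(row[j] for j in motif_peaks)
        # Expected count if this cell had the population-average profile.
        expected = motif_total * cell_total / grand_total
        deviations.append((observed - expected) / expected)
    return deviations

counts = [
    [2, 0, 1, 1],  # cell 0
    [0, 4, 2, 2],  # cell 1
]
deviations = raw_deviations(counts, motif_peaks={0, 2})
```

A positive value means the motif's peaks are more accessible in that cell than depth alone would predict; a negative value means less.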