Credits & Attributions

AI-assisted development

Most of GATAC's code was written with AI assistance, used responsibly and under strict validation against the expected behavior of the original CPU-based tools being ported. The bulk of the core feature porting and GPU optimization work was carried out with Claude Opus 4.6 and GPT-5.4, while Claude Sonnet 4.6 was used for secondary development tasks.

All AI-assisted implementations were tightly constrained to match reference outputs from SnapATAC2, ArchR, chromVAR, MACS3, and AMULET. Reproducibility was also tested thoroughly: the results are reported on the Reproducibility page and covered by the associated test suite.

Algorithms

AMULET

GATAC’s doublet/multiplet detection (gatac.pp.detect_doublets) is a direct port of the AMULET algorithm by Thibodeau A. et al. (Genome Biology, 2021).

The overlap-detection sweep-line, the per-cell Poisson scoring, the row-sum Poisson repeat-inference, and the BH-FDR correction are translated line-for-line from the upstream Python source (FragmentFileOverlapCounter.py and AMULET.py). The data flow has been rewritten to operate on GATAC’s parquet fragment files via DuckDB and to parallelize per-chromosome with a worker pool; the optional repeat-filter pass is applied at the raw-read level rather than the overlap level. The original AMULET v1.1 release is run as an external subprocess during the reproducibility tests, where it agrees with GATAC exactly (Jaccard 1.000, q-value Pearson r 1.000 on 13,735 cells × 22 autosomes).
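For readers unfamiliar with these steps, the following is a minimal, illustrative sketch of a sweep-line overlap count, a per-cell Poisson upper-tail score, and a Benjamini-Hochberg correction. It is not GATAC's or AMULET's code. The function names, the depth threshold of three overlapping fragments, the half-open interval convention, and the use of the mean count as the Poisson rate are all assumptions made for illustration, and the repeat-inference and repeat-filter steps are omitted.

```python
import math
from collections import defaultdict


def count_overlap_regions(fragments, min_depth=3):
    """Count maximal regions covered by at least `min_depth` fragments.

    `fragments` holds half-open (start, end) intervals for one cell on one
    chromosome. The threshold of 3 is an assumption for this sketch.
    """
    # Net change in coverage at each coordinate.
    deltas = defaultdict(int)
    for start, end in fragments:
        deltas[start] += 1
        deltas[end] -= 1

    depth = 0
    regions = 0
    in_region = False
    # Sweep left to right, applying all changes at a coordinate at once.
    for pos in sorted(deltas):
        depth += deltas[pos]
        covered = depth >= min_depth
        if covered and not in_region:
            regions += 1
        in_region = covered
    return regions


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as `pvalues`."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues


def score_cells(fragments_by_cell):
    """Map each cell barcode to a q-value (hypothetical helper).

    Assumes the Poisson rate is the mean overlap count across cells.
    """
    cells = list(fragments_by_cell)
    counts = [count_overlap_regions(fragments_by_cell[c]) for c in cells]
    lam = sum(counts) / len(counts) if counts else 0.0
    pvalues = [poisson_sf(k, lam) for k in counts]
    return dict(zip(cells, benjamini_hochberg(pvalues)))
```

Cells whose q-value falls below a chosen cutoff would be flagged as candidate multiplets. Grouping coordinate changes by position before sweeping keeps two touching half-open intervals from being counted as overlapping.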

The original AMULET tool is distributed under the GNU GPL v3.0. No AMULET source code is bundled with or linked into GATAC.

Third-party code

scPrinter

GATAC’s chromvar implementation originated in the scPrinter repository and was subsequently reworked for GPU memory optimization and closer agreement with the original chromVAR R results.

gmacs

Portions of GATAC’s GPU-accelerated peak-calling code are derived from gmacs, a GPU-accelerated implementation of the MACS3 algorithm developed by Harihara Subrahmaniam Muralidharan at LatchBio.

The relevant code has been extensively modified for integration into the GATAC pipeline. Key updates include adaptation to the GATAC Parquet/cuDF data model, enhanced memory management through streaming, and a refactored API designed to align with the scverse ecosystem.

gmacs is dual-licensed under CC0 1.0 Universal and the MIT License.