gatac doublets#

Detect doublet / multiplet cells in a single-cell ATAC-seq sample using the AMULET Poisson method of Thibodeau et al. (2021). Cells with significantly more overlapping insertions than expected are flagged as doublets.


Synopsis#

gatac doublets <input.parquet> -g <genome> [-o OUTPUT]
                 [-m MIN_FRAGS] [--q Q] [--q-rep Q_REP]
                 [--expected-overlap N] [--max-insert N]
                 [--min-overlap N] [--repeat-filter BED]
                 [-j THREADS]

Arguments#

Positional#

Argument

Description

input.parquet

Path to the fragment Parquet file

Options#

Flag

Default

Description

-g, --genome

required

Genome name (hg38, mm10, …) or path to a chromosome sizes file

-o, --output

<input>_doublets.csv

Output CSV path

-m, --min-fragments

100

Minimum unique fragments per cell to include

--q

0.01

FDR threshold for doublet calling

--q-rep

0.01

FDR threshold for inferring repetitive regions

--expected-overlap

2

Expected number of reads overlapping (Poisson mean)

--max-insert

900

Maximum fragment insert size in bp

--min-overlap

1

Minimum overlap length in bp

--repeat-filter

BED file of known repetitive regions to exclude

-j, --threads

1

Parallel workers for overlap detection


Output#

A CSV with one row per cell:

Column

Description

cell_id

Cell barcode

p_value

Poisson p-value for the observed overlap count

q_value

Benjamini–Hochberg adjusted p-value

is_doublet

True if q_value < --q


Examples#

Basic usage#

gatac doublets pbmc.parquet -g hg38 -o pbmc_doublets.csv

Stricter threshold#

gatac doublets pbmc.parquet -g hg38 --q 0.001 -o pbmc_doublets.csv

Exclude known repetitive regions#

gatac doublets pbmc.parquet -g hg38 --repeat-filter hg38_repeats.bed

Python equivalent#

import gatac as ga

result = ga.pp.detect_doublets(
    "pbmc.parquet",
    chrom_sizes="hg38",
    min_fragments=100,
    q_threshold=0.01,
)
print(result.head())
#   cell_id   p_value   q_value  is_doublet
# 0  AAAC...  0.8321   0.9412     False
# 1  AAAC...  0.0001   0.0034     True
# ...

# Drop doublets from an AnnData
doublets = set(result.loc[result["is_doublet"], "cell_id"])
adata = adata[~adata.obs_names.isin(doublets)].copy()

Notes#

AMULET’s Poisson model assumes a uniform single-copy background and is designed for autosomes only. Make sure the supplied chrom_sizes covers the autosomes of your species (the built-in hg38, hg19, mm10, mm39 aliases do).