gatac.pp.detect_doublets

gatac.pp.detect_doublets(fragment_path, chrom_sizes, barcodes=None, min_fragments=100, expected_overlap=2, max_insert_size=900, q_threshold=0.01, q_rep_threshold=0.01, repeat_filter=None, min_overlap_bp=1, n_threads=1)

AMULET doublet/multiplet detection from a GATAC parquet fragment file.

Implements the original AMULET Poisson method of Thibodeau et al. (2021): cells with an abnormally high number of overlapping fragment insertions are flagged as doublets/multiplets.
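The statistical idea can be illustrated with a minimal, pure-Python sketch. It is not the gatac implementation: the helper names, the toy counts, and the choice of the across-cell mean as the Poisson rate are all assumptions made for illustration. Each cell's count of excess-overlap regions is tested against a Poisson upper tail, and the p-values are then adjusted with the Benjamini-Hochberg procedure before applying a threshold such as q_threshold.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        q_values[idx] = running_min
    return q_values

# Hypothetical per-cell counts of regions covered by more than two fragments.
overlap_counts = {"cellA": 3, "cellB": 4, "cellC": 2, "cellD": 25, "cellE": 3}

# Assumption for this sketch: the Poisson rate is the mean count across cells.
lam = sum(overlap_counts.values()) / len(overlap_counts)

cells = list(overlap_counts)
p_values = [poisson_upper_tail(overlap_counts[c], lam) for c in cells]
q_values = benjamini_hochberg(p_values)

q_threshold = 0.01
is_doublet = {c: q < q_threshold for c, q in zip(cells, q_values)}
# In this toy input only cellD falls below the threshold.
```

The real function may estimate the rate and handle repetitive regions differently; the sketch only shows how a count, a p-value, a q-value, and a doublet call relate to one another.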

Parameters:
fragment_path str or Path

GATAC parquet fragment file (output of gatac convert).

chrom_sizes dict or str

Chromosome sizes dict, or genome name (e.g. 'hg38', 'mm10').

barcodes list of str, optional

Barcodes to test. If None, all barcodes with >= min_fragments fragments are used.

min_fragments int

Minimum unique fragments per cell to include (default 100).

expected_overlap int

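Expected number of fragments overlapping at any locus in a single cell (default 2, as for a diploid genome). Loci covered by more fragments than this count as evidence for a multiplet.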

max_insert_size int

Maximum fragment insert size in bp (default 900).

q_threshold float

FDR threshold for doublet calling (default 0.01).

q_rep_threshold float

FDR threshold for inferring repetitive regions (default 0.01).

repeat_filter str or Path, optional

BED file of known repetitive regions to exclude from overlap counting.

min_overlap_bp int

Minimum overlap length in bp to retain (default 1).

n_threads int

Parallel workers for overlap detection (default 1).
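One way to read expected_overlap, max_insert_size, and min_overlap_bp together is the sweep-line sketch below. It is an interpretation of the parameter descriptions above, not gatac's code: the function name count_excess_overlap_regions is hypothetical, fragments are assumed to be half-open (start, end) intervals from one cell on one chromosome, and the exact way gatac applies each filter is assumed.

```python
from collections import Counter

def count_excess_overlap_regions(fragments, expected_overlap=2,
                                 max_insert_size=900, min_overlap_bp=1):
    """Count maximal regions covered by more than `expected_overlap` fragments.

    `fragments` is an iterable of half-open (start, end) intervals for one
    cell on one chromosome. All filtering rules here are assumptions.
    """
    # Net change in coverage depth at each coordinate.
    deltas = Counter()
    for start, end in fragments:
        if end - start > max_insert_size:
            continue  # assumed: overly long fragments are ignored
        deltas[start] += 1
        deltas[end] -= 1

    depth = 0
    region_start = None
    n_regions = 0
    for pos in sorted(deltas):
        prev_depth = depth
        depth += deltas[pos]
        if prev_depth <= expected_overlap < depth:
            region_start = pos  # coverage just exceeded the expectation
        elif depth <= expected_overlap < prev_depth:
            # Coverage dropped back; keep the region if it is long enough.
            if pos - region_start >= min_overlap_bp:
                n_regions += 1
            region_start = None
    return n_regions

# Three fragments share [200, 300); the fourth is isolated.
toy = [(100, 300), (150, 350), (200, 400), (1000, 1200)]
n = count_excess_overlap_regions(toy)
```

Raising min_overlap_bp above the length of the shared stretch, or raising expected_overlap, makes the same input yield no qualifying regions.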

Returns:

pd.DataFrame Per-cell results with columns: cell_id, p_value, q_value, is_doublet.

Return type:

DataFrame

Notes

Only canonical autosomes (chr1..chrN) are considered. Sex chromosomes, mitochondrial DNA, and decoy contigs are dropped, matching the default behaviour of the original AMULET v1.1 tool (human_autosomes.txt). AMULET's Poisson model assumes a uniform single-copy background signal, which does not hold for chrX, chrY, chrM, or unplaced contigs.
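The autosome restriction can be pictured as a simple name filter over a chromosome sizes dict. The sketch below is an assumption about how such a filter might look, not the function's internals: keep_autosomes is a hypothetical helper, and the sizes dict is an illustrative subset.

```python
import re

# Matches names such as "chr1" or "chr22", but not "chrX", "chrM",
# or contigs such as "chr1_KI270706v1_random".
AUTOSOME_RE = re.compile(r"chr\d+")

def keep_autosomes(chrom_sizes):
    """Return only the entries whose name is 'chr' followed by digits."""
    return {
        name: size
        for name, size in chrom_sizes.items()
        if AUTOSOME_RE.fullmatch(name)
    }

# Illustrative subset of a chromosome sizes dict.
sizes = {
    "chr1": 248956422,
    "chr2": 242193529,
    "chrX": 156040895,
    "chrY": 57227415,
    "chrM": 16569,
    "chr1_KI270706v1_random": 175055,
}

autosomes = keep_autosomes(sizes)
```

Using fullmatch rather than a prefix match is what keeps decoy contigs that start with an autosome name out of the result.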

Examples

>>> import gatac as ga
>>> result = ga.pp.detect_doublets("pbmc.parquet", chrom_sizes="hg38")
>>> result.columns.tolist()
['cell_id', 'p_value', 'q_value', 'is_doublet']
>>> # Filter cells to keep only singlets.
>>> # `adata` is assumed to be an existing AnnData object whose obs_names are the cell barcodes.
>>> doublets = set(result.loc[result["is_doublet"], "cell_id"])
>>> keep = adata[~adata.obs_names.isin(doublets)].copy()