gatac.pp.detect_doublets

gatac.pp.detect_doublets(fragment_path, chrom_sizes, barcodes=None, min_fragments=100, expected_overlap=2, max_insert_size=900, q_threshold=0.01, q_rep_threshold=0.01, repeat_filter=None, min_overlap_bp=1, n_threads=1)

AMULET doublet/multiplet detection from a GATAC parquet fragment file.

Implements the original AMULET Poisson method of Thibodeau et al. (2021): cells with an abnormally high number of overlapping fragment insertions are flagged as doublets/multiplets.
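The statistical idea can be illustrated with a minimal, standard-library sketch: an upper-tail Poisson p-value per cell followed by Benjamini-Hochberg correction. This is not GATAC's implementation. The helper names (poisson_sf, benjamini_hochberg), the toy counts, and the choice of the mean count as the Poisson rate are all assumptions made for illustration.

```python
import math

def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    # Accumulate P(X = i) for i = 0 .. k-1, then take the complement.
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values

# Hypothetical per-cell counts of regions with more overlapping fragments
# than expected (toy numbers, not real data).
counts = {"cellA": 3, "cellB": 2, "cellC": 4, "cellD": 40}

# Assumption: the Poisson rate is estimated as the mean count across cells.
lam = sum(counts.values()) / len(counts)

p_values = [poisson_sf(k, lam) for k in counts.values()]
q_values = benjamini_hochberg(p_values)
calls = {cell: q < 0.01 for cell, q in zip(counts, q_values)}
```

With these toy counts only cellD falls below the 0.01 threshold, which mirrors how q_threshold gates the is_doublet call.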

Parameters:

fragment_path str or Path: GATAC parquet fragment file (output of gatac convert).
chrom_sizes dict or str: Dict mapping chromosome names to sizes, or a genome name (e.g. 'hg38', 'mm10').
barcodes list of str, optional: Barcodes to test. If None, all barcodes with >= min_fragments fragments are used.
min_fragments int: Minimum unique fragments per cell to include (default 100).
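expected_overlap int: Expected number of overlapping reads at a position in a single cell; positions covered by more than this are counted as overlap evidence (default 2).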
max_insert_size int: Maximum fragment insert size in bp (default 900).
q_threshold float: FDR threshold for doublet calling (default 0.01).
q_rep_threshold float: FDR threshold for inferring repetitive regions (default 0.01).
repeat_filter str or Path, optional: BED file of known repetitive regions.
min_overlap_bp int: Minimum overlap length in bp to retain (default 1).
n_threads int: Parallel workers for overlap detection (default 1).
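To make expected_overlap, max_insert_size, and min_overlap_bp concrete, here is a rough sweep-line sketch over the fragments of a single cell on one chromosome. It is only an illustration of how such parameters could interact: the function overlap_regions is hypothetical, and whether GATAC applies the filters in this order or defines regions this way is an assumption.

```python
def overlap_regions(fragments, expected_overlap=2, max_insert_size=900,
                    min_overlap_bp=1):
    """Return (start, end) regions covered by more than expected_overlap fragments.

    fragments: list of half-open (start, end) intervals for one cell
    on one chromosome.
    """
    events = []
    for start, end in fragments:
        # Skip fragments longer than the allowed insert size.
        if end - start > max_insert_size:
            continue
        events.append((start, 1))
        events.append((end, -1))
    # Tuples sort ends (-1) before starts (+1) at the same position,
    # so fragments that merely touch do not count as overlapping.
    events.sort()

    regions = []
    depth = 0
    region_start = None
    for pos, delta in events:
        new_depth = depth + delta
        if depth <= expected_overlap < new_depth:
            # Coverage just rose above the expected maximum.
            region_start = pos
        elif new_depth <= expected_overlap < depth:
            # Coverage dropped back; keep the region if it is long enough.
            if pos - region_start >= min_overlap_bp:
                regions.append((region_start, pos))
            region_start = None
        depth = new_depth
    return regions

# Toy fragments: three overlap between 200 and 300; the last one
# exceeds max_insert_size and is ignored.
toy = [(100, 300), (150, 350), (200, 400), (1000, 1200), (1100, 3000)]
regions = overlap_regions(toy)
```

Raising min_overlap_bp above the region length (100 bp in this toy case) discards the region.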

Returns:

pd.DataFrame: Per-cell results with columns cell_id, p_value, q_value, is_doublet.

Return type:

DataFrame

Notes

Only canonical autosomes (chr1..chrN) are considered: sex chromosomes, mitochondria, and decoy contigs are dropped, matching the default behaviour of the original AMULET v1.1 tool (human_autosomes.txt). AMULET’s Poisson model assumes a uniform single-copy background signal, which is not valid for chrX, chrY, chrM, or unplaced contigs.
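The autosome restriction can be pictured as a name filter on the chromosome sizes. The sketch below is one possible way to express it, assuming UCSC-style names with a "chr" prefix; the helper canonical_autosomes and the regular expression are illustrative and not taken from GATAC's source.

```python
import re

# Matches chr1, chr2, ..., chr22, but not chrX, chrY, chrM, or contigs
# such as chr1_KI270706v1_random.
_AUTOSOME = re.compile(r"^chr[1-9][0-9]*$")

def canonical_autosomes(chrom_sizes):
    """Keep only entries whose name looks like a numbered autosome."""
    return {name: size for name, size in chrom_sizes.items()
            if _AUTOSOME.match(name)}

# Illustrative subset of a chromosome sizes dict.
sizes = {
    "chr1": 248956422,
    "chr22": 50818468,
    "chrX": 156040895,
    "chrM": 16569,
    "chr1_KI270706v1_random": 175055,
}
autosomes = canonical_autosomes(sizes)
```

A filter of this kind would need adjusting for assemblies that name chromosomes without the "chr" prefix.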

Examples

>>> import gatac as ga
>>> result = ga.pp.detect_doublets("pbmc.parquet", chrom_sizes="hg38")
>>> result.columns.tolist()
['cell_id', 'p_value', 'q_value', 'is_doublet']
>>> # Keep only singlets (assumes `adata` is an existing AnnData whose obs_names are the cell barcodes)
>>> doublets = set(result.loc[result["is_doublet"], "cell_id"])
>>> keep = adata[~adata.obs_names.isin(doublets)].copy()
