gatac.pp.detect_doublets
- gatac.pp.detect_doublets(fragment_path, chrom_sizes, barcodes=None, min_fragments=100, expected_overlap=2, max_insert_size=900, q_threshold=0.01, q_rep_threshold=0.01, repeat_filter=None, min_overlap_bp=1, n_threads=1)
AMULET doublet/multiplet detection from a GATAC parquet fragment file.
Implements the original AMULET Poisson method of Thibodeau et al. (2021): cells with an abnormally high number of overlapping fragment insertions are flagged as doublets/multiplets.
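The statistical idea can be sketched in a few lines of standard-library Python. This is a simplified illustration, not GATAC's implementation: the function names are hypothetical, the background rate is assumed to be the mean overlap count across cells, and the p-value is assumed to be the upper tail P(X >= count) followed by a Benjamini-Hochberg correction.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    # Sum P(X = 0) .. P(X = k - 1) iteratively to avoid large factorials.
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        q_values[idx] = running_min
    return q_values


def call_doublets(overlap_counts, q_threshold=0.01):
    """Flag cells whose overlap count is unusually high.

    overlap_counts: hypothetical dict of cell_id -> number of overlap regions.
    Returns a dict of cell_id -> (p_value, q_value, is_doublet).
    """
    cells = list(overlap_counts)
    # Assumption: the Poisson rate is the mean overlap count per cell.
    lam = sum(overlap_counts.values()) / len(cells)
    p = [poisson_sf(overlap_counts[c], lam) for c in cells]
    q = benjamini_hochberg(p)
    return {c: (pv, qv, qv < q_threshold) for c, pv, qv in zip(cells, p, q)}


counts = {"c1": 1, "c2": 0, "c3": 2, "c4": 1, "c5": 30}
flagged = [c for c, (_, _, is_doublet) in call_doublets(counts).items() if is_doublet]
print(flagged)  # ['c5']
```

In this toy input only the cell with 30 overlap regions falls below the 0.01 FDR threshold; the actual tool may estimate the background rate differently.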
- Parameters:
- fragment_path
str or Path GATAC parquet fragment file (output of gatac convert).
- chrom_sizes
dict or str Chromosome sizes dict, or genome name (e.g. 'hg38', 'mm10').
- barcodes
list of str, optional Barcodes to test. If None, all barcodes with >= min_fragments fragments are used.
- min_fragments
int Minimum unique fragments per cell to include (default 100).
- expected_overlap
int Expected number of reads overlapping (default 2).
- max_insert_size
int Maximum fragment insert size in bp (default 900).
- q_threshold
float FDR threshold for doublet calling (default 0.01).
- q_rep_threshold
float FDR threshold for inferring repetitive regions (default 0.01).
- repeat_filter
str or Path, optional BED file of known repetitive regions.
- min_overlap_bp
int Minimum overlap length in bp to retain (default 1).
- n_threads
int Parallel workers for overlap detection (default 1).
- Returns:
pd.DataFrame Per-cell results with columns: cell_id, p_value, q_value, is_doublet.
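To make the overlap-related parameters (expected_overlap, max_insert_size, min_overlap_bp) more concrete, here is a minimal sweep-line sketch for a single cell on a single chromosome. It reflects one plausible reading of those parameters rather than GATAC's actual code: the function name is hypothetical, fragments are assumed to be half-open (start, end) intervals, fragments longer than max_insert_size are assumed to be skipped, and a region is counted when more than expected_overlap fragments overlap for at least min_overlap_bp bases.

```python
def count_overlap_regions(fragments, expected_overlap=2, max_insert_size=900, min_overlap_bp=1):
    """Count regions covered by more than `expected_overlap` fragments.

    fragments: iterable of (start, end) half-open intervals for ONE cell
    on ONE chromosome (hypothetical input layout).
    """
    events = []
    for start, end in fragments:
        if end - start > max_insert_size:
            continue  # assumption: over-long fragments are ignored
        events.append((start, 1))
        events.append((end, -1))
    # Tuples sort by position, then by delta, so at equal positions an end (-1)
    # is processed before a start (+1) and touching fragments do not overlap.
    events.sort()

    depth = 0
    region_start = None
    n_regions = 0
    for pos, delta in events:
        depth += delta
        if depth > expected_overlap and region_start is None:
            region_start = pos  # depth just exceeded the expected maximum
        elif depth <= expected_overlap and region_start is not None:
            if pos - region_start >= min_overlap_bp:
                n_regions += 1  # region is long enough to retain
            region_start = None
    return n_regions


# Three fragments share positions 200..300, so depth reaches 3 there.
print(count_overlap_regions([(100, 300), (150, 350), (200, 400)]))  # 1
```

A per-cell count of this kind is the sort of quantity a Poisson test could then be applied to.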
Notes
Only canonical autosomes (chr1..chrN) are considered: sex chromosomes, mitochondria, and decoy contigs are dropped, matching the default behaviour of the original AMULET v1.1 tool (human_autosomes.txt). AMULET’s Poisson model assumes a uniform single-copy background signal, which is not valid for chrX, chrY, chrM, or unplaced contigs.
Examples
>>> import gatac as ga
>>> result = ga.pp.detect_doublets("pbmc.parquet", chrom_sizes="hg38")
>>> result.columns.tolist()
['cell_id', 'p_value', 'q_value', 'is_doublet']
>>> # Filter cells to keep only singlets
>>> doublets = set(result.loc[result["is_doublet"], "cell_id"])
>>> keep = adata[~adata.obs_names.isin(doublets)].copy()
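The autosome restriction described in the Notes can be previewed on a chromosome sizes dict before running the function. The sketch below is illustrative only: the helper name and the regular expression are assumptions about what "canonical autosome" means (chr followed by digits only), and the lengths are placeholders rather than real genome values.

```python
import re

# Assumed naming pattern for canonical autosomes: "chr" followed by digits only.
AUTOSOME_RE = re.compile(r"^chr\d+$")


def keep_autosomes(chrom_sizes):
    """Return only the chr1..chrN entries of a {name: length} dict."""
    return {name: size for name, size in chrom_sizes.items() if AUTOSOME_RE.match(name)}


# Placeholder lengths for illustration; not real genome sizes.
sizes = {
    "chr1": 1000,
    "chr2": 900,
    "chrX": 800,
    "chrM": 16,
    "chrUn_example": 5,
}
print(sorted(keep_autosomes(sizes)))  # ['chr1', 'chr2']
```

Contigs such as chrX, chrM, and unplaced scaffolds do not match the pattern and would be excluded under this assumption.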