gatac.tl.sample_gc_matched_background


gatac.tl.sample_gc_matched_background(regions, genome_fasta, *, background_pool, n_background=None, n_bins=50, replace=True, random_state=0)

Sample background peaks whose GC-content distribution matches target peaks.

Parameters:
regions list[str] or dict[str, list[str]]

Target peaks to match. When a dict is provided, each group is sampled independently and the return value mirrors the same keys.

genome_fasta str or Path

Path to the reference genome FASTA used to compute GC content.

background_pool list[str] or dict[str, list[str]]

Candidate background peaks to sample from. When regions is a dict, this can be either one shared pool or a dict keyed like regions.

n_background int, optional

Number of peaks to sample per target group. Defaults to the size of the corresponding target set.

n_bins int, default 50

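Number of GC-content bins used for matching, which sets the matching resolution. Larger values enforce tighter GC matching but leave fewer candidate peaks in each bin.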

replace bool, default True

Whether sampled background peaks may be reused. Set to False to require unique sampled peaks within each returned background set.

random_state int, default 0

Seed for the random number generator. A fixed value makes sampling reproducible across calls.

Returns:

list[str] or dict[str, list[str]]

GC-matched background peaks, returned in the same container shape as regions: a list for list input, or a dict with the same keys for dict input.

Return type:

list[str] | dict[str, list[str]]
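The container rules for regions and background_pool can be sketched in a few lines. This is only an illustration of the documented behaviour, not gatac's implementation: pair_groups is a hypothetical helper, the "chr:start-end" peak strings are an assumed format, and raising on a dict pool with list regions is an assumption, since the documentation does not cover that case.

```python
def pair_groups(regions, background_pool):
    """Pair each target group with the pool it samples from.

    Hypothetical helper for illustration only; not part of gatac.
    """
    if not isinstance(regions, dict):
        if isinstance(background_pool, dict):
            # Assumption: a keyed pool only makes sense with keyed regions.
            raise TypeError("a dict background_pool requires dict regions")
        # Single target list: use None as a placeholder group key.
        return {None: (list(regions), list(background_pool))}

    pairs = {}
    for key, targets in regions.items():
        if isinstance(background_pool, dict):
            pool = background_pool[key]   # per-group pool, keyed like regions
        else:
            pool = background_pool        # one pool shared by every group
        pairs[key] = (list(targets), list(pool))
    return pairs


shared = pair_groups(
    {"T cell": ["chr1:100-600"], "B cell": ["chr2:100-600"]},
    ["chr3:100-600", "chr4:100-600"],
)
```

Each group is then sampled on its own, and the results are collected under the same keys.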

Examples

>>> import gatac as ga
>>> # One target list, sampled with the default settings
>>> matched_bg = ga.tl.sample_gc_matched_background(
...     da_peaks,
...     genome_fasta="genome.fa",
...     background_pool=all_peaks,
... )
>>> # Per-group targets with a shared pool, sampled without reuse
>>> matched_bg_by_group = ga.tl.sample_gc_matched_background(
...     marker_peaks,
...     genome_fasta="genome.fa",
...     background_pool=list(peak_adata.var_names),
...     replace=False,
... )
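
To make the roles of n_bins, replace and random_state concrete, here is a minimal standard-library sketch of GC-binned sampling. It assumes equal-width bins over the GC fraction and one draw per target peak. gatac's actual algorithm may differ, and gc_fraction, gc_bin and sample_matched are hypothetical names rather than gatac functions. The inputs are precomputed GC fractions, so no FASTA is read here.

```python
import random
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G/C among unambiguous (A/C/G/T) bases; 0.0 if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def gc_bin(gc, n_bins):
    """Map a GC fraction in [0, 1] to an equal-width bin index."""
    return min(int(gc * n_bins), n_bins - 1)


def sample_matched(target_gc, pool_gc, n_bins=50, replace=True, seed=0):
    """Draw one pool peak from the same GC bin as each target peak.

    target_gc and pool_gc map peak name -> GC fraction (assumed inputs).
    """
    rng = random.Random(seed)

    # Group candidate peaks by GC bin.
    by_bin = defaultdict(list)
    for name, gc in pool_gc.items():
        by_bin[gc_bin(gc, n_bins)].append(name)

    sampled = []
    for gc in target_gc.values():
        candidates = by_bin[gc_bin(gc, n_bins)]
        if not candidates:
            continue  # empty bin: this sketch skips the target
        pick = rng.choice(candidates)
        if not replace:
            candidates.remove(pick)  # each pool peak is used at most once
        sampled.append(pick)
    return sampled


targets = {"t1": 0.41, "t2": 0.79}
pool = {"p1": 0.405, "p2": 0.415, "p3": 0.785, "p4": 0.10}
matched = sample_matched(targets, pool, n_bins=50, seed=0)
```

With more bins, each bin covers a narrower GC range, so matches are tighter but a bin is more likely to be empty or to run out of candidates when replace is False.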