gatac.pp.filter_fragments

gatac.pp.filter_fragments(input_parquet, output_parquet=None, metrics=None, min_fragments_per_cell=100, filter_query=None, barcode_column='barcode', barcode_prefix=None, row_groups_per_batch=64, chrom_sizes=None)

Filter ATAC fragment parquet file(s) based on cell quality metrics.

Uses GPU acceleration and streaming writes for memory-efficient processing of large files. Supports both single and multiple input files.

Parameters:
input_parquet str, Path, or list

Path to input parquet file(s) containing fragment data. Expected columns: 'chrom', 'start', 'end', 'barcode', 'count'.

output_parquet str, Path, or list, optional

Path for the filtered output parquet file(s). If None, the input name with a '_filtered' suffix is used. For multiple inputs, pass None or a list of the same length as input_parquet.
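The default naming rule can be sketched with pathlib. This is only an illustration: default_output_path is a hypothetical helper, and placing '_filtered' before the file extension is an assumption, since the exact naming scheme is not specified here.

```python
from pathlib import Path


def default_output_path(input_path):
    """Hypothetical helper: derive an output name with a '_filtered' suffix.

    Assumption: the suffix goes before the file extension.
    """
    p = Path(input_path)
    return p.with_name(f"{p.stem}_filtered{p.suffix}")


# One output path per input path, in the same order
inputs = ["sample1.parquet", "data/sample2.parquet"]
outputs = [default_output_path(p) for p in inputs]
```

When output_parquet is given as a list, it presumably pairs with the inputs position by position in the same way.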

metrics str, Path, pandas.DataFrame, or cudf.DataFrame, optional

Cell quality metrics to use for filtering. Can be a path to a CSV file, or a pre-loaded DataFrame.

min_fragments_per_cell int, default 100

Minimum number of unique fragments required per cell.
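As a rough CPU-only illustration of this threshold, the sketch below counts unique fragments per barcode on toy data. It assumes that "unique" means distinct (chrom, start, end) coordinates within a barcode and that the threshold is inclusive; neither detail is confirmed here, and cells_passing is a hypothetical helper rather than part of gatac.

```python
from collections import Counter

# Toy fragments: (chrom, start, end, barcode, count)
fragments = [
    ("chr1", 100, 250, "AAAC", 2),
    ("chr1", 100, 250, "AAAC", 1),  # same coordinates, same cell: counted once
    ("chr1", 400, 600, "AAAC", 1),
    ("chr2", 50, 180, "TTTG", 1),
]


def cells_passing(fragments, min_fragments_per_cell):
    """Hypothetical helper: barcodes with enough unique fragments."""
    # Deduplicate on coordinates plus barcode (assumed meaning of "unique")
    unique = {(chrom, start, end, bc) for chrom, start, end, bc, _ in fragments}
    per_cell = Counter(bc for _, _, _, bc in unique)
    # Assumption: a cell with exactly the minimum count is kept
    return {bc for bc, n in per_cell.items() if n >= min_fragments_per_cell}


kept = cells_passing(fragments, min_fragments_per_cell=2)
```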

filter_query str, optional

Query string for filtering cells based on metrics.
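The query syntax used in the examples on this page resembles pandas DataFrame.query expressions. Assuming that is how the string is evaluated (this is not confirmed here), a query can be checked on a small metrics table before running the full filter. The column values below are invented for illustration.

```python
import pandas as pd

# Toy metrics table; column names follow the examples on this page
metrics = pd.DataFrame(
    {
        "barcode": ["AAAC", "CCGT", "TTTG"],
        "tsse_score": [8.2, 3.1, 6.5],
        "n_unique": [2500, 4000, 800],
    }
)

filter_query = "tsse_score > 5 and n_unique > 1000"

# Assumption: the query is applied pandas-style to the metrics rows
passing = metrics.query(filter_query)["barcode"].tolist()
```

Only cells that satisfy every clause of the query remain in passing.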

barcode_column str, default "barcode"

Name of the barcode column.

barcode_prefix str, optional

Prefix to add to barcodes before filtering.

row_groups_per_batch int, default 64

Number of parquet row groups to process per batch.
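To give a feel for what this parameter controls, the sketch below splits row-group indices into batches. row_group_batches is a hypothetical helper, and the actual batching logic inside gatac may differ.

```python
def row_group_batches(num_row_groups, row_groups_per_batch=64):
    """Hypothetical helper: yield lists of row-group indices, one per batch."""
    for start in range(0, num_row_groups, row_groups_per_batch):
        stop = min(start + row_groups_per_batch, num_row_groups)
        yield list(range(start, stop))


# A file with 150 row groups is processed in three batches
batches = list(row_group_batches(150, row_groups_per_batch=64))
batch_sizes = [len(b) for b in batches]
```

A smaller value means fewer row groups are held at once, at the cost of more batches.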

chrom_sizes dict, str, or genome object, optional

Chromosome sizes used to restrict which chromosomes are counted. Can be:

- String genome name (e.g., 'hg38', 'mm10'), which selects the built-in genome
- Dictionary of chromosome names to sizes
- Genome object with a chrom_sizes attribute

Only fragments on these chromosomes will be counted. If None, all chromosomes are included. Restricting to a genome's chromosomes matches SnapATAC2's behavior of excluding non-standard contigs (GL*, KI*).
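The chromosome restriction can be illustrated on toy data with a plain dictionary. This is a sketch of the idea only, assuming that membership in the dictionary is the sole criterion; the sizes shown are the hg38 lengths of chr1 and chr2.

```python
# Dictionary form of chrom_sizes, limited to two chromosomes for brevity
chrom_sizes = {"chr1": 248_956_422, "chr2": 242_193_529}

# Toy fragments: (chrom, start, end, barcode, count)
fragments = [
    ("chr1", 100, 250, "AAAC", 1),
    ("chr2", 50, 180, "AAAC", 1),
    ("GL000195.1", 10, 90, "AAAC", 1),   # unplaced contig
    ("KI270711.1", 5, 120, "TTTG", 1),   # unplaced contig
]

# Assumption: a fragment is counted only if its chromosome is a key
counted = [f for f in fragments if f[0] in chrom_sizes]
```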

Returns:

output_path Path or list of Path

Path(s) to the filtered parquet file(s).

Return type:

Path | List[Path]

Examples

>>> # Filter by minimum fragment count only
>>> filter_fragments("fragments.parquet", min_fragments_per_cell=500)
>>> # Filter using metrics CSV with quality threshold
>>> filter_fragments(
...     "fragments.parquet",
...     metrics="metrics.csv",
...     filter_query="tsse_score > 5 and n_unique > 1000"
... )
>>> # Filter multiple samples
>>> filter_fragments(
...     ["sample1.parquet", "sample2.parquet"],
...     metrics=combined_metrics_df,
...     filter_query="tsse_score > 5"
... )