gatac.pp.filter_fragments
- gatac.pp.filter_fragments(input_parquet, output_parquet=None, metrics=None, min_fragments_per_cell=100, filter_query=None, barcode_column='barcode', barcode_prefix=None, row_groups_per_batch=64, chrom_sizes=None)
Filter ATAC fragment parquet file(s) based on cell quality metrics.
Uses GPU acceleration and streaming writes for memory-efficient processing of large files. Supports both single and multiple input files.
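The batching controlled by `row_groups_per_batch` can be pictured with a minimal CPU-only sketch. This is an illustration, not gatac's implementation: `iter_row_group_batches` is a hypothetical helper, and it only shows how row-group indices might be grouped so that one batch at a time is held in memory.

```python
from typing import Iterator, List


def iter_row_group_batches(
    n_row_groups: int, row_groups_per_batch: int = 64
) -> Iterator[List[int]]:
    """Yield consecutive lists of row-group indices (hypothetical helper)."""
    if row_groups_per_batch < 1:
        raise ValueError("row_groups_per_batch must be >= 1")
    for start in range(0, n_row_groups, row_groups_per_batch):
        stop = min(start + row_groups_per_batch, n_row_groups)
        yield list(range(start, stop))


# A file with 150 row groups is split into three batches.
batches = list(iter_row_group_batches(150, row_groups_per_batch=64))
print([len(b) for b in batches])  # prints [64, 64, 22]
```

Under this picture, a smaller `row_groups_per_batch` lowers peak memory at the cost of more read and write steps.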
- Parameters:
- input_parquet
str, Path, or list: Path to input parquet file(s) containing fragment data. Expected columns: ‘chrom’, ‘start’, ‘end’, ‘barcode’, ‘count’.
- output_parquet
str, Path, or list, optional: Path for the filtered output parquet file(s). If None, the input name with a ‘_filtered’ suffix is used. For multiple inputs, can be None or a list of the same length as input_parquet.
- metrics
str, Path, pandas.DataFrame, or cudf.DataFrame, optional: Cell quality metrics to use for filtering. Can be a path to a CSV file or a pre-loaded DataFrame.
- min_fragments_per_cell
int, default 100: Minimum number of unique fragments required per cell.
- filter_query
str, optional: Query string for filtering cells based on columns of metrics, e.g. "tsse_score > 5 and n_unique > 1000".
- barcode_column
str, default "barcode": Name of the barcode column.
- barcode_prefix
str, optional: Prefix to add to barcodes before filtering.
- row_groups_per_batch
int, default 64: Number of parquet row groups to process per batch.
- chrom_sizes
dict, str, or genome object, optional: Chromosome sizes for filtering. Can be:
  - String genome name (e.g., ‘hg38’, ‘mm10’), which selects the built-in genome
  - Dictionary of chromosome names to sizes
  - Genome object with a chrom_sizes attribute
Only fragments on these chromosomes will be counted. If None, all chromosomes are included. Restricting the chromosomes this way matches SnapATAC2’s behavior of excluding non-standard contigs (GL*, KI*).
- Returns:
output_path : Path or list of Path Path(s) to the filtered parquet file(s).
- Return type:
Path or list of Path
Examples
>>> from gatac.pp import filter_fragments

>>> # Filter by minimum fragment count only
>>> filter_fragments("fragments.parquet", min_fragments_per_cell=500)

>>> # Filter using metrics CSV with quality threshold
>>> filter_fragments(
...     "fragments.parquet",
...     metrics="metrics.csv",
...     filter_query="tsse_score > 5 and n_unique > 1000",
... )

>>> # Filter multiple samples (combined_metrics_df is a pre-loaded metrics DataFrame)
>>> filter_fragments(
...     ["sample1.parquet", "sample2.parquet"],
...     metrics=combined_metrics_df,
...     filter_query="tsse_score > 5",
... )
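To make the interaction between `min_fragments_per_cell` and `chrom_sizes` concrete, the following pure-Python sketch mimics the selection on a handful of toy rows. It is not the library's GPU code path. `select_cells` and `filter_rows` are hypothetical helpers, and two behaviors are assumptions rather than documented facts: a "unique fragment" is taken to be a distinct (chrom, start, end) triple per barcode, and all rows of a retained barcode are kept in the output.

```python
from collections import defaultdict

# Toy fragments: (chrom, start, end, barcode, count)
fragments = [
    ("chr1", 100, 250, "AAAC", 1),
    ("chr1", 100, 250, "AAAC", 2),      # same coordinates: counted once
    ("chr1", 400, 580, "AAAC", 1),
    ("chr2", 50, 210, "AAAC", 1),
    ("chr1", 100, 250, "CCGT", 1),
    ("GL000195.1", 10, 90, "CCGT", 1),  # non-standard contig
    ("chr2", 300, 420, "CCGT", 1),
]


def select_cells(fragments, min_fragments_per_cell=100, chrom_sizes=None):
    """Return barcodes with enough unique fragments (hypothetical helper)."""
    unique = defaultdict(set)
    for chrom, start, end, barcode, _count in fragments:
        # Assumption: fragments on chromosomes absent from chrom_sizes
        # are ignored when counting.
        if chrom_sizes is not None and chrom not in chrom_sizes:
            continue
        unique[barcode].add((chrom, start, end))
    return {bc for bc, frags in unique.items()
            if len(frags) >= min_fragments_per_cell}


def filter_rows(fragments, keep):
    """Keep every row whose barcode passed the cell filter."""
    return [row for row in fragments if row[3] in keep]


chrom_sizes = {"chr1": 248956422, "chr2": 242193529}  # hg38 lengths

kept = select_cells(fragments, min_fragments_per_cell=3, chrom_sizes=chrom_sizes)
print(sorted(kept))                    # prints ['AAAC']
print(len(filter_rows(fragments, kept)))  # prints 4
```

Without `chrom_sizes`, the fragment on GL000195.1 would count toward "CCGT", which would then reach the threshold of 3 and be retained as well.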