gatac.pp.make_parquet
- gatac.pp.make_parquet(input_path, output_path=None, separator='\t', barcode_prefix=None, row_group_size=1000000)
Convert an ATAC fragments TSV.GZ file to Parquet format.
Uses DuckDB for parallel decompression and parsing, enabling processing of files larger than RAM.
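The kind of DuckDB statement such a conversion could rely on is sketched below. This is a minimal illustration under stated assumptions, not gatac's actual implementation: the helpers `build_copy_sql` and `convert` are hypothetical, the column types are guesses based on the expected column list, and header or comment lines in the input are not handled.

```python
from pathlib import Path


def build_copy_sql(input_path, output_path, separator="\t",
                   barcode_prefix=None, row_group_size=1_000_000):
    """Return a DuckDB COPY statement that rewrites a fragments file as Parquet.

    Hypothetical helper for illustration only.
    """
    def quote(value):
        # Escape single quotes for use inside a SQL string literal.
        return "'" + str(value).replace("'", "''") + "'"

    barcode = "barcode"
    if barcode_prefix:
        # Prepend the prefix to every barcode with SQL string concatenation.
        barcode = f"{quote(barcode_prefix)} || barcode AS barcode"

    return (
        f'COPY (SELECT chrom, "start", "end", {barcode}, "count" '
        f"FROM read_csv({quote(input_path)}, delim={quote(separator)}, header=false, "
        # Assumed column types; the real schema may differ.
        "columns={'chrom': 'VARCHAR', 'start': 'INTEGER', 'end': 'INTEGER', "
        "'barcode': 'VARCHAR', 'count': 'INTEGER'})) "
        f"TO {quote(output_path)} (FORMAT PARQUET, ROW_GROUP_SIZE {int(row_group_size)})"
    )


def convert(input_path, output_path, **kwargs):
    """Run the COPY statement in an in-memory DuckDB database (hypothetical)."""
    import duckdb  # third-party; imported here so the SQL builder works without it

    duckdb.connect().execute(build_copy_sql(input_path, output_path, **kwargs))
    return Path(output_path)


sql = build_copy_sql("pbmc.tsv.gz", "pbmc.parquet", barcode_prefix="sample1#")
```

DuckDB detects gzip compression from the `.gz` extension and streams the query result into the Parquet writer, which is what makes a conversion of this shape feasible for inputs that do not fit in memory.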
- Parameters:
- input_path
str or Path Path to input .tsv.gz (or .bed.gz) file with ATAC fragments. Expected columns: chrom, start, end, barcode, count
- output_path
str or Path, optional Path for output .parquet file. If None, uses input filename with .parquet extension.
- separator
str Column separator (default: tab)
- barcode_prefix
str, optional Prefix to prepend to barcodes (e.g. "sample1#")
- row_group_size
int Number of rows per Parquet row group (default: 1_000_000). Larger groups reduce GPU kernel-launch overhead when streaming via cudf.read_parquet(row_groups=…). The default is tuned for the gatac batch size of 64 row groups (~64 M rows/batch fits comfortably on a 12 GB GPU).
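To make the row_group_size arithmetic concrete, the sketch below splits a file's row groups into batches of 64, matching the batch size quoted above. The helper `row_group_batches` and the example count of 150 row groups are hypothetical and only illustrate the bookkeeping; they are not part of gatac.

```python
from typing import Iterator, List

ROW_GROUP_SIZE = 1_000_000  # default rows per row group, from the signature


def row_group_batches(n_row_groups: int,
                      groups_per_batch: int = 64) -> Iterator[List[int]]:
    """Yield consecutive lists of row-group indices, one list per batch."""
    for start in range(0, n_row_groups, groups_per_batch):
        stop = min(start + groups_per_batch, n_row_groups)
        yield list(range(start, stop))


# Example: a file with 150 row groups (an arbitrary illustrative number).
batches = list(row_group_batches(150))

# A full batch of 64 row groups holds 64 * 1_000_000 rows.
rows_per_full_batch = len(batches[0]) * ROW_GROUP_SIZE

# Each index list could then be handed to a reader that accepts row-group
# selections, e.g. cudf.read_parquet(path, row_groups=batch).
```

With the defaults, every full batch covers 64 million rows, and only the final batch is shorter.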
- Returns:
Path to the created Parquet file.
- Return type:
Path
Examples
>>> import gatac as ga
>>> out = ga.pp.make_parquet("pbmc.tsv.gz")
>>> # Or with a sample-specific barcode prefix
>>> out = ga.pp.make_parquet("pbmc.tsv.gz", barcode_prefix="sample1#")