gatac.pp.make_parquet


gatac.pp.make_parquet(input_path, output_path=None, separator='\t', barcode_prefix=None, row_group_size=1000000)

Convert an ATAC fragments file (.tsv.gz) to Parquet format.

Uses DuckDB for parallel decompression and parsing, enabling processing of files larger than RAM.
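As a rough illustration of this approach, the conversion can be expressed as a single DuckDB COPY statement that reads the compressed TSV and writes Parquet. This is a sketch only, not gatac's actual implementation: the helper name build_copy_sql and the column types are assumptions, and the code only builds the SQL text.

```python
def build_copy_sql(input_path, output_path, separator="\t",
                   barcode_prefix=None, row_group_size=1_000_000):
    """Return a DuckDB COPY statement (hypothetical helper, not gatac code)."""

    def quote(value):
        # Double any single quotes so the value is a valid SQL string literal.
        return "'" + str(value).replace("'", "''") + "'"

    # Optionally prepend the sample prefix to every barcode.
    barcode = "barcode"
    if barcode_prefix:
        barcode = f"{quote(barcode_prefix)} || barcode AS barcode"

    # Names follow the documented column layout; the types are assumptions.
    columns = (
        "{'chrom': 'VARCHAR', 'start': 'INTEGER', 'end': 'INTEGER', "
        "'barcode': 'VARCHAR', 'count': 'INTEGER'}"
    )

    return (
        f'COPY (SELECT chrom, "start", "end", {barcode}, "count" '
        f"FROM read_csv({quote(input_path)}, delim={quote(separator)}, "
        f"header=false, columns={columns})) "
        f"TO {quote(output_path)} "
        f"(FORMAT PARQUET, ROW_GROUP_SIZE {int(row_group_size)})"
    )


sql = build_copy_sql("pbmc.tsv.gz", "pbmc.parquet", barcode_prefix="sample1#")
print(sql)
```

If DuckDB is installed, the resulting string could be run with duckdb.connect().execute(sql). DuckDB streams the input, which is what allows files larger than RAM to be converted.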

Parameters:
input_path str or Path

Path to the input .tsv.gz (or .bed.gz) file containing ATAC fragments. Expected columns: chrom, start, end, barcode, count.

output_path str or Path, optional

Path for output .parquet file. If None, uses input filename with .parquet extension.

separator str

Column separator (default: tab)

barcode_prefix str, optional

Prefix to prepend to each barcode (e.g. "sample1#")

row_group_size int

Number of rows per Parquet row group (default: 1_000_000). Larger groups reduce GPU kernel-launch overhead when streaming via cudf.read_parquet(row_groups=…). The default is tuned for the gatac batch size of 64 row groups: 64 × 1,000,000 rows gives about 64 M rows per batch, which fits comfortably on a 12 GB GPU.
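The batch arithmetic for row_group_size can be sketched as follows. The helper row_group_batches is hypothetical and only illustrates how row-group indices might be grouped before being passed to cudf.read_parquet(row_groups=…); the batch size of 64 is taken from the parameter description, and the total of 150 row groups is made up.

```python
def row_group_batches(num_row_groups, batch_size=64):
    """Yield lists of consecutive row-group indices (hypothetical helper)."""
    for start in range(0, num_row_groups, batch_size):
        yield list(range(start, min(start + batch_size, num_row_groups)))


row_group_size = 1_000_000
batches = list(row_group_batches(num_row_groups=150))

# A full batch holds batch_size row groups of row_group_size rows each.
rows_per_full_batch = row_group_size * len(batches[0])
print(len(batches), rows_per_full_batch)  # prints: 3 64000000
```

The last batch is shorter than the others whenever the number of row groups is not a multiple of the batch size.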

Returns:

Path to the created Parquet file.

Return type:

Path
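When output_path is None, the returned path is derived from the input filename. One plausible derivation is sketched below. Whether gatac strips both the .gz and the .tsv/.bed suffixes (as assumed here) or only the last one is not stated in this reference, and default_output_path is a hypothetical name.

```python
from pathlib import Path


def default_output_path(input_path):
    """Derive a .parquet path next to the input file (assumed behaviour)."""
    path = Path(input_path)
    name = path.name
    # Strip the compression suffix first, then the table suffix.
    for suffix in (".gz", ".tsv", ".bed"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return path.with_name(name + ".parquet")


derived = default_output_path("data/pbmc.tsv.gz")
print(derived.name)  # prints: pbmc.parquet
```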

Examples

>>> import gatac as ga
>>> out = ga.pp.make_parquet("pbmc.tsv.gz")
>>> # Or with a sample-specific barcode prefix
>>> out = ga.pp.make_parquet("pbmc.tsv.gz", barcode_prefix="sample1#")
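To try the function without a real dataset, a small fragments file in the expected five-column layout can be written with the standard library. The file name and the fragment values below are made up for illustration.

```python
import gzip
import tempfile
from pathlib import Path

# Toy fragments: chrom, start, end, barcode, count (values are invented).
fragments = [
    ("chr1", 10000, 10250, "AAACGAAAGTCCTAGG-1", 2),
    ("chr1", 10400, 10610, "AAACGAAAGTCCTAGG-1", 1),
    ("chr2", 5000, 5180, "TTTGTGTTCAGGCTAA-1", 3),
]

tmp_dir = Path(tempfile.mkdtemp())
fragments_path = tmp_dir / "toy_fragments.tsv.gz"

# Write a headerless, tab-separated, gzip-compressed file.
with gzip.open(fragments_path, "wt", newline="\n") as handle:
    for row in fragments:
        handle.write("\t".join(map(str, row)) + "\n")

# Read it back to confirm the layout.
with gzip.open(fragments_path, "rt") as handle:
    rows = [line.rstrip("\n").split("\t") for line in handle]
```

The resulting fragments_path could then be passed as input_path, for example ga.pp.make_parquet(fragments_path).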