gatac.pp.make_parquet_batch

gatac.pp.make_parquet_batch(input_paths, output_dir=None, workers=None, separator='\t', barcode_prefix=None, row_group_size=1000000)

Convert multiple ATAC fragment TSV.GZ files to Parquet in parallel.

Each file is converted in its own worker process, so the per-file DuckDB conversions run concurrently and together can use all available CPU cores.
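The one-process-per-file pattern described above can be sketched with the standard library's concurrent.futures. This is a minimal illustration under stated assumptions, not gatac's actual implementation: run_batch, default_workers and the executor_cls parameter are hypothetical names introduced here. The sketch collects results in submission order, so the exception it re-raises is the first one in input order; the library may define "first" differently.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def default_workers(n_inputs):
    """Hypothetical helper: min(number of inputs, CPU count), never below 1."""
    # os.cpu_count() can return None, so fall back to a single worker.
    return max(1, min(n_inputs, os.cpu_count() or 1))


def run_batch(func, items, workers=None, executor_cls=ProcessPoolExecutor):
    """Hypothetical sketch: apply func to each item in a worker pool.

    Results come back in the same order as items. If a worker raises,
    Future.result() re-raises that exception in the caller.
    """
    items = list(items)
    if not items:
        return []
    if workers is None:
        workers = default_workers(len(items))
    with executor_cls(max_workers=workers) as pool:
        # One task per input file; the futures list keeps submission order.
        futures = [pool.submit(func, item) for item in items]
        return [future.result() for future in futures]
```

With the default ProcessPoolExecutor, func must be picklable (a module-level function), and scripts on platforms that spawn new interpreters should call run_batch under an `if __name__ == "__main__":` guard. Passing executor_cls=ThreadPoolExecutor runs the same logic with threads, which is convenient for quick experiments.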

Parameters:

input_paths list of str or Path: Paths to input .tsv.gz (or .bed.gz) files.
output_dir str or Path, optional: Directory for output Parquet files. If None, each output is placed in the same directory as its input.
workers int, optional: Number of parallel worker processes. Defaults to min(len(input_paths), os.cpu_count()).
separator str: Column separator forwarded to make_parquet().
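barcode_prefix str or list of str, optional: Prefix forwarded to make_parquet(). As in the second example below, a list supplies one prefix per input file, matched by position in input_paths.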
row_group_size int: Row-group size forwarded to make_parquet().
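
The output_dir rule above (write next to the input when output_dir is None) can be sketched as follows. The helper planned_output_path is hypothetical, and the output file name is an assumption: this page does not state how make_parquet() names its outputs, so the sketch simply swaps a trailing .tsv.gz or .bed.gz for .parquet.

```python
from pathlib import Path


def planned_output_path(input_path, output_dir=None):
    """Hypothetical helper: guess where the Parquet file for one input lands."""
    src = Path(input_path)
    name = src.name
    # Assumed naming: drop a known compressed suffix, else the last suffix only.
    for suffix in (".tsv.gz", ".bed.gz"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    else:
        name = src.stem
    # output_dir=None means "same directory as the input file".
    target_dir = Path(output_dir) if output_dir is not None else src.parent
    return target_dir / f"{name}.parquet"
```

For example, planned_output_path("data/sampleA.tsv.gz") yields data/sampleA.parquet, while passing output_dir="parquet/" yields parquet/sampleA.parquet.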

Returns:

list of Path: Output Parquet paths in the same order as input_paths.

Raises:

Exception – Re-raises the first worker exception encountered so the caller can handle it.

Return type:

list[Path]

Examples

>>> import gatac as ga
>>> # Convert multiple samples in parallel
>>> paths = ga.pp.make_parquet_batch(
...     ["sampleA.tsv.gz", "sampleB.tsv.gz"],
...     output_dir="parquet/",
... )
>>> # With per-sample barcode prefixes (must match input order)
>>> paths = ga.pp.make_parquet_batch(
...     ["sampleA.tsv.gz", "sampleB.tsv.gz"],
...     barcode_prefix=["A_", "B_"],
... )
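
Because per-sample prefixes are matched to inputs by position, it can help to check the pairing before starting a long batch. The sketch below is a hypothetical pre-flight helper, pair_prefixes, written for this page; it is not part of gatac, and whether the library performs an equivalent length check is an assumption left unverified here.

```python
def pair_prefixes(input_paths, barcode_prefix=None):
    """Hypothetical helper: pair each input path with its barcode prefix."""
    paths = list(input_paths)
    if barcode_prefix is None or isinstance(barcode_prefix, str):
        # A single value (or None) is reused for every input.
        prefixes = [barcode_prefix] * len(paths)
    else:
        prefixes = list(barcode_prefix)
        if len(prefixes) != len(paths):
            raise ValueError(
                f"got {len(prefixes)} prefixes for {len(paths)} input paths"
            )
    # Position i of the prefix list belongs to position i of the inputs.
    return list(zip(paths, prefixes))
```

Printing the pairs returned by pair_prefixes(["sampleA.tsv.gz", "sampleB.tsv.gz"], ["A_", "B_"]) makes a swapped or missing prefix visible before any conversion runs.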
