gatac.pp.make_parquet_batch

gatac.pp.make_parquet_batch(input_paths, output_dir=None, workers=None, separator='\t', barcode_prefix=None, row_group_size=1000000)

Convert multiple ATAC fragment TSV.GZ files to Parquet in parallel.

Each file is processed in a separate worker process, so DuckDB can use all available CPU cores across files simultaneously.

Parameters:
input_paths list of str or Path

Paths to input .tsv.gz (or .bed.gz) files.

output_dir str or Path, optional

Directory for output Parquet files. If None, each output is placed in the same directory as its input.

workers int, optional

Number of parallel worker processes. Defaults to min(len(input_paths), os.cpu_count()).

separator str

Column separator forwarded to make_parquet().

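barcode_prefix str or list of str, optional

Prefix forwarded to make_parquet(). As the second example under Examples shows, a list supplies one prefix per input file, in the same order as input_paths.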

row_group_size int

Row-group size forwarded to make_parquet().

Returns:

list of Path

Output Parquet paths, in the same order as input_paths.

Raises:

Exception – Re-raises the first worker exception encountered so the caller can handle it.

Return type:

list[Path]
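The documented behavior (one worker per file, a default of min(len(input_paths), os.cpu_count()) workers, results in input order, and a worker exception re-raised to the caller) can be illustrated with a minimal sketch built on concurrent.futures. This is not gatac's implementation: run_batch, convert_one, and executor_cls are hypothetical names introduced here, and the `or 1` fallback for os.cpu_count() returning None is an assumption.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def run_batch(convert_one, input_paths, workers=None,
              executor_cls=ProcessPoolExecutor):
    """Call convert_one(path) for each path; return results in input order.

    Hypothetical helper for illustration only, not part of gatac.
    """
    input_paths = list(input_paths)
    if not input_paths:
        return []
    if workers is None:
        # Documented default; `or 1` (an assumption) covers the case
        # where os.cpu_count() returns None.
        workers = min(len(input_paths), os.cpu_count() or 1)
    with executor_cls(max_workers=workers) as pool:
        # Submit everything up front so the files are processed concurrently.
        futures = [pool.submit(convert_one, path) for path in input_paths]
        # Collect in submission order: results line up with input_paths,
        # and result() re-raises the exception of the first failed input.
        return [future.result() for future in futures]
```

In this sketch "first" means first in input order, which may differ from the first failure in wall-clock time; the source does not say which of the two gatac uses. With the default ProcessPoolExecutor, convert_one has to be a picklable, module-level function, and on platforms that spawn workers the call should sit under an `if __name__ == "__main__":` guard.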

Examples

>>> import gatac as ga
>>> # Convert multiple samples in parallel
>>> paths = ga.pp.make_parquet_batch(
...     ["sampleA.tsv.gz", "sampleB.tsv.gz"],
...     output_dir="parquet/",
... )
>>> # With per-sample barcode prefixes (must match input order)
>>> paths = ga.pp.make_parquet_batch(
...     ["sampleA.tsv.gz", "sampleB.tsv.gz"],
...     barcode_prefix=["A_", "B_"],
... )
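The output_dir rule from the Parameters section (outputs go to output_dir, or next to each input when it is None) can be sketched with pathlib. The helper planned_output_path is hypothetical, and the output file name is an assumption: the reference above does not state how gatac names the Parquet file, so replacing the .tsv.gz or .bed.gz suffix with .parquet is only a plausible guess.

```python
from pathlib import Path


def planned_output_path(input_path, output_dir=None):
    """Hypothetical helper: guess where the Parquet file for input_path goes."""
    input_path = Path(input_path)
    name = input_path.name
    # Assumption: strip a trailing ".gz", then the ".tsv" / ".bed" extension.
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    stem = Path(name).stem
    # output_dir=None means "same directory as the input", as documented.
    target_dir = input_path.parent if output_dir is None else Path(output_dir)
    return target_dir / f"{stem}.parquet"
```

In practice the list returned by make_parquet_batch is the authoritative source for output locations; a helper like this is only useful for planning or for checking which outputs already exist.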