gatac.pp.make_parquet_batch
- gatac.pp.make_parquet_batch(input_paths, output_dir=None, workers=None, separator='\t', barcode_prefix=None, row_group_size=1000000)
Convert multiple ATAC fragment TSV.GZ files to Parquet in parallel.
Each file is processed in a separate worker process, so DuckDB can use all available CPU cores across files simultaneously.
- Parameters:
- input_paths
list of str or Path Paths to input .tsv.gz (or .bed.gz) files.
- output_dir
str or Path, optional Directory for output Parquet files. If None, each output is placed in the same directory as its input.
- workers
int, optional Number of parallel worker processes. Defaults to min(len(input_paths), os.cpu_count()).
- separator
str Column separator forwarded to make_parquet().
- barcode_prefix
str, optional Prefix forwarded to make_parquet().
- row_group_size
int Row-group size forwarded to make_parquet().
- Returns:
list of Path Output Parquet paths in the same order as input_paths.
- Raises:
Exception – Re-raises the first worker exception encountered so the caller can handle it.
Examples
>>> import gatac as ga
>>> # Convert multiple samples in parallel
>>> paths = ga.pp.make_parquet_batch(
...     ["sampleA.tsv.gz", "sampleB.tsv.gz"],
...     output_dir="parquet/",
... )
>>> # With per-sample barcode prefixes (must match input order)
>>> paths = ga.pp.make_parquet_batch(
...     ["sampleA.tsv.gz", "sampleB.tsv.gz"],
...     barcode_prefix=["A_", "B_"],
... )
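The behaviour documented above (one worker process per file, a default worker count of min(len(input_paths), os.cpu_count()), results returned in input order, and a worker exception re-raised in the caller) can be illustrated with a minimal standard-library sketch. This is not gatac's implementation. The names `default_workers`, `resolve_output`, `run_batch`, `convert`, and `executor_cls` are hypothetical, and the output naming rule (`sampleA.tsv.gz` becomes `sampleA.parquet`) is an assumption, since the page does not state how output files are named.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def default_workers(n_inputs):
    """Documented default: min(len(input_paths), os.cpu_count()).

    os.cpu_count() can return None, so fall back to 1 in that case.
    """
    return max(1, min(n_inputs, os.cpu_count() or 1))


def resolve_output(input_path, output_dir=None):
    """Pick an output path for one input file.

    ASSUMPTION: the '.tsv.gz' / '.bed.gz' suffix is replaced by '.parquet'.
    The real naming rule is not stated in the documentation.
    """
    src = Path(input_path)
    stem = src.name
    for suffix in (".tsv.gz", ".bed.gz"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
            break
    # output_dir=None keeps each output next to its input, as documented.
    target_dir = Path(output_dir) if output_dir is not None else src.parent
    return target_dir / f"{stem}.parquet"


def run_batch(convert, input_paths, output_dir=None, workers=None,
              executor_cls=ProcessPoolExecutor):
    """Run convert(src, dst) once per input, in parallel.

    `convert` is a hypothetical stand-in for the per-file conversion and must
    be a module-level (picklable) function when a process pool is used.
    """
    inputs = [Path(p) for p in input_paths]
    if not inputs:
        return []
    outputs = [resolve_output(p, output_dir) for p in inputs]
    n_workers = workers or default_workers(len(inputs))
    with executor_cls(max_workers=n_workers) as pool:
        futures = [pool.submit(convert, src, dst)
                   for src, dst in zip(inputs, outputs)]
        # Collecting in submission order keeps results aligned with
        # input_paths. Future.result() re-raises a worker's exception here,
        # so the first failure in input order propagates to the caller.
        return [future.result() for future in futures]
```

Passing `executor_cls=ThreadPoolExecutor` (from `concurrent.futures`) is a convenient way to exercise the same control flow in a quick test without spawning processes. Collecting futures in submission order, instead of with `as_completed`, is what keeps the returned paths aligned with the inputs.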