gatac convert
Convert raw ATAC-seq fragment files from TSV.GZ to Parquet format. The command uses DuckDB for parallel decompression and writes Parquet row groups sized for efficient GPU streaming.
Synopsis
gatac convert <input> [--output OUTPUT] [--output-dir DIR]
              [-j WORKERS] [--barcode-prefix PREFIX]
              [--row-group-size N] [--separator SEP]
Arguments
Positional
| Argument | Description |
|---|---|
| `input` | Path (or glob pattern) to one or more TSV.GZ fragment files |
Options
| Flag | Default | Description |
|---|---|---|
| `--output` | | Output Parquet file path (single-file mode only) |
| `--output-dir` | same dir as input | Directory for output files in batch mode |
| `-j` | CPU count | Worker processes for parallel batch conversion |
| `--barcode-prefix` | — | String prepended to every barcode |
| `--row-group-size` | | Row-group size written to Parquet |
| `--separator` | | Field separator in the input TSV |
Input format
Expected TSV columns (no header):
chrom start end barcode count
chr1 10001 10200 ATCG... 1
Both gzip-compressed (.tsv.gz) and uncompressed (.tsv) files are supported.
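For checking an input file before conversion, the five-column layout can be read with the Python standard library alone. This is a minimal sketch, independent of gatac: the names `Fragment` and `read_fragments` are hypothetical, and skipping blank lines and `#` comment lines is an assumption rather than documented behavior.

```python
import csv
import gzip
from typing import Iterator, NamedTuple


class Fragment(NamedTuple):
    chrom: str
    start: int
    end: int
    barcode: str
    count: int


def read_fragments(path: str, separator: str = "\t") -> Iterator[Fragment]:
    """Yield fragments from a headerless .tsv or .tsv.gz file."""
    # Pick the opener from the file extension; both return text-mode handles.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter=separator, quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row or row[0].startswith("#"):
                continue  # assumed: ignore blank lines and comment lines
            chrom, start, end, barcode, count = row[:5]
            yield Fragment(chrom, int(start), int(end), barcode, int(count))
```

A row with fewer than five fields or a non-integer coordinate raises `ValueError`, which makes malformed files fail early instead of partway through a conversion.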
Examples
Single file
gatac convert pbmc_fragments.tsv.gz
# → pbmc_fragments.parquet
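The default output name shown above appears to follow a simple rule: drop the `.tsv.gz` (or `.tsv`) suffix and append `.parquet`, in the input's own directory unless an output directory is given. The sketch below encodes that rule as inferred from this page; `default_output_path` is a hypothetical helper, not part of gatac.

```python
from pathlib import Path
from typing import Optional


def default_output_path(input_path: str, output_dir: Optional[str] = None) -> Path:
    """Derive the Parquet path for an input fragment file (inferred naming rule)."""
    src = Path(input_path)
    name = src.name
    # Strip the longest matching fragment-file suffix first.
    for suffix in (".tsv.gz", ".tsv"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    target_dir = Path(output_dir) if output_dir is not None else src.parent
    return target_dir / f"{name}.parquet"
```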
Single file with explicit output
gatac convert pbmc_fragments.tsv.gz --output pbmc.parquet
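Batch conversion with barcode prefixes
# Derive each barcode prefix from the sample directory name
mkdir -p parquets
for f in data/*/fragments.tsv.gz; do
  sample="$(basename "$(dirname "$f")")"
  gatac convert "$f" --output "parquets/${sample}.parquet" --barcode-prefix "${sample}_"
done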
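When every sample directory contains a file with the same name (`fragments.tsv.gz`), each input needs its own output name and barcode prefix. The sketch below plans such a batch from a glob pattern using the parent directory as the sample name. `Job` and `plan_batch` are hypothetical names, and the `<sample>.parquet` and `<sample>_` conventions are assumptions for illustration.

```python
import glob
from pathlib import Path
from typing import List, NamedTuple


class Job(NamedTuple):
    input_path: Path
    output_path: Path
    barcode_prefix: str


def plan_batch(pattern: str, output_dir: str) -> List[Job]:
    """Expand a glob pattern into one conversion job per matching file."""
    jobs = []
    for match in sorted(glob.glob(pattern)):
        src = Path(match)
        sample = src.parent.name  # e.g. data/sampleA/fragments.tsv.gz -> "sampleA"
        jobs.append(Job(src, Path(output_dir) / f"{sample}.parquet", f"{sample}_"))
    return jobs
```

Each planned job carries the values needed for one conversion, for example as the `barcode_prefix` keyword of `ga.pp.make_parquet`.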
Parallel conversion with N workers
gatac convert "samples/*.tsv.gz" --output-dir out/ -j 8
Python equivalent
import gatac as ga
# Single file
ga.pp.make_parquet("pbmc_fragments.tsv.gz", barcode_prefix="PBMC_")
# Batch
ga.pp.make_parquet_batch(
    ["sampleA.tsv.gz", "sampleB.tsv.gz"],
    output_dir="parquets/",
    workers=4,
)