gatac features#
GPU-accelerated selection of the most accessible genomic features from one or more tile-matrix h5ad files. Uses quantile-based filtering following the ArchR approach for binary matrices.
Synopsis#
gatac features <input.h5ad> [-n N_FEATURES]
    [-o OUTPUT] [--no-binarize]
    [--filter-lower QUANTILE] [--filter-upper QUANTILE]
Arguments#
Positional#
| Argument | Description |
|---|---|
| `<input.h5ad>` | Path(s) or glob to h5ad tile-matrix file(s) |
Options#
| Flag | Default | Description |
|---|---|---|
| `-n N_FEATURES` | | Number of features to retain |
| `-o OUTPUT` | in-place or | Output path |
| `--no-binarize` | — | Skip binarization of the output matrix |
| `--filter-lower QUANTILE` | | Lower quantile cutoff (removes very rare features) |
| `--filter-upper QUANTILE` | | Upper quantile cutoff (removes ubiquitously open features) |
Algorithm#
For binary matrices (standard tile matrices), GATAC:
1. Computes per-feature accessibility counts (sum of binarized matrix).
2. Filters out features in the lower `filter_lower_quantile` and upper `filter_upper_quantile` quantile of accessibility.
3. Selects the top `n_features` most accessible features from the remaining set.
For count matrices, the same procedure applies but on raw counts.
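The selection procedure can be illustrated with a minimal CPU sketch in NumPy. This is not GATAC's implementation: the function name `select_features_sketch` is hypothetical, and the choices to drop zero-count features first and to trim the tails by rank are assumptions made for illustration.

```python
import numpy as np

def select_features_sketch(X, n_features, filter_lower, filter_upper, binarize=True):
    """Return (mask, counts) for a cells x features matrix X.

    Hypothetical helper: trims the lowest and highest fraction of
    features by accessibility rank, then keeps the top n_features.
    """
    X = np.asarray(X)
    if binarize:
        X = X > 0                         # binary accessibility
    counts = X.sum(axis=0)                # per-feature accessibility count

    order = np.argsort(counts, kind="stable")   # ascending accessibility
    order = order[counts[order] > 0]            # assumption: ignore never-open features
    n = order.size
    n_lower = int(filter_lower * n)             # features trimmed from the rare tail
    n_upper = int(filter_upper * n)             # features trimmed from the ubiquitous tail
    kept = order[n_lower:n - n_upper]

    top = kept[::-1][:n_features]               # most accessible of what remains
    mask = np.zeros(counts.size, dtype=bool)
    mask[top] = True
    return mask, counts

# Toy matrix: feature j is open in exactly j of the 10 cells.
X = np.arange(10)[:, None] < np.arange(10)[None, :]
mask, counts = select_features_sketch(X, n_features=3, filter_lower=0.2, filter_upper=0.2)
print(np.flatnonzero(mask))  # selected feature indices: [6 7 8]
```

In the toy example, feature 9 is removed as the most ubiquitous feature and feature 1 as the rarest non-zero one, so the three most accessible of the remaining features are 6, 7 and 8.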
Multi-file streaming#
When multiple h5ad files (or a glob) are provided, GATAC performs streaming feature selection:
- Feature counts are accumulated file-by-file without loading all data into memory simultaneously.
- A single set of `n_features` is selected across the union of all features.
- Output is a combined h5ad file.
This is the recommended approach for large multi-sample studies.
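The accumulation step can be sketched as follows. This is an illustrative stand-in, not GATAC's code: `accumulate_counts` and `load_samples` are hypothetical names, the feature labels are made up, and small in-memory arrays replace the h5ad files that would be read one at a time.

```python
from collections import Counter
import numpy as np

def accumulate_counts(samples, binarize=True):
    """Sum per-feature counts over samples, holding one sample at a time.

    `samples` yields (feature_names, matrix) pairs; features are keyed by
    name so that files with different feature sets merge into a union.
    """
    totals = Counter()
    for names, X in samples:
        X = np.asarray(X)
        if binarize:
            X = X > 0
        for name, count in zip(names, X.sum(axis=0)):
            totals[name] += int(count)
    return totals

def load_samples():
    # Hypothetical stand-in for reading each h5ad file in turn.
    yield ["chr1:0-500", "chr1:500-1000"], np.array([[1, 0], [2, 1]])
    yield ["chr1:500-1000", "chr1:1000-1500"], np.array([[1, 3], [1, 0]])

totals = accumulate_counts(load_samples())
print(dict(totals))
```

Because only the running totals persist between files, memory use scales with the number of distinct features rather than with the number of cells across all samples. The quantile filter and top-`n_features` selection would then run once on the accumulated totals.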
Examples#
Single file (in-place)#
gatac features tile_matrix.h5ad -n 500000
Single file, save to new path#
gatac features tile_matrix.h5ad -n 200000 -o tile_selected.h5ad
Multi-sample streaming#
gatac features "data/*.h5ad" -n 500000 -o combined_selected.h5ad
Loose filtering (keep more rare/ubiquitous features)#
gatac features tile_matrix.h5ad \
    -n 500000 \
    --filter-lower 0.001 \
    --filter-upper 0.001
Python equivalent#
import gatac as ga

# Single file (adata is an AnnData tile matrix already loaded in memory)
ga.pp.select_features(adata, n_features=500_000)

# Multi-file streaming
ga.pp.select_features_multi(
    ["sampleA.h5ad", "sampleB.h5ad"],
    output_path="combined.h5ad",
    n_features=500_000,
    binarize=True,
)
Output AnnData changes#
| Slot | Content |
|---|---|
| `adata.var["selected"]` | Boolean mask of selected features |
| | Per-feature accessibility count |
After feature selection, downstream tools automatically subset to `adata.var["selected"] == True` unless the full matrix is requested.
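For code outside GATAC that needs the same subset, applying the mask by hand is plain boolean column indexing. The sketch below uses a NumPy array as a stand-in for the matrix and a hard-coded mask in place of `adata.var["selected"]`; both are illustrative assumptions.

```python
import numpy as np

# Stand-in for the cells x features matrix.
X = np.array([[1, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 1]])

# Stand-in for adata.var["selected"].
selected = np.array([True, False, True, False])

# Keep only the columns (features) flagged as selected.
X_selected = X[:, selected]
print(X_selected.shape)  # (3, 2)
```

On an AnnData object the equivalent indexing would be `adata[:, adata.var["selected"].to_numpy()]`, which returns a view restricted to the selected features.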