gatac.tl.spectral

gatac.tl.spectral(adata, n_comps=30, features='selected', random_state=0, weighted_by_sd=True, feature_weights=None, inplace=True, chunk_size=None)

GPU-accelerated spectral embedding via Laplacian Eigenmaps.

Converts the cell × feature count matrix into a lower-dimensional representation using the spectrum of the normalized graph Laplacian defined by pairwise cosine similarity between cells. The entire computation runs on the GPU via CuPy, using a matrix-free approach that never materializes the N × N similarity matrix.

This is a GPU-accelerated port of SnapATAC2’s tl.spectral.

Parameters:
adata AnnData

AnnData object. adata.X should be a sparse cell × tile (or peak) matrix, either binarized or count-valued.

n_comps int

Number of spectral dimensions to compute. When weighted_by_sd=True (default) the result is insensitive to this value as long as it is large enough (e.g. 30).

features str | ndarray | None

Which features (columns) to use. "selected" uses the boolean mask in adata.var["selected"] and requires a prior call to pp.select_features. You can also pass a NumPy boolean array of length n_vars or None to use all features.
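
The three accepted forms of features all reduce to one boolean column mask. The sketch below illustrates that resolution with NumPy only. `resolve_feature_mask` is a hypothetical helper, not part of gatac, and its `selected` argument stands in for adata.var["selected"].

```python
import numpy as np

def resolve_feature_mask(features, n_vars, selected=None):
    """Hypothetical helper (not gatac API): turn `features` into a boolean mask.

    `selected` plays the role of adata.var["selected"].
    """
    if features is None:
        # None means "use every column".
        return np.ones(n_vars, dtype=bool)
    if isinstance(features, str):
        if features != "selected":
            raise ValueError(f"unknown features option: {features!r}")
        if selected is None:
            raise KeyError('no "selected" mask; run feature selection first')
        mask = np.asarray(selected, dtype=bool)
    else:
        mask = np.asarray(features)
        if mask.dtype != bool:
            raise TypeError("features array must be boolean")
    if mask.shape != (n_vars,):
        raise ValueError(f"mask must have length n_vars={n_vars}")
    return mask

selected = np.array([True, False, True, True, False])
mask_selected = resolve_feature_mask("selected", 5, selected)
mask_all = resolve_feature_mask(None, 5)
mask_custom = resolve_feature_mask(np.array([False, False, True, False, True]), 5)
```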

random_state int

Seed for reproducibility of the Lanczos starting vector.

weighted_by_sd bool

If True (default), weight each eigenvector by the square root of its eigenvalue and discard components with non-positive eigenvalues. This typically eliminates the need to manually choose the number of components.
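
The weighting step can be sketched in NumPy as follows. This is an illustration of the rule described above, not gatac's own code, and `weight_by_sd` is a hypothetical helper name. It also shows why the embedding can end up with fewer columns than n_comps.

```python
import numpy as np

def weight_by_sd(eigenvalues, eigenvectors):
    """Keep components with positive eigenvalues, scaled by sqrt(eigenvalue)."""
    evals = np.asarray(eigenvalues, dtype=np.float64)
    evecs = np.asarray(eigenvectors, dtype=np.float64)
    keep = evals > 0                      # drop non-positive eigenvalues
    scaled = evecs[:, keep] * np.sqrt(evals[keep])
    return evals[keep], scaled

# Toy input: 3 cells, 4 requested components, two of them non-positive.
evals = np.array([0.81, 0.25, 0.0, -0.04])
evecs = np.ones((3, 4))
kept_evals, embedding = weight_by_sd(evals, evecs)
```

Here only two of the four requested components survive, so `embedding` has two columns.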

feature_weights ndarray | None

Optional per-feature IDF weights. If None, IDF weights are computed automatically from the data.
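
The exact IDF formula is not stated here. As a minimal sketch, assuming the common smoothed form log(n_cells / (1 + total count of the feature)), weights could be precomputed and passed as feature_weights. `idf_weights` is a hypothetical helper, not a gatac function.

```python
import numpy as np
import scipy.sparse as sp

def idf_weights(X):
    """Assumed formula: log(n_cells / (1 + per-feature column sum))."""
    n_cells = X.shape[0]
    feature_totals = np.asarray(X.sum(axis=0), dtype=np.float64).ravel()
    return np.log(n_cells / (1.0 + feature_totals))

# Toy binarized matrix: 4 cells x 3 features.
X = sp.csr_matrix(np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
], dtype=np.float32))
w = idf_weights(X)
```

Under this formula a feature present in every cell gets the smallest weight, and rare features get the largest.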

inplace bool

If True, store the embedding in adata.obsm["X_spectral"] and the eigenvalues in adata.uns["spectral_eigenvalue"]. If False, return (eigenvalues, eigenvectors) as NumPy arrays.

chunk_size int | None

When set, process the matrix in row-batches of this many cells instead of loading the full matrix into GPU memory at once. Only one chunk resides on the GPU at a time during each Lanczos iteration, trading throughput for reduced peak GPU memory. Recommended values: 20 000 – 50 000. None (default) loads the full matrix to the GPU (fastest, but requires enough VRAM).

Returns:

If inplace=True, stores the results in adata and returns None. If inplace=False, returns (eigenvalues, eigenvectors).

Return type:

tuple[ndarray, ndarray] | None

Notes

In the default full-GPU path, the matrix is uploaded in its original integer dtype and only X.data is converted to float32 in place during normalization. The sparse index arrays stay int32, and peak GPU memory is roughly one float32 copy of the data array.
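
As a rough check on whether the full-GPU path will fit in VRAM, the CSR footprint can be estimated from the number of cells and non-zeros. The sketch assumes float32 values and int32 indices as described above. The helper name and the dataset size are made up for illustration, and eigensolver workspace is not counted.

```python
def csr_bytes(n_cells, nnz, data_bytes=4, index_bytes=4):
    """Approximate CSR storage: values + column indices + row pointers."""
    data = nnz * data_bytes
    indices = nnz * index_bytes
    indptr = (n_cells + 1) * index_bytes
    return data + indices + indptr

# Hypothetical dataset: 200,000 cells, 5,000 non-zeros per cell on average.
n_cells = 200_000
nnz = n_cells * 5_000
total_gib = csr_bytes(n_cells, nnz) / 2**30
print(f"approx. CSR footprint: {total_gib:.2f} GiB")
```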

When chunk_size is set, peak GPU memory is roughly one chunk. The full matrix stays on the CPU, row chunks are streamed to the GPU for each eigsh matvec, and throughput is lower because each matvec requires two passes over the streamed chunks.
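
The two passes come from the shape of the product X (X^T v): the inner product X^T v must be complete before any output row can be computed. The CPU sketch below uses SciPy as a stand-in for the GPU code, and `chunked_gram_matvec` is a hypothetical name. In the streamed setting each `block` would be transferred to the GPU and released before the next one is loaded.

```python
import numpy as np
import scipy.sparse as sp

def chunked_gram_matvec(X, v, chunk_size):
    """Compute X @ (X.T @ v) while touching only `chunk_size` rows at a time."""
    n_cells, n_features = X.shape
    # Pass 1: accumulate u = X.T @ v from row blocks.
    u = np.zeros(n_features, dtype=np.float64)
    for start in range(0, n_cells, chunk_size):
        stop = start + chunk_size
        block = X[start:stop]
        u += block.T @ v[start:stop]
    # Pass 2: each block of the output is block @ u.
    out = np.empty(n_cells, dtype=np.float64)
    for start in range(0, n_cells, chunk_size):
        stop = start + chunk_size
        out[start:stop] = X[start:stop] @ u
    return out

rng = np.random.default_rng(0)
dense = rng.random((1000, 200))
dense[dense > 0.05] = 0.0              # keep roughly 5% of entries
X = sp.csr_matrix(dense)
v = rng.standard_normal(1000)

chunked = chunked_gram_matvec(X, v, chunk_size=256)
full = X @ (X.T @ v)
```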

The algorithm:

  1. Apply IDF weights and L2-normalize each row of the selected feature matrix so that row dot-products equal cosine similarities.

  2. Define a matrix-free linear operator A v = Y (Y^T v) - D^{-1} v, where D is the diagonal matrix of cell degrees (sum of cosine similarities per cell) and Y = D^{-1/2} X is the degree-scaled matrix. This implicitly represents D^{-1/2} (S - I) D^{-1/2} with S = X X^T, without ever forming S.

  3. Compute the top-k eigenpairs via CuPy’s Lanczos (eigsh).

  4. Optionally weight eigenvectors by sqrt(eigenvalue).
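
The steps above can be sketched on the CPU with SciPy. This is a small reference illustration, not gatac's implementation. Several details are assumptions: the IDF formula, taking the degree as the row sum of S minus the self-similarity of 1, requesting the largest algebraic eigenvalues (which="LA"), and the function name `spectral_sketch`. CuPy exposes `eigsh` and `LinearOperator` under cupyx.scipy.sparse.linalg, so a GPU version would follow the same outline.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, eigsh

def spectral_sketch(X, n_comps=5, weighted_by_sd=True, seed=0):
    X = sp.csr_matrix(X, dtype=np.float64)
    n_cells = X.shape[0]

    # Step 1: IDF weighting (assumed formula), then L2-normalize each row
    # so that row dot-products are cosine similarities.
    idf = np.log(n_cells / (1.0 + np.asarray(X.sum(axis=0)).ravel()))
    X = X @ sp.diags(idf)
    row_norm = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    X = sp.diags(1.0 / row_norm) @ X

    # Degree of each cell in S - I, computed without forming S = X X^T:
    # row sums of S are X @ (column sums of X); subtract self-similarity 1.
    degree = X @ np.asarray(X.sum(axis=0)).ravel() - 1.0
    Y = sp.diags(1.0 / np.sqrt(degree)) @ X
    inv_degree = 1.0 / degree

    # Step 2: matrix-free operator for D^{-1/2} (S - I) D^{-1/2}.
    def matvec(v):
        v = np.ravel(v)
        return Y @ (Y.T @ v) - inv_degree * v

    A = LinearOperator((n_cells, n_cells), matvec=matvec, dtype=np.float64)

    # Step 3: Lanczos iteration with a seeded starting vector.
    v0 = np.random.default_rng(seed).standard_normal(n_cells)
    evals, evecs = eigsh(A, k=n_comps, which="LA", v0=v0)
    order = np.argsort(evals)[::-1]          # sort descending
    evals, evecs = evals[order], evecs[:, order]

    # Step 4: optional sqrt(eigenvalue) weighting, dropping non-positive ones.
    if weighted_by_sd:
        keep = evals > 0
        evals, evecs = evals[keep], evecs[:, keep] * np.sqrt(evals[keep])
    return evals, evecs

# Toy binarized matrix: 80 cells x 400 features, about 10% non-zero.
rng = np.random.default_rng(1)
counts = (rng.random((80, 400)) < 0.1).astype(np.float64)
evals, embedding = spectral_sketch(counts, n_comps=5)
```

The sketch assumes every cell has at least one non-zero feature and shares a feature with some other cell. Otherwise a row norm or degree would be zero and the normalization would divide by zero.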

Examples

>>> import gatac
>>> # After tile matrix creation and feature selection:
>>> gatac.pp.select_features(adata)
>>> gatac.tl.spectral(adata)
>>> adata.obsm["X_spectral"].shape
(n_cells, n_effective_comps)

For large datasets that exceed GPU memory, stream the matrix in row chunks:

>>> gatac.tl.spectral(adata, chunk_size=30_000)