Spectral Embedding, Clustering & Peak Calling#
import gatac as ga
import scanpy as sc
import rapids_singlecell as rsc
import anndata as ad
import time
sc.set_figure_params(dpi=100, frameon=False)
adata = ad.read_h5ad("data/pbmc_tile.h5ad")
print(adata)
AnnData object with n_obs × n_vars = 5001 × 6062095
obs: 'barcode', 'n_unique', 'tsse_score', 'duplicate_fraction', 'mito_fraction'
var: 'chrom', 'start', 'end', 'selected', 'accessibility_count'
1. Spectral embedding#
Compute a spectral decomposition of the TF-IDF-normalised cell × tile matrix.
The resulting components are stored in adata.obsm["X_spectral"].
t0 = time.perf_counter()
ga.tl.spectral(adata, n_comps=30)
print(f"Spectral embedding in {time.perf_counter() - t0:.1f}s")
print(f"Embedding shape: {adata.obsm['X_spectral'].shape}")
15:20:49 - gatac.tl.spectral - INFO - Running GPU-accelerated spectral embedding ...
15:20:51 - gatac.tl.spectral - INFO - IDF weights computed.
15:20:51 - gatac.tl.spectral - INFO - Row normalization done.
15:20:51 - gatac.tl.spectral - INFO - Eigen-decomposition done.
15:20:51 - gatac.tl.spectral - INFO - Spectral embedding complete: 30 components (top eigenvalue = 1.0000).
Spectral embedding in 1.9s
Embedding shape: (5001, 30)
2. UMAP & Leiden clustering#
3. Call peaks per cluster#
We are running for each Leiden cluster a gpu accelerated implementation of MACS3 (gmacs).
t0 = time.perf_counter()
ga.tl.call_peaks(
adata,
groupby="leiden",
parquet_path="data/pbmc_fragments.parquet",
genome="hg38",
)
print(f"Peak calling in {time.perf_counter() - t0:.1f}s")
15:21:05 - gatac.tl.peaks - INFO - Using genome length: 3,088,269,832 bp
15:21:05 - gatac.tl.peaks - INFO - Calling peaks for 13 groups in 'leiden'
15:21:05 - gatac.tl.peaks - INFO - Processing group: 5
15:21:08 - gatac.tl.peaks - INFO - 5,000,812 fragments, calling peaks...
15:21:30 - gatac.tl.peaks - INFO - Found 45,079 peaks
15:21:31 - gatac.tl.peaks - INFO - Processing group: 4
15:21:33 - gatac.tl.peaks - INFO - 9,778,471 fragments, calling peaks...
15:21:56 - gatac.tl.peaks - INFO - Found 56,580 peaks
15:21:56 - gatac.tl.peaks - INFO - Processing group: 6
15:21:58 - gatac.tl.peaks - INFO - 3,703,051 fragments, calling peaks...
15:22:20 - gatac.tl.peaks - INFO - Found 70,325 peaks
15:22:20 - gatac.tl.peaks - INFO - Processing group: 8
15:22:22 - gatac.tl.peaks - INFO - 9,735,768 fragments, calling peaks...
15:22:45 - gatac.tl.peaks - INFO - Found 52,531 peaks
15:22:45 - gatac.tl.peaks - INFO - Processing group: 3
15:22:47 - gatac.tl.peaks - INFO - 9,769,370 fragments, calling peaks...
15:23:09 - gatac.tl.peaks - INFO - Found 53,840 peaks
15:23:10 - gatac.tl.peaks - INFO - Processing group: 9
15:23:11 - gatac.tl.peaks - INFO - 2,579,955 fragments, calling peaks...
15:23:32 - gatac.tl.peaks - INFO - Found 53,778 peaks
15:23:32 - gatac.tl.peaks - INFO - Processing group: 1
15:23:34 - gatac.tl.peaks - INFO - 4,010,473 fragments, calling peaks...
15:23:55 - gatac.tl.peaks - INFO - Found 60,363 peaks
15:23:55 - gatac.tl.peaks - INFO - Processing group: 12
15:23:57 - gatac.tl.peaks - INFO - 2,578,345 fragments, calling peaks...
15:24:17 - gatac.tl.peaks - INFO - Found 71,171 peaks
15:24:17 - gatac.tl.peaks - INFO - Processing group: 7
15:24:20 - gatac.tl.peaks - INFO - 13,759,752 fragments, calling peaks...
15:24:42 - gatac.tl.peaks - INFO - Found 100,060 peaks
15:24:42 - gatac.tl.peaks - INFO - Processing group: 10
15:24:45 - gatac.tl.peaks - INFO - 8,225,915 fragments, calling peaks...
15:25:09 - gatac.tl.peaks - INFO - Found 80,956 peaks
15:25:09 - gatac.tl.peaks - INFO - Processing group: 11
15:25:11 - gatac.tl.peaks - INFO - 2,982,472 fragments, calling peaks...
15:25:32 - gatac.tl.peaks - INFO - Found 77,834 peaks
15:25:32 - gatac.tl.peaks - INFO - Processing group: 0
15:25:34 - gatac.tl.peaks - INFO - 4,367,787 fragments, calling peaks...
15:25:55 - gatac.tl.peaks - INFO - Found 71,424 peaks
15:25:56 - gatac.tl.peaks - INFO - Processing group: 2
15:25:57 - gatac.tl.peaks - INFO - 4,291,208 fragments, calling peaks...
15:26:19 - gatac.tl.peaks - INFO - Found 56,955 peaks
Peak calling in 314.3s
# Number of peaks per cluster
for group, peak_df in adata.uns["gmacs"].items():
print(f" Cluster {group}: {len(peak_df):,} peaks")
Cluster 5: 45,079 peaks
Cluster 4: 56,580 peaks
Cluster 6: 70,325 peaks
Cluster 8: 52,531 peaks
Cluster 3: 53,840 peaks
Cluster 9: 53,778 peaks
Cluster 1: 60,363 peaks
Cluster 12: 71,171 peaks
Cluster 7: 100,060 peaks
Cluster 10: 80,956 peaks
Cluster 11: 77,834 peaks
Cluster 0: 71,424 peaks
Cluster 2: 56,955 peaks
4. Merge peaks across clusters#
ga.tl.merge_peaks(adata,chrom_sizes="hg38")
adata.uns["peaks"]
| chrom | start | end | name | score | strand | signal_value | p_value | q_value | peak | |
|---|---|---|---|---|---|---|---|---|---|---|
| 171986 | chr1 | 778385 | 778886 | . | 1000 | . | 18.646694 | 1000.000000 | 1000.000000 | 250 |
| 45085 | chr1 | 827219 | 827720 | . | 1000 | . | 22.037284 | 1000.000000 | 1000.000000 | 250 |
| 171990 | chr1 | 869638 | 870139 | . | 1000 | . | 32.572115 | 1000.000000 | 1000.000000 | 250 |
| 463684 | chr1 | 959011 | 959512 | . | 1000 | . | 20.804989 | 1000.000000 | 1000.000000 | 250 |
| 463694 | chr1 | 1013132 | 1013633 | . | 1000 | . | 20.022753 | 1000.000000 | 1000.000000 | 250 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 742801 | chr14 | 66718232 | 66718733 | . | 13 | . | 3.048780 | 3.278076 | 1.328900 | 250 |
| 743289 | chr14 | 79501649 | 79502150 | . | 13 | . | 3.048780 | 3.278076 | 1.328900 | 250 |
| 755741 | chr19 | 32997620 | 32998121 | . | 13 | . | 3.048780 | 3.278076 | 1.328900 | 250 |
| 776568 | chr5 | 118732321 | 118732822 | . | 13 | . | 3.048780 | 3.278076 | 1.328900 | 250 |
| 165232 | chr8 | 60483580 | 60484081 | . | 13 | . | 3.048780 | 3.278076 | 1.302123 | 250 |
178350 rows × 10 columns
5. Build cell × peak matrix#
t0 = time.perf_counter()
peak_adata = ga.tl.make_peak_matrix(
adata,parquet_path="data/pbmc_fragments.parquet",
)
print(f"Peak matrix built in {time.perf_counter() - t0:.1f}s")
print(peak_adata)
15:26:48 - gatac.tl.peaks - INFO - Counting fragments in 178,350 peaks
15:26:48 - gatac.tl.peaks - INFO - Processing single parquet file
15:26:59 - gatac.pp.gene - INFO - Using uint8 dtype (max count: 70)
15:26:59 - gatac.tl.peaks - INFO - Peak matrix: 5,001 cells × 178,350 peaks
15:26:59 - gatac.tl.peaks - INFO - Sparsity: 95.17%
Peak matrix built in 11.2s
AnnData object with n_obs × n_vars = 5001 × 178350
obs: 'barcode', 'n_unique', 'tsse_score', 'duplicate_fraction', 'mito_fraction', 'leiden'
6. Save#
peak_adata.write_h5ad("data/pbmc_peaks.h5ad")
print("Saved → data/pbmc_peaks.h5ad")
Saved → data/pbmc_peaks.h5ad
Next steps#
03 – Motif enrichment
04 – chromVAR