Processing Alevin bootstraps for Seurat integration

When quantifying single-cell transcript expression, fragment assignment ambiguity and multi-mapping reads can introduce inferential variance in estimated counts. Salmon/Alevin can output per‑cell bootstrap replicates that capture inferential uncertainty due to fragments mapping to multiple transcripts of the same gene^1–3. This uncertainty manifests as overdispersion which, if uncorrected, inflates variability estimates and distorts downstream analyses (differential expression, clustering, trajectory inference)⁴.

Here we present scAmbi, an integrated approach that (i) estimates per‑transcript overdispersion from Alevin bootstraps in a sparse‑aware way, (ii) combines that estimate with moment‑based and entropy-based complexity priors for further exploration, and (iii) applies the boostrap-based overdispersion result either by building a corrected assay (RNA_corr) or by supplying model offsets. We then evaluate the impact using BCV (biological coefficient of variation) analyses at the between‑sample and within‑sample (cell‑to‑cell) levels.

How this relates to `edgeR::catchSalmon()`

catchSalmon (bulk) computes per‑transcript technical OD from bootstraps across samples, applies EB moderation with a floor at 1, and analyzes scaled counts. Our approach is analogous but uses cells as replicates. Additionally, scAmbi adds a moment‑based component, an annotation prior, and an integrated overdispersion estimate (boostrap + moment-based + annotation prior) for exploration, and supports offset‑based workflows in addition to a corrected assay.

What’s unique about this package

OD_integrated: weighted geometric fusion of three signals: bootstrap‑based OD, moment‑based OD on normalized counts, and an entropy-based complexity prior, followed by adaptive shrinkage.
Sparse‑aware C++ kernel: blockwise processing of the (transcripts × cells × bootstraps) matrix, strict bounds checking, and efficient dgCMatrix traversal.
Flexible correction: use RNA_corr directly, or supply offsets constructed from (a) corrected/raw ratios, (b) OD vectors, or (c) any feature vector stored in feature_meta.
BCV analysis suite: between‑sample (pseudobulk across samples) and within‑sample (cell grouping to pseudobulk) pipelines with matched plotting and summaries.

Dependencies

Required R packages

This package requires R version 4.2 or higher.

Core dependencies

Seurat (>= 4.0.0): Single-cell data structures and workflows
Matrix (>= 1.3-0): Sparse matrix operations
edgeR (>= 3.34.0): Dispersion estimation and BCV calculations
Rcpp (>= 1.0.7): C++ integration for performance-critical operations

Data I/O and processing

eds (>= 1.0.0): Reading Alevin EDS sparse matrix format
tximport (>= 1.20.0): Transcript-level quantification import
rtracklayer (>= 1.52.0): GTF file parsing
jsonlite (>= 1.7.0): JSON data handling

Visualization

ggplot2 (>= 3.3.0): Core plotting framework
patchwork (>= 1.1.0): Combining multiple plots
dplyr (>= 1.0.0): Data manipulation for summaries
tidyr (>= 1.1.0): Data reshaping for visualization

Parallel processing

parallel: Built-in R package for multi-core processing

Development

These packages are necessary for building the vignettes and running tests. You can install them by running the following command:

install.packages(c("knitr", "rmarkdown", "testthat"))

System requirements

C++ compiler: Required for compiling Rcpp functions (e.g., g++ >= 7.0 or clang >= 4.0)
OpenMP (optional): For additional parallelization in C++ code
Memory: Minimum 64GB RAM recommended for typical datasets (4-8 samples, ~1M cells each)
Storage: Sufficient space for Alevin bootstrap files (can be several GB per sample)

Installation

From GitHub (recommended)

# install.packages("remotes")
remotes::install_github("sbresnahan/scAmbi")

From a local source tarball/zip

# If you have scAmbi.zip or a source tar.gz:
install.packages("scAmbi.zip", repos = NULL, type = "source")
# or use devtools:
# devtools::install("path/to/scAmbi/")

Notation and variance model

The overdispersion correction approach implemented in scAmbi is motivated by the observation that transcript-level quantifications exhibit increased technical variance due to read-to-transcript ambiguity (RTA). While reads can typically be assigned unambiguously to genes, they are often compatible with multiple transcripts for the same gene, particularly for genes with many isoforms. This assignment uncertainty disrupts the mean-variance relationship normally observed in RNA-seq data and interferes with standard differential expression methods.

For gene (or transcript) $g$ in cell $i$: counts $y_{gi}$, library size $N_i$, underlying proportion $\pi_{gi}$, mean $\mu_{gi}=N_i\pi_{gi}$. Read-to-transcript ambiguity inflates technical variance beyond Poisson:

$\mathbb{E}(y_{gi}\mid\pi_{gi})=\mu_{gi}, \qquad \mathrm{Var}(y_{gi}\mid\pi_{gi})=\sigma_g^2\,\mu_{gi}, \; \sigma_g^2>1$

With biological variation across cells, $\pi_{gi}$ varies and marginal variance becomes

\[\mathrm{Var}(y_{gi})\;=\;\sigma_g^2\,\mu_{gi}\; +\; \phi_g\,\mu_{gi}^2,\]

with $\phi_g$ the NB-like biological dispersion⁴. The key insight is that the technical overdispersion parameter $\sigma_g^2$ can be estimated from bootstrap samples and divided out of the transcript counts, yielding scaled counts that follow the standard negative binomial mean-variance relationship and can be analyzed with established differential expression methods.

Integrated overdispersion estimation

By default, scAmbi uses a bootstrap-based overdispersion estimate following Baldoni et al.⁴ for correction. For exploratory purposes, it also computes two additional components per transcript and an integrated estimate by combining the three components on the log scale.

1) Bootstrap-based component (sparse-aware)

Let $u_{tib}$ be the Alevin bootstrap count for transcript $t$, cell $i$, bootstrap $b = 1..B$.
For each cell we traverse the sparse bootstrap columns once, accumulating for each transcript:

Total sum $\sum_b u_{tib}$ and sum of squares $\sum_b u_{tib}^2$ over all $B$ replicates.
A “seen” flag indicating at least one non-zero bootstrap value for that transcript in that cell.

For any transcript marked as seen in a cell, we set $k = B$ (all replicates, zeros included), compute:

\[ \bar{u}_{ti} = \frac{1}{B} \sum_{b=1}^B u_{tib}, \quad s^2_{ti} = \frac{1}{B - 1} \sum_{b=1}^B \left(u_{tib} - \bar{u}_{ti}\right)^2 \]

and define the per-cell bootstrap OD inflation as:

\[ \mathrm{OD}_{t,i} = 1 + \frac{s^2_{ti}}{\bar{u}_{ti} + \varepsilon}. \]

We then accumulate across cells:
$\mathrm{OD\_sum}_t = \sum_i \mathrm{OD}_{t,i}$,
$\mathrm{DF}_t =$ number of contributing cells,
$e_t =$ number of expressing cells (same as $\mathrm{DF}_t$ here).

The raw transcript-level bootstrap OD is:

\[ \widehat{\mathrm{OD}}^{\mathrm{boot}}_t = \frac{\mathrm{OD\_sum}_t}{\mathrm{DF}_t}. \]

EB moderation. Let the prior be the median of values $>1$ with prior df $d_0=10$:

\[ \mathrm{OD}^{\mathrm{boot}}_t = \frac{d_0\,\mathrm{Prior} + \mathrm{DF}_t\,\widehat{\mathrm{OD}}^{\mathrm{boot}}_t}{d_0 + \mathrm{DF}_t}, \]

clamped to $\ge 1$. Transcripts with $e_t$ below the expression threshold default to 1.

Notes:
1. We use $k = B$ for seen transcripts and compute an inflation factor $1 + s^2/\mu$, ensuring that any non-zero bootstrap variance produces $\mathrm{OD} > 1$.
2. If gene-level OD is desired, aggregate after estimating transcript-level OD (e.g., gene-wise median or counts-weighted mean of transcript ODs).
3. $\mathrm{OD}^{\mathrm{boot}}_t$ is the default OD estimate for correction.

2) Moment-based component

The moment-based overdispersion estimate captures gene-specific variability patterns by comparing the observed variance to what would be expected under a Poisson model. Sequencing and amplification introduce technical noise that scales with expression level, while biological processes create additional variance through cell-to-cell differences in transcriptional state, cell cycle progression, and stochastic gene expression. Furthermore, low-abundance genes may be captured inconsistently across cells due to dropout effects, inflating their apparent variability.

The raw moment-based estimate uses counts across expressing cells only (non-zeros):

\[\widehat{\mathrm{OD}}^{\mathrm{mom}}_g = \frac{\mathrm{Var}(x_g) + \epsilon}{\mathbb{E}(x_g) + \epsilon}\]

where $\epsilon$ provides numerical stabilization. When rel_eps > 0, both epsilon terms equal rel_eps * mean. When rel_eps = 0 (default), both epsilon terms equal abs_eps (default $10^{-8}$). Using only expressing cells avoids bias from technical zeros while capturing the true biological variability among cells that detectably express the gene.

To account for estimation uncertainty with small sample sizes, confidence-weighted shrinkage is applied toward the null value of 1:

\[\mathrm{OD}^{\mathrm{mom}}_g = \max\!\left(1,\;1+\frac{n_g}{n_g+k}\big(\widehat{\mathrm{OD}}^{\mathrm{mom}}_g-1\big)\right)\]

where $n_g$ is the number of expressing cells and $k$ is the shrinkage parameter (default 30). This shrinkage is stronger for genes with fewer expressing cells, reflecting lower confidence in variance estimates from small samples.

A key consideration is that this approach may reduce some genuine biological signal along with technical noise, particularly for genes undergoing continuous state transitions or exhibiting meaningful transcriptional heterogeneity. To address this, the complexity prior (Section 3) provides gene-specific adjustments that account for expected biological variability in structurally complex genes.

3) Complexity prior (entropy-based)

The complexity prior is biologically motivated by the observation that genes with multiple actively expressed isoforms exhibit increased expression variability due to cell-to-cell differences in splicing regulation and quantification uncertainty when similar isoforms compete for read assignment. The prior combines isoform count with expression evenness to capture this increased variability. For each gene $g$, let $k_g$ be the number of expressed isoforms (transcripts with mean expression above threshold) and $E_g$ be the normalized Shannon entropy of isoform expression:

\[E_g = \frac{H_g}{\log(k_g)} = \frac{-\sum_{i} p_i \log(p_i)}{\log(k_g)}\]

where $p_i$ is the proportion of total gene expression from isoform $i$. The complexity prior for each transcript is then:

\[\mathrm{OD}^{\mathrm{prior}}_t = 1 + \alpha \cdot \log\big(1 + \max(k_g - 1, 0) \times \max(E_g, 0)\big)\]

where $\alpha$ is a scaling parameter (default 0.6). This gives higher prior values to transcripts from genes with more expressed isoforms that have relatively even expression levels. Transcripts from single-isoform genes or unmapped transcripts receive a prior of 1.

Figure 1

Figure 2

4) Geometric fusion + adaptive shrinkage

With weights $w_b=0.8$, $w_m=0.1$, $w_p=0.1$:

\[ \log \mathrm{OD}^{\mathrm{int}}_t = w_b\log\mathrm{OD}^{\mathrm{boot}}_t + w_m\log\mathrm{OD}^{\mathrm{mom}}_t + w_p\log\mathrm{OD}^{\mathrm{prior}}_t. \]

For additional stability we shrink toward the median of well-observed transcripts (e.g., $e_t \ge 50$) with weight $\alpha_t = \min(e_t/100,1)$:

\[ \mathrm{OD}^{\mathrm{int,final}}_t = \alpha_t\,\mathrm{OD}^{\mathrm{int}}_t + (1-\alpha_t)\,\mathrm{median}\big\{\mathrm{OD}^{\mathrm{int}}_{e \ge 50}\big\}, \]

then clamp to $[1,100]$.

Applying the correction: assays and offsets

We expose multiple, interchangeable ways to use the OD estimates.

Note: $\mathrm{OD}^{\mathrm{boot}}_t$ is the default OD estimate for correction.

A) Build a corrected assay (`RNA_corr`)

Form a left-diagonal scaling with entries $1/\mathrm{OD}^{\mathrm{boot}}_t$:

\[ Z = \mathrm{diag}\left(\frac{1}{\mathrm{OD}^{\mathrm{boot}}}\right)\,Y, \]

and store it as a second assay RNA_corr alongside the raw RNA counts. Feature-level metadata (feature_meta) is stored in both assays with columns:

OverDisp_integrated, OverDisp_bootstrap, OverDisp_moments, OverDisp_prior
expressing_cells, scaling_factor (= OverDisp_integrated), inv_scaling (= 1/scaling_factor)

B) Use model offsets (no second assay required)

When fitting NB-GLMs (e.g., edgeR), you can supply transcript- and sample-specific offsets instead of modifying counts.

Ratio offsets (preferred when RNA_corr exists): Use $\log\big(\text{corrected}/\text{raw}\big)$ per transcript × sample, i.e., offsets from the pseudobulk ratio pb_corr/pb_raw.
OD vector offsets: Treat an OD-like vector as a per-transcript multiplicative factor and add $\log\mathrm{OD}$ to the offset for every sample (same transcripts).
Arbitrary feature vectors: Any column in feature_meta can be used. Interpret as:
- vector_as = "od" for OverDisp_* or scaling_factor.
- vector_as = "ratio" for inv_scaling (since RNA_corr = RNA * inv_scaling).

All methods are implemented in calculate_bcv() / analyze_bcv_comprehensive() via offset_method = "ratio" | "vector" | "overdispersion".

BCV analyses

We quantify how correction impacts variability using BCV from edgeR in two complementary modes.

1) Between‑sample BCV (pseudobulk across samples)

Build a transcripts × samples pseudobulk by summing counts per sample.
Option A - Direct: run BCV on RNA_corr pseudobulk. Option B - Offsets: run BCV on raw pseudobulk but add offsets:
- offset_ratio: from pb_corr/pb_raw (transcript × sample).
- offset_od: from an OD vector (applied to all samples).
- offset_vector: from any feature_meta column.
Plot tagwise BCV vs AveLogCPM with common and trended curves; compare RNA vs corrected.

Interpretation. Post‑correction, BCV typically decreases for ambiguous features and the abundance‑dependent trend stabilizes.

2) Within‑sample (cell‑to‑cell) BCV

Randomly partition cells within a sample into $G$ groups.
Build a transcripts × groups pseudobulk by summing counts per group.
Compute BCV per sample. Offsets may be added using any feature vector, mirroring the between‑sample options.
Summarize across samples and visualize (per‑sample plots and an overall summary).

Interpretation. A successful correction reduces within‑sample (cell‑to‑cell) BCV, especially for highly ambiguous transcripts, without artificially flattening true biological variability.

Dataset and Alevin parameters

For the example dataset (four healthy term placenta samples) and Alevin command‑line options, see the original sections below (kept for continuity).

Dataset for example analysis

For this tutorial, we use scRNA‑seq from four healthy term placentas,⁵ specifically: SRR14134833, SRR14134834, SRR14134836, and SRR15193608. Libraries consisting of approximately 4.8M cells across these four samples were prepared using the 10x Genomics Single Cell 3′ Reagent Kit v3 and sequenced on an Illumina NovaSeq platform. Reads were trimmed with fastp and down-sampled to 200M reads each. Note these are 3’-scRNA-seq reads (Chromium v3) so there should be considerably less overdispersion than non-3’-selected preps.

Quantification in this tutorial uses our long‑read defined, placenta‑specific transcriptome reference⁶. Villous placenta RNA was used to prepare ONT cDNA libraries (PCS111), with raw signals basecalled by Guppy. Full‑length cDNA reads were oriented and trimmed with Pychopper2, then corrected for microindels using FMLRC2 with matched short‑read placenta RNA‑seq. Corrected reads were aligned to hg38 using minimap2; primary alignments across all samples were concatenated and assembled de novo with ESPRESSO. The assembly was characterized with SQANTI3 against GENCODE v45 to assign structural classes, and high‑confidence isoforms were identified using multiple orthogonal support signals including long‑read coverage, short‑read splice‑junction support, and proximity of TSS/TTS to placenta CAGE‑seq, DNase‑seq, and SAPAS sites.

Alevin parameters

salmon alevin \
  -l ISR \
  -1 example_1.fastq.gz \
  -2 example_2.fastq.gz \
  --chromiumV3 \
  -i index/transcripts \
  -p 10 \
  --whitelist index/3M-february-2018.txt \
  --numCellBootstraps 20 \
  --dumpFeatures \
  -o quants/example \
  --tgMap index/tx2g.tsv  # tx-to-tx identity table

Key options: --numCellBootstraps (bootstrap count), --dumpFeatures (emit quants_boot_mat.gz), and --tgMap (transcript‑to‑self for transcript‑level quantification).

Minimal usage example

Step 1: read data and compute integrated overdispersion for all samples

Set paths

dir_out  <- "/rsrch5/home/epi/bhattacharya_lab/data/Placenta_scRNAseq/analysis"
dir_data <- "/rsrch5/home/epi/bhattacharya_lab/data/Placenta_scRNAseq/salmon/quants"
SRA <- c("SRR14134833_test","SRR14134834_test","SRR14134836_test","SRR15193608_test")

gtf_file <- "/rsrch5/home/epi/bhattacharya_lab/data/Placenta_scRNAseq/salmon/ESPRESSO_corrected_SC_filtered.gtf"
transcript_info <- "/rsrch5/home/epi/bhattacharya_lab/data/Placenta_scRNAseq/salmon/tx2g_header.tsv"

files <- paste(dir_data, SRA, "alevin/quants_mat.gz", sep = "/")

Run

.log("Starting improved overdispersion estimation for %d samples", length(SRA))
all_sample_data <- lapply(
  SRA,
  function(s) {
    .log("Processing sample: %s", s)
    read_sample_data_improved(
      sample_id = s,
      base_dir  = dir_data,
      n_boot    = 20,
      gtf_file  = gtf_file,
      transcript_info = transcript_info,
      block_cells = 32,
      n_cores = 8,
      omp_threads = 1,
      debug_first_block = TRUE  # set TRUE once to sanity-check bounds
    )
  }
)
gc()

Step 2: create Seurat objects using the improved overdispersion estimates

objs <- lapply(
  1:length(SRA),
  function(i) {
    s_id <- SRA[i]
    s_data <- all_sample_data[[i]]
    process_and_create_seurat_corrected_improved(
      sample_id = s_id,
      counts = s_data$counts,
      od = s_data$od,
      feats = s_data$feats,
      cells = s_data$cells
    )
  }
)
.log("Created %d Seurat objects with improved overdispersion correction", length(objs))
gc()

Step 3 (optional): add feature IDs from tx2g

objs <- set_feature_metadata(objs,transcript_info)
gc()

Step 4: print summary statistics

for (i in 1:length(objs)) {
  seu <- objs[[i]]
  fm <- seu@assays[["RNA_corr"]]@misc[["feature_meta"]]
  
  cat("\n", strrep("=", 50), "\n")
  cat("Sample:", SRA[i], "\n")
  cat("Cells:", ncol(seu), "\n")
  cat("Transcripts:", nrow(seu), "\n")
  cat("Overdispersion stats:\n")
  cat("  Median:", median(fm$OverDisp_integrated), "\n")
  cat("  % > 1.5:", mean(fm$OverDisp_integrated > 1.5) * 100, "%\n")
  cat("  % > 2.0:", mean(fm$OverDisp_integrated > 2.0) * 100, "%\n")
  cat(strrep("=", 50), "\n")
}

Step 5: compare raw to OverDisp_boostrap-corrected counts

results_comprehensive <- analyze_bcv_comprehensive(
  objs = objs,
  output_dir = paste0(dir_out, "/bcv_analysis_comprehensive"),
  run_diagnostics = TRUE,
  correction_method = "direct"
)

============================================================ 
BCV ANALYSIS SUMMARY (between-sample)
============================================================ 

>>> RNA:
    samples: 4
    Transcripts analyzed: 14998
    Common BCV: 0.3653
    Median tagwise BCV: 0.2585
    BCV range: [0.0099, 1.4729]

>>> RNA_corr:
    samples: 4
    Transcripts analyzed: 14568
    Common BCV: 0.3573
    Median tagwise BCV: 0.2455
    BCV range: [0.0099, 1.4491]

------------------------------------------------------------ 
COMPARISON:
  Median BCV reduction: 5.01%
  Common BCV reduction: 2.19%
  → RNA_corr shows REDUCED biological variation ✓
============================================================ 


============================================================
BCV ANALYSIS SUMMARY (within-sample)
============================================================ 

>>> RNA:
    Number of samples analyzed: 4
    Median BCV across samples: 0.0099
    BCV range across samples: [0.0099, 0.0099]

>>> RNA_corr:
    Number of samples analyzed: 4
    Median BCV across samples: 0.0099
    BCV range across samples: [0.0099, 0.0099]

------------------------------------------------------------ 
COMPARISON (within-sample):
  Overall median BCV reduction: 0.00%
  samples showing reduction: 0/4
  → RNA_corr shows UNCHANGED biological variation
============================================================

Figure 3

$Figure 4$

Figure 4

Figure 5

scAmbi: Correcting mapping-ambiguity overdispersion in scRNA-seq

Sean T. Bresnahan

2025-08-14