When quantifying single-cell transcript expression, fragment assignment ambiguity and multi-mapping reads can introduce inferential variance in estimated counts. Salmon/Alevin can output per‑cell bootstrap replicates that capture inferential uncertainty due to fragments mapping to multiple transcripts of the same gene1–3. This uncertainty manifests as overdispersion which, if uncorrected, inflates variability estimates and distorts downstream analyses (differential expression, clustering, trajectory inference)4.
Here we present scAmbi, an integrated
approach that (i) estimates per‑transcript overdispersion from Alevin
bootstraps in a sparse‑aware way, (ii) combines that estimate with
moment‑based and entropy-based complexity priors for further
exploration, and (iii) applies the boostrap-based overdispersion result
either by building a corrected assay
(RNA_corr
) or by supplying model offsets.
We then evaluate the impact using BCV (biological
coefficient of variation) analyses at the between‑sample and
within‑sample (cell‑to‑cell) levels.
edgeR::catchSalmon()
catchSalmon
(bulk) computes per‑transcript technical OD
from bootstraps across samples, applies EB moderation
with a floor at 1, and analyzes scaled counts. Our
approach is analogous but uses cells as replicates.
Additionally, scAmbi adds a moment‑based component, an
annotation prior, and an integrated
overdispersion estimate (boostrap + moment-based + annotation
prior) for exploration, and supports offset‑based
workflows in addition to a corrected assay.
RNA_corr
directly, or supply offsets constructed from (a) corrected/raw ratios,
(b) OD vectors, or (c) any feature vector stored in
feature_meta
.This package requires R version 4.2 or higher.
The overdispersion correction approach implemented in scAmbi is motivated by the observation that transcript-level quantifications exhibit increased technical variance due to read-to-transcript ambiguity (RTA). While reads can typically be assigned unambiguously to genes, they are often compatible with multiple transcripts for the same gene, particularly for genes with many isoforms. This assignment uncertainty disrupts the mean-variance relationship normally observed in RNA-seq data and interferes with standard differential expression methods.
For gene (or transcript) \(g\) in cell \(i\): counts \(y_{gi}\), library size \(N_i\), underlying proportion \(\pi_{gi}\), mean \(\mu_{gi}=N_i\pi_{gi}\). Read-to-transcript ambiguity inflates technical variance beyond Poisson:
\(\mathbb{E}(y_{gi}\mid\pi_{gi})=\mu_{gi}, \qquad \mathrm{Var}(y_{gi}\mid\pi_{gi})=\sigma_g^2\,\mu_{gi}, \; \sigma_g^2>1\)
With biological variation across cells, \(\pi_{gi}\) varies and marginal variance becomes
\[\mathrm{Var}(y_{gi})\;=\;\sigma_g^2\,\mu_{gi}\; +\; \phi_g\,\mu_{gi}^2,\]
with \(\phi_g\) the NB-like biological dispersion4. The key insight is that the technical overdispersion parameter \(\sigma_g^2\) can be estimated from bootstrap samples and divided out of the transcript counts, yielding scaled counts that follow the standard negative binomial mean-variance relationship and can be analyzed with established differential expression methods.
By default, scAmbi uses a bootstrap-based overdispersion estimate following Baldoni et al.4 for correction. For exploratory purposes, it also computes two additional components per transcript and an integrated estimate by combining the three components on the log scale.
Let \(u_{tib}\) be the Alevin
bootstrap count for transcript \(t\), cell \(i\), bootstrap \(b = 1..B\).
For each cell we traverse the sparse bootstrap columns once,
accumulating for each transcript:
For any transcript marked as seen in a cell, we set \(k = B\) (all replicates, zeros included), compute:
\[ \bar{u}_{ti} = \frac{1}{B} \sum_{b=1}^B u_{tib}, \quad s^2_{ti} = \frac{1}{B - 1} \sum_{b=1}^B \left(u_{tib} - \bar{u}_{ti}\right)^2 \]
and define the per-cell bootstrap OD inflation as:
\[ \mathrm{OD}_{t,i} = 1 + \frac{s^2_{ti}}{\bar{u}_{ti} + \varepsilon}. \]
We then accumulate across cells:
\(\mathrm{OD\_sum}_t = \sum_i
\mathrm{OD}_{t,i}\),
\(\mathrm{DF}_t =\) number of
contributing cells,
\(e_t =\) number of expressing cells
(same as \(\mathrm{DF}_t\) here).
The raw transcript-level bootstrap OD is:
\[ \widehat{\mathrm{OD}}^{\mathrm{boot}}_t = \frac{\mathrm{OD\_sum}_t}{\mathrm{DF}_t}. \]
EB moderation. Let the prior be the median of values \(>1\) with prior df \(d_0=10\):
\[ \mathrm{OD}^{\mathrm{boot}}_t = \frac{d_0\,\mathrm{Prior} + \mathrm{DF}_t\,\widehat{\mathrm{OD}}^{\mathrm{boot}}_t}{d_0 + \mathrm{DF}_t}, \]
clamped to \(\ge 1\). Transcripts with \(e_t\) below the expression threshold default to 1.
Notes:
1. We use \(k = B\) for seen
transcripts and compute an inflation factor \(1 + s^2/\mu\), ensuring that any non-zero
bootstrap variance produces \(\mathrm{OD} >
1\).
2. If gene-level OD is desired, aggregate
after estimating transcript-level OD (e.g., gene-wise median or
counts-weighted mean of transcript ODs).
3. \(\mathrm{OD}^{\mathrm{boot}}_t\) is
the default OD estimate for correction.
The moment-based overdispersion estimate captures gene-specific variability patterns by comparing the observed variance to what would be expected under a Poisson model. Sequencing and amplification introduce technical noise that scales with expression level, while biological processes create additional variance through cell-to-cell differences in transcriptional state, cell cycle progression, and stochastic gene expression. Furthermore, low-abundance genes may be captured inconsistently across cells due to dropout effects, inflating their apparent variability.
The raw moment-based estimate uses counts across expressing cells only (non-zeros):
\[\widehat{\mathrm{OD}}^{\mathrm{mom}}_g = \frac{\mathrm{Var}(x_g) + \epsilon}{\mathbb{E}(x_g) + \epsilon}\]
where \(\epsilon\) provides
numerical stabilization. When rel_eps > 0
, both epsilon
terms equal rel_eps * mean
. When rel_eps = 0
(default), both epsilon terms equal abs_eps
(default \(10^{-8}\)). Using only expressing cells
avoids bias from technical zeros while capturing the true biological
variability among cells that detectably express the gene.
To account for estimation uncertainty with small sample sizes, confidence-weighted shrinkage is applied toward the null value of 1:
\[\mathrm{OD}^{\mathrm{mom}}_g = \max\!\left(1,\;1+\frac{n_g}{n_g+k}\big(\widehat{\mathrm{OD}}^{\mathrm{mom}}_g-1\big)\right)\]
where \(n_g\) is the number of expressing cells and \(k\) is the shrinkage parameter (default 30). This shrinkage is stronger for genes with fewer expressing cells, reflecting lower confidence in variance estimates from small samples.
A key consideration is that this approach may reduce some genuine biological signal along with technical noise, particularly for genes undergoing continuous state transitions or exhibiting meaningful transcriptional heterogeneity. To address this, the complexity prior (Section 3) provides gene-specific adjustments that account for expected biological variability in structurally complex genes.
The complexity prior is biologically motivated by the observation that genes with multiple actively expressed isoforms exhibit increased expression variability due to cell-to-cell differences in splicing regulation and quantification uncertainty when similar isoforms compete for read assignment. The prior combines isoform count with expression evenness to capture this increased variability. For each gene \(g\), let \(k_g\) be the number of expressed isoforms (transcripts with mean expression above threshold) and \(E_g\) be the normalized Shannon entropy of isoform expression:
\[E_g = \frac{H_g}{\log(k_g)} = \frac{-\sum_{i} p_i \log(p_i)}{\log(k_g)}\]
where \(p_i\) is the proportion of total gene expression from isoform \(i\). The complexity prior for each transcript is then:
\[\mathrm{OD}^{\mathrm{prior}}_t = 1 + \alpha \cdot \log\big(1 + \max(k_g - 1, 0) \times \max(E_g, 0)\big)\]
where \(\alpha\) is a scaling parameter (default 0.6). This gives higher prior values to transcripts from genes with more expressed isoforms that have relatively even expression levels. Transcripts from single-isoform genes or unmapped transcripts receive a prior of 1.
Figure 1
Figure 2
With weights \(w_b=0.8\), \(w_m=0.1\), \(w_p=0.1\):
\[ \log \mathrm{OD}^{\mathrm{int}}_t = w_b\log\mathrm{OD}^{\mathrm{boot}}_t + w_m\log\mathrm{OD}^{\mathrm{mom}}_t + w_p\log\mathrm{OD}^{\mathrm{prior}}_t. \]
For additional stability we shrink toward the median of well-observed transcripts (e.g., \(e_t \ge 50\)) with weight \(\alpha_t = \min(e_t/100,1)\):
\[ \mathrm{OD}^{\mathrm{int,final}}_t = \alpha_t\,\mathrm{OD}^{\mathrm{int}}_t + (1-\alpha_t)\,\mathrm{median}\big\{\mathrm{OD}^{\mathrm{int}}_{e \ge 50}\big\}, \]
then clamp to \([1,100]\).
We expose multiple, interchangeable ways to use the OD estimates.
Note: \(\mathrm{OD}^{\mathrm{boot}}_t\) is the default OD estimate for correction.
RNA_corr
)Form a left-diagonal scaling with entries \(1/\mathrm{OD}^{\mathrm{boot}}_t\):
\[ Z = \mathrm{diag}\left(\frac{1}{\mathrm{OD}^{\mathrm{boot}}}\right)\,Y, \]
and store it as a second assay RNA_corr
alongside the
raw RNA
counts. Feature-level metadata
(feature_meta
) is stored in both assays with columns:
OverDisp_integrated
, OverDisp_bootstrap
,
OverDisp_moments
, OverDisp_prior
expressing_cells
, scaling_factor
(=
OverDisp_integrated
), inv_scaling
(=
1/scaling_factor
)When fitting NB-GLMs (e.g., edgeR), you can supply transcript- and sample-specific offsets instead of modifying counts.
RNA_corr
exists): Use \(\log\big(\text{corrected}/\text{raw}\big)\)
per transcript × sample, i.e., offsets from the pseudobulk ratio
pb_corr/pb_raw
.feature_meta
can be used. Interpret as:
vector_as = "od"
for OverDisp_*
or
scaling_factor
.vector_as = "ratio"
for inv_scaling
(since
RNA_corr = RNA * inv_scaling
).All methods are implemented in calculate_bcv()
/
analyze_bcv_comprehensive()
via
offset_method = "ratio" | "vector" | "overdispersion"
.
We quantify how correction impacts variability using BCV from edgeR in two complementary modes.
RNA_corr
pseudobulk. Option B - Offsets: run BCV on raw
pseudobulk but add offsets:
offset_ratio
: from pb_corr/pb_raw
(transcript × sample).offset_od
: from an OD vector (applied to all
samples).offset_vector
: from any feature_meta
column.Interpretation. Post‑correction, BCV typically decreases for ambiguous features and the abundance‑dependent trend stabilizes.
Interpretation. A successful correction reduces within‑sample (cell‑to‑cell) BCV, especially for highly ambiguous transcripts, without artificially flattening true biological variability.
For the example dataset (four healthy term placenta samples) and Alevin command‑line options, see the original sections below (kept for continuity).
For this tutorial, we use scRNA‑seq from four healthy term placentas,5 specifically: SRR14134833, SRR14134834, SRR14134836, and SRR15193608. Libraries consisting of approximately 4.8M cells across these four samples were prepared using the 10x Genomics Single Cell 3′ Reagent Kit v3 and sequenced on an Illumina NovaSeq platform. Reads were trimmed with fastp and down-sampled to 200M reads each. Note these are 3’-scRNA-seq reads (Chromium v3) so there should be considerably less overdispersion than non-3’-selected preps.
Quantification in this tutorial uses our long‑read defined, placenta‑specific transcriptome reference6. Villous placenta RNA was used to prepare ONT cDNA libraries (PCS111), with raw signals basecalled by Guppy. Full‑length cDNA reads were oriented and trimmed with Pychopper2, then corrected for microindels using FMLRC2 with matched short‑read placenta RNA‑seq. Corrected reads were aligned to hg38 using minimap2; primary alignments across all samples were concatenated and assembled de novo with ESPRESSO. The assembly was characterized with SQANTI3 against GENCODE v45 to assign structural classes, and high‑confidence isoforms were identified using multiple orthogonal support signals including long‑read coverage, short‑read splice‑junction support, and proximity of TSS/TTS to placenta CAGE‑seq, DNase‑seq, and SAPAS sites.
salmon alevin \
-l ISR \
-1 example_1.fastq.gz \
-2 example_2.fastq.gz \
--chromiumV3 \
-i index/transcripts \
-p 10 \
--whitelist index/3M-february-2018.txt \
--numCellBootstraps 20 \
--dumpFeatures \
-o quants/example \
--tgMap index/tx2g.tsv # tx-to-tx identity table
Key options: --numCellBootstraps
(bootstrap count),
--dumpFeatures
(emit quants_boot_mat.gz
), and
--tgMap
(transcript‑to‑self for transcript‑level
quantification).
dir_out <- "/rsrch5/home/epi/bhattacharya_lab/data/Placenta_scRNAseq/analysis"
dir_data <- "/rsrch5/home/epi/bhattacharya_lab/data/Placenta_scRNAseq/salmon/quants"
SRA <- c("SRR14134833_test","SRR14134834_test","SRR14134836_test","SRR15193608_test")
gtf_file <- "/rsrch5/home/epi/bhattacharya_lab/data/Placenta_scRNAseq/salmon/ESPRESSO_corrected_SC_filtered.gtf"
transcript_info <- "/rsrch5/home/epi/bhattacharya_lab/data/Placenta_scRNAseq/salmon/tx2g_header.tsv"
files <- paste(dir_data, SRA, "alevin/quants_mat.gz", sep = "/")
.log("Starting improved overdispersion estimation for %d samples", length(SRA))
all_sample_data <- lapply(
SRA,
function(s) {
.log("Processing sample: %s", s)
read_sample_data_improved(
sample_id = s,
base_dir = dir_data,
n_boot = 20,
gtf_file = gtf_file,
transcript_info = transcript_info,
block_cells = 32,
n_cores = 8,
omp_threads = 1,
debug_first_block = TRUE # set TRUE once to sanity-check bounds
)
}
)
gc()
objs <- lapply(
1:length(SRA),
function(i) {
s_id <- SRA[i]
s_data <- all_sample_data[[i]]
process_and_create_seurat_corrected_improved(
sample_id = s_id,
counts = s_data$counts,
od = s_data$od,
feats = s_data$feats,
cells = s_data$cells
)
}
)
.log("Created %d Seurat objects with improved overdispersion correction", length(objs))
gc()
for (i in 1:length(objs)) {
seu <- objs[[i]]
fm <- seu@assays[["RNA_corr"]]@misc[["feature_meta"]]
cat("\n", strrep("=", 50), "\n")
cat("Sample:", SRA[i], "\n")
cat("Cells:", ncol(seu), "\n")
cat("Transcripts:", nrow(seu), "\n")
cat("Overdispersion stats:\n")
cat(" Median:", median(fm$OverDisp_integrated), "\n")
cat(" % > 1.5:", mean(fm$OverDisp_integrated > 1.5) * 100, "%\n")
cat(" % > 2.0:", mean(fm$OverDisp_integrated > 2.0) * 100, "%\n")
cat(strrep("=", 50), "\n")
}
results_comprehensive <- analyze_bcv_comprehensive(
objs = objs,
output_dir = paste0(dir_out, "/bcv_analysis_comprehensive"),
run_diagnostics = TRUE,
correction_method = "direct"
)
============================================================
BCV ANALYSIS SUMMARY (between-sample)
============================================================
>>> RNA:
samples: 4
Transcripts analyzed: 14998
Common BCV: 0.3653
Median tagwise BCV: 0.2585
BCV range: [0.0099, 1.4729]
>>> RNA_corr:
samples: 4
Transcripts analyzed: 14568
Common BCV: 0.3573
Median tagwise BCV: 0.2455
BCV range: [0.0099, 1.4491]
------------------------------------------------------------
COMPARISON:
Median BCV reduction: 5.01%
Common BCV reduction: 2.19%
→ RNA_corr shows REDUCED biological variation ✓
============================================================
============================================================
BCV ANALYSIS SUMMARY (within-sample)
============================================================
>>> RNA:
Number of samples analyzed: 4
Median BCV across samples: 0.0099
BCV range across samples: [0.0099, 0.0099]
>>> RNA_corr:
Number of samples analyzed: 4
Median BCV across samples: 0.0099
BCV range across samples: [0.0099, 0.0099]
------------------------------------------------------------
COMPARISON (within-sample):
Overall median BCV reduction: 0.00%
samples showing reduction: 0/4
→ RNA_corr shows UNCHANGED biological variation
============================================================
Figure 3
Figure 4
Figure 5