scAmbi: Correcting mapping-ambiguity overdispersion in scRNA-seq

Sean T. Bresnahan

2025-08-14

Processing Alevin bootstraps for Seurat integration

When quantifying single-cell transcript expression, fragment assignment ambiguity and multi-mapping reads can introduce inferential variance in estimated counts. Salmon/Alevin can output per‑cell bootstrap replicates that capture inferential uncertainty due to fragments mapping to multiple transcripts of the same gene1–3. This uncertainty manifests as overdispersion which, if uncorrected, inflates variability estimates and distorts downstream analyses (differential expression, clustering, trajectory inference)4.

Here we present scAmbi, an integrated approach that (i) estimates per‑transcript overdispersion from Alevin bootstraps in a sparse‑aware way, (ii) combines that estimate with moment‑based and entropy-based complexity priors for further exploration, and (iii) applies the boostrap-based overdispersion result either by building a corrected assay (RNA_corr) or by supplying model offsets. We then evaluate the impact using BCV (biological coefficient of variation) analyses at the between‑sample and within‑sample (cell‑to‑cell) levels.

How this relates to edgeR::catchSalmon()

catchSalmon (bulk) computes per‑transcript technical OD from bootstraps across samples, applies EB moderation with a floor at 1, and analyzes scaled counts. Our approach is analogous but uses cells as replicates. Additionally, scAmbi adds a moment‑based component, an annotation prior, and an integrated overdispersion estimate (boostrap + moment-based + annotation prior) for exploration, and supports offset‑based workflows in addition to a corrected assay.

What’s unique about this package


Dependencies

Required R packages

This package requires R version 4.2 or higher.

Core dependencies

  • Seurat (>= 4.0.0): Single-cell data structures and workflows
  • Matrix (>= 1.3-0): Sparse matrix operations
  • edgeR (>= 3.34.0): Dispersion estimation and BCV calculations
  • Rcpp (>= 1.0.7): C++ integration for performance-critical operations

Data I/O and processing

  • eds (>= 1.0.0): Reading Alevin EDS sparse matrix format
  • tximport (>= 1.20.0): Transcript-level quantification import
  • rtracklayer (>= 1.52.0): GTF file parsing
  • jsonlite (>= 1.7.0): JSON data handling

Visualization

  • ggplot2 (>= 3.3.0): Core plotting framework
  • patchwork (>= 1.1.0): Combining multiple plots
  • dplyr (>= 1.0.0): Data manipulation for summaries
  • tidyr (>= 1.1.0): Data reshaping for visualization

Parallel processing

  • parallel: Built-in R package for multi-core processing

Development

These packages are necessary for building the vignettes and running tests. You can install them by running the following command:

install.packages(c("knitr", "rmarkdown", "testthat"))

System requirements


Installation

From a local source tarball/zip

# If you have scAmbi.zip or a source tar.gz:
install.packages("scAmbi.zip", repos = NULL, type = "source")
# or use devtools:
# devtools::install("path/to/scAmbi/")

Notation and variance model

The overdispersion correction approach implemented in scAmbi is motivated by the observation that transcript-level quantifications exhibit increased technical variance due to read-to-transcript ambiguity (RTA). While reads can typically be assigned unambiguously to genes, they are often compatible with multiple transcripts for the same gene, particularly for genes with many isoforms. This assignment uncertainty disrupts the mean-variance relationship normally observed in RNA-seq data and interferes with standard differential expression methods.

For gene (or transcript) \(g\) in cell \(i\): counts \(y_{gi}\), library size \(N_i\), underlying proportion \(\pi_{gi}\), mean \(\mu_{gi}=N_i\pi_{gi}\). Read-to-transcript ambiguity inflates technical variance beyond Poisson:

\(\mathbb{E}(y_{gi}\mid\pi_{gi})=\mu_{gi}, \qquad \mathrm{Var}(y_{gi}\mid\pi_{gi})=\sigma_g^2\,\mu_{gi}, \; \sigma_g^2>1\)

With biological variation across cells, \(\pi_{gi}\) varies and marginal variance becomes

\[\mathrm{Var}(y_{gi})\;=\;\sigma_g^2\,\mu_{gi}\; +\; \phi_g\,\mu_{gi}^2,\]

with \(\phi_g\) the NB-like biological dispersion4. The key insight is that the technical overdispersion parameter \(\sigma_g^2\) can be estimated from bootstrap samples and divided out of the transcript counts, yielding scaled counts that follow the standard negative binomial mean-variance relationship and can be analyzed with established differential expression methods.


Integrated overdispersion estimation

By default, scAmbi uses a bootstrap-based overdispersion estimate following Baldoni et al.4 for correction. For exploratory purposes, it also computes two additional components per transcript and an integrated estimate by combining the three components on the log scale.

1) Bootstrap-based component (sparse-aware)

Let \(u_{tib}\) be the Alevin bootstrap count for transcript \(t\), cell \(i\), bootstrap \(b = 1..B\).
For each cell we traverse the sparse bootstrap columns once, accumulating for each transcript:

For any transcript marked as seen in a cell, we set \(k = B\) (all replicates, zeros included), compute:

\[ \bar{u}_{ti} = \frac{1}{B} \sum_{b=1}^B u_{tib}, \quad s^2_{ti} = \frac{1}{B - 1} \sum_{b=1}^B \left(u_{tib} - \bar{u}_{ti}\right)^2 \]

and define the per-cell bootstrap OD inflation as:

\[ \mathrm{OD}_{t,i} = 1 + \frac{s^2_{ti}}{\bar{u}_{ti} + \varepsilon}. \]

We then accumulate across cells:
\(\mathrm{OD\_sum}_t = \sum_i \mathrm{OD}_{t,i}\),
\(\mathrm{DF}_t =\) number of contributing cells,
\(e_t =\) number of expressing cells (same as \(\mathrm{DF}_t\) here).

The raw transcript-level bootstrap OD is:

\[ \widehat{\mathrm{OD}}^{\mathrm{boot}}_t = \frac{\mathrm{OD\_sum}_t}{\mathrm{DF}_t}. \]

EB moderation. Let the prior be the median of values \(>1\) with prior df \(d_0=10\):

\[ \mathrm{OD}^{\mathrm{boot}}_t = \frac{d_0\,\mathrm{Prior} + \mathrm{DF}_t\,\widehat{\mathrm{OD}}^{\mathrm{boot}}_t}{d_0 + \mathrm{DF}_t}, \]

clamped to \(\ge 1\). Transcripts with \(e_t\) below the expression threshold default to 1.

Notes:
1. We use \(k = B\) for seen transcripts and compute an inflation factor \(1 + s^2/\mu\), ensuring that any non-zero bootstrap variance produces \(\mathrm{OD} > 1\).
2. If gene-level OD is desired, aggregate after estimating transcript-level OD (e.g., gene-wise median or counts-weighted mean of transcript ODs).
3. \(\mathrm{OD}^{\mathrm{boot}}_t\) is the default OD estimate for correction.

2) Moment-based component

The moment-based overdispersion estimate captures gene-specific variability patterns by comparing the observed variance to what would be expected under a Poisson model. Sequencing and amplification introduce technical noise that scales with expression level, while biological processes create additional variance through cell-to-cell differences in transcriptional state, cell cycle progression, and stochastic gene expression. Furthermore, low-abundance genes may be captured inconsistently across cells due to dropout effects, inflating their apparent variability.

The raw moment-based estimate uses counts across expressing cells only (non-zeros):

\[\widehat{\mathrm{OD}}^{\mathrm{mom}}_g = \frac{\mathrm{Var}(x_g) + \epsilon}{\mathbb{E}(x_g) + \epsilon}\]

where \(\epsilon\) provides numerical stabilization. When rel_eps > 0, both epsilon terms equal rel_eps * mean. When rel_eps = 0 (default), both epsilon terms equal abs_eps (default \(10^{-8}\)). Using only expressing cells avoids bias from technical zeros while capturing the true biological variability among cells that detectably express the gene.

To account for estimation uncertainty with small sample sizes, confidence-weighted shrinkage is applied toward the null value of 1:

\[\mathrm{OD}^{\mathrm{mom}}_g = \max\!\left(1,\;1+\frac{n_g}{n_g+k}\big(\widehat{\mathrm{OD}}^{\mathrm{mom}}_g-1\big)\right)\]

where \(n_g\) is the number of expressing cells and \(k\) is the shrinkage parameter (default 30). This shrinkage is stronger for genes with fewer expressing cells, reflecting lower confidence in variance estimates from small samples.

A key consideration is that this approach may reduce some genuine biological signal along with technical noise, particularly for genes undergoing continuous state transitions or exhibiting meaningful transcriptional heterogeneity. To address this, the complexity prior (Section 3) provides gene-specific adjustments that account for expected biological variability in structurally complex genes.

3) Complexity prior (entropy-based)

The complexity prior is biologically motivated by the observation that genes with multiple actively expressed isoforms exhibit increased expression variability due to cell-to-cell differences in splicing regulation and quantification uncertainty when similar isoforms compete for read assignment. The prior combines isoform count with expression evenness to capture this increased variability. For each gene \(g\), let \(k_g\) be the number of expressed isoforms (transcripts with mean expression above threshold) and \(E_g\) be the normalized Shannon entropy of isoform expression:

\[E_g = \frac{H_g}{\log(k_g)} = \frac{-\sum_{i} p_i \log(p_i)}{\log(k_g)}\]

where \(p_i\) is the proportion of total gene expression from isoform \(i\). The complexity prior for each transcript is then:

\[\mathrm{OD}^{\mathrm{prior}}_t = 1 + \alpha \cdot \log\big(1 + \max(k_g - 1, 0) \times \max(E_g, 0)\big)\]

where \(\alpha\) is a scaling parameter (default 0.6). This gives higher prior values to transcripts from genes with more expressed isoforms that have relatively even expression levels. Transcripts from single-isoform genes or unmapped transcripts receive a prior of 1.

Figure 1

Figure 1

Figure 2

Figure 2

4) Geometric fusion + adaptive shrinkage

With weights \(w_b=0.8\), \(w_m=0.1\), \(w_p=0.1\):

\[ \log \mathrm{OD}^{\mathrm{int}}_t = w_b\log\mathrm{OD}^{\mathrm{boot}}_t + w_m\log\mathrm{OD}^{\mathrm{mom}}_t + w_p\log\mathrm{OD}^{\mathrm{prior}}_t. \]

For additional stability we shrink toward the median of well-observed transcripts (e.g., \(e_t \ge 50\)) with weight \(\alpha_t = \min(e_t/100,1)\):

\[ \mathrm{OD}^{\mathrm{int,final}}_t = \alpha_t\,\mathrm{OD}^{\mathrm{int}}_t + (1-\alpha_t)\,\mathrm{median}\big\{\mathrm{OD}^{\mathrm{int}}_{e \ge 50}\big\}, \]

then clamp to \([1,100]\).


Applying the correction: assays and offsets

We expose multiple, interchangeable ways to use the OD estimates.

Note: \(\mathrm{OD}^{\mathrm{boot}}_t\) is the default OD estimate for correction.

A) Build a corrected assay (RNA_corr)

Form a left-diagonal scaling with entries \(1/\mathrm{OD}^{\mathrm{boot}}_t\):

\[ Z = \mathrm{diag}\left(\frac{1}{\mathrm{OD}^{\mathrm{boot}}}\right)\,Y, \]

and store it as a second assay RNA_corr alongside the raw RNA counts. Feature-level metadata (feature_meta) is stored in both assays with columns:

B) Use model offsets (no second assay required)

When fitting NB-GLMs (e.g., edgeR), you can supply transcript- and sample-specific offsets instead of modifying counts.

All methods are implemented in calculate_bcv() / analyze_bcv_comprehensive() via offset_method = "ratio" | "vector" | "overdispersion".


BCV analyses

We quantify how correction impacts variability using BCV from edgeR in two complementary modes.

1) Between‑sample BCV (pseudobulk across samples)

  1. Build a transcripts × samples pseudobulk by summing counts per sample.
  2. Option A - Direct: run BCV on RNA_corr pseudobulk. Option B - Offsets: run BCV on raw pseudobulk but add offsets:
    • offset_ratio: from pb_corr/pb_raw (transcript × sample).
    • offset_od: from an OD vector (applied to all samples).
    • offset_vector: from any feature_meta column.
  3. Plot tagwise BCV vs AveLogCPM with common and trended curves; compare RNA vs corrected.

Interpretation. Post‑correction, BCV typically decreases for ambiguous features and the abundance‑dependent trend stabilizes.

2) Within‑sample (cell‑to‑cell) BCV

  1. Randomly partition cells within a sample into \(G\) groups.
  2. Build a transcripts × groups pseudobulk by summing counts per group.
  3. Compute BCV per sample. Offsets may be added using any feature vector, mirroring the between‑sample options.
  4. Summarize across samples and visualize (per‑sample plots and an overall summary).

Interpretation. A successful correction reduces within‑sample (cell‑to‑cell) BCV, especially for highly ambiguous transcripts, without artificially flattening true biological variability.


Dataset and Alevin parameters

For the example dataset (four healthy term placenta samples) and Alevin command‑line options, see the original sections below (kept for continuity).

Dataset for example analysis

For this tutorial, we use scRNA‑seq from four healthy term placentas,5 specifically: SRR14134833, SRR14134834, SRR14134836, and SRR15193608. Libraries consisting of approximately 4.8M cells across these four samples were prepared using the 10x Genomics Single Cell 3′ Reagent Kit v3 and sequenced on an Illumina NovaSeq platform. Reads were trimmed with fastp and down-sampled to 200M reads each. Note these are 3’-scRNA-seq reads (Chromium v3) so there should be considerably less overdispersion than non-3’-selected preps.

Quantification in this tutorial uses our long‑read defined, placenta‑specific transcriptome reference6. Villous placenta RNA was used to prepare ONT cDNA libraries (PCS111), with raw signals basecalled by Guppy. Full‑length cDNA reads were oriented and trimmed with Pychopper2, then corrected for microindels using FMLRC2 with matched short‑read placenta RNA‑seq. Corrected reads were aligned to hg38 using minimap2; primary alignments across all samples were concatenated and assembled de novo with ESPRESSO. The assembly was characterized with SQANTI3 against GENCODE v45 to assign structural classes, and high‑confidence isoforms were identified using multiple orthogonal support signals including long‑read coverage, short‑read splice‑junction support, and proximity of TSS/TTS to placenta CAGE‑seq, DNase‑seq, and SAPAS sites.

Alevin parameters

salmon alevin \
  -l ISR \
  -1 example_1.fastq.gz \
  -2 example_2.fastq.gz \
  --chromiumV3 \
  -i index/transcripts \
  -p 10 \
  --whitelist index/3M-february-2018.txt \
  --numCellBootstraps 20 \
  --dumpFeatures \
  -o quants/example \
  --tgMap index/tx2g.tsv  # tx-to-tx identity table

Key options: --numCellBootstraps (bootstrap count), --dumpFeatures (emit quants_boot_mat.gz), and --tgMap (transcript‑to‑self for transcript‑level quantification).

Minimal usage example

Step 1: read data and compute integrated overdispersion for all samples

Set paths

dir_out  <- "/rsrch5/home/epi/bhattacharya_lab/data/Placenta_scRNAseq/analysis"
dir_data <- "/rsrch5/home/epi/bhattacharya_lab/data/Placenta_scRNAseq/salmon/quants"
SRA <- c("SRR14134833_test","SRR14134834_test","SRR14134836_test","SRR15193608_test")

gtf_file <- "/rsrch5/home/epi/bhattacharya_lab/data/Placenta_scRNAseq/salmon/ESPRESSO_corrected_SC_filtered.gtf"
transcript_info <- "/rsrch5/home/epi/bhattacharya_lab/data/Placenta_scRNAseq/salmon/tx2g_header.tsv"

files <- paste(dir_data, SRA, "alevin/quants_mat.gz", sep = "/")

Run

.log("Starting improved overdispersion estimation for %d samples", length(SRA))
all_sample_data <- lapply(
  SRA,
  function(s) {
    .log("Processing sample: %s", s)
    read_sample_data_improved(
      sample_id = s,
      base_dir  = dir_data,
      n_boot    = 20,
      gtf_file  = gtf_file,
      transcript_info = transcript_info,
      block_cells = 32,
      n_cores = 8,
      omp_threads = 1,
      debug_first_block = TRUE  # set TRUE once to sanity-check bounds
    )
  }
)
gc()

Step 2: create Seurat objects using the improved overdispersion estimates

objs <- lapply(
  1:length(SRA),
  function(i) {
    s_id <- SRA[i]
    s_data <- all_sample_data[[i]]
    process_and_create_seurat_corrected_improved(
      sample_id = s_id,
      counts = s_data$counts,
      od = s_data$od,
      feats = s_data$feats,
      cells = s_data$cells
    )
  }
)
.log("Created %d Seurat objects with improved overdispersion correction", length(objs))
gc()

Step 3 (optional): add feature IDs from tx2g

objs <- set_feature_metadata(objs,transcript_info)
gc()

Step 4: print summary statistics

for (i in 1:length(objs)) {
  seu <- objs[[i]]
  fm <- seu@assays[["RNA_corr"]]@misc[["feature_meta"]]
  
  cat("\n", strrep("=", 50), "\n")
  cat("Sample:", SRA[i], "\n")
  cat("Cells:", ncol(seu), "\n")
  cat("Transcripts:", nrow(seu), "\n")
  cat("Overdispersion stats:\n")
  cat("  Median:", median(fm$OverDisp_integrated), "\n")
  cat("  % > 1.5:", mean(fm$OverDisp_integrated > 1.5) * 100, "%\n")
  cat("  % > 2.0:", mean(fm$OverDisp_integrated > 2.0) * 100, "%\n")
  cat(strrep("=", 50), "\n")
}

Step 5: compare raw to OverDisp_boostrap-corrected counts

results_comprehensive <- analyze_bcv_comprehensive(
  objs = objs,
  output_dir = paste0(dir_out, "/bcv_analysis_comprehensive"),
  run_diagnostics = TRUE,
  correction_method = "direct"
)
============================================================ 
BCV ANALYSIS SUMMARY (between-sample)
============================================================ 

>>> RNA:
    samples: 4
    Transcripts analyzed: 14998
    Common BCV: 0.3653
    Median tagwise BCV: 0.2585
    BCV range: [0.0099, 1.4729]

>>> RNA_corr:
    samples: 4
    Transcripts analyzed: 14568
    Common BCV: 0.3573
    Median tagwise BCV: 0.2455
    BCV range: [0.0099, 1.4491]

------------------------------------------------------------ 
COMPARISON:
  Median BCV reduction: 5.01%
  Common BCV reduction: 2.19%
  → RNA_corr shows REDUCED biological variation ✓
============================================================ 


============================================================
BCV ANALYSIS SUMMARY (within-sample)
============================================================ 

>>> RNA:
    Number of samples analyzed: 4
    Median BCV across samples: 0.0099
    BCV range across samples: [0.0099, 0.0099]

>>> RNA_corr:
    Number of samples analyzed: 4
    Median BCV across samples: 0.0099
    BCV range across samples: [0.0099, 0.0099]

------------------------------------------------------------ 
COMPARISON (within-sample):
  Overall median BCV reduction: 0.00%
  samples showing reduction: 0/4
  → RNA_corr shows UNCHANGED biological variation
============================================================ 
Figure 3

Figure 3

Figure 4

Figure 4

Figure 5

Figure 5


References

1.
Srivastava, A., Malik, L., Sarkar, H. & Patro, R. A Bayesian framework for inter-cellular information sharing improves dscRNA-seq quantification. Bioinformatics 36, i292–i299 (2020).
2.
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14, 417–419 (2017).
3.
Srivastava, A., Malik, L., Smith, T., Sudbery, I. & Patro, R. Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biology 20, 65 (2019).
4.
5.
6.
Bresnahan, S. T. et al. Long-read transcriptome assembly reveals vast isoform diversity in the placenta associated with metabolic and endocrine function. bioRxiv (2025) doi:10.1101/2025.06.26.661362.