1 Introduction

spoon is a method to address the mean-variance relationship in spatially resolved transcriptomics data. Current approaches rank spatially variable genes based on either p-values or some effect size, such as the proportion of spatially variable genes. However, previous work in RNA-sequencing has shown that a technical bias, referred to as the “mean-variance relationship”, exists in these data in that the gene-level variance is correlated with mean RNA expression. We found that there is a “mean-variance relationship” in spatial transcriptomics data, and so we propose spoon, a statistical framework to prioritize spatially variable genes that is not confounded by this relationship. We fit a spline curve to estimate the mean-variance relationship. Then, similar to using weights in a weighted least squares model, we used weights that we plugged into a Gaussian Process Regression model fit with a nearest-neighbor Gaussian process model to the preprocessed expression measurements for each gene, i.e. one model per gene. spoon removes the bias and leads to a more informative set of spatially variable genes.

The generate_weights() function calculates individual observation weights, where an individual observation is a UMI (unique molecular identifier) count value for a specific gene and sample. If the desired SVG detection method accepts weights, then the individual observation weights can be used as inputs. If the desired SVG detection method does not accept weights, then the Delta method is leveraged to rescale the data and covariates by the weights. These scaled data and covariates are used as inputs into the desired SVG detection function.

Bioconductor houses the infrastructure to store and analyze spatially resolved transcriptomics data for R users, including many SVG detection methods. This method addresses the mean-variance relationship confounding SVG detection, which is related to these other Bioconductor packages. Additionally, spoon is inspired by limma::voom() , which is a popular Bioconductor package.

2 Installation

The following code will install the latest release version of the spoon package from Bioconductor. Additional details are shown on the Bioconductor page.

install.packages("BiocManager")
BiocManager::install("spoon")

The latest development version can also be installed from the devel version of Bioconductor or from GitHub.

3 Input data format

We recommend the input data be provided as a SpatialExperiment Bioconductor object. The outputs are stored in the rowData of the SpatialExperiment object. The examples below use this input data format.

The inputs can also be provided as a numeric matrix of raw counts and a numeric matrix of spatial coordinates.

4 Tutorial

Load packages and data

library(nnSVG)
library(STexampleData)
## Loading required package: ExperimentHub
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
## Loading required package: SingleCellExperiment
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
## The following object is masked from 'package:ExperimentHub':
## 
##     cache
## The following object is masked from 'package:AnnotationHub':
## 
##     cache
## Loading required package: SpatialExperiment
library(SpatialExperiment)
library(BRISC)
## Loading required package: RANN
## Loading required package: parallel
## Loading required package: rdist
## Loading required package: pbapply
## The ordering of inputs x (covariates) and y (response) in BRISC_estimation has been changed BRISC 1.0.0 onwards.
##   Please check the new documentation with ?BRISC_estimation.
library(BiocParallel)
library(scuttle)
library(Matrix)
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
## 
##     expand
library(spoon)

spe <- Visium_mouseCoronal()
## see ?STexampleData and browseVignettes('STexampleData') for documentation
## loading from cache

Preprocessing

# keep spots over tissue
spe <- spe[, colData(spe)$in_tissue == 1]

# filter out low quality genes
spe <- filter_genes(spe)
## Gene filtering: removing mitochondrial genes
## removed 13 mitochondrial genes
## Gene filtering: retaining genes with at least 3 counts in at least 0.5% (n = 14) of spatial locations
## removed 21883 out of 32272 genes due to low expression
# calculate logcounts (log-transformed normalized counts) using scran package
spe <- computeLibraryFactors(spe)
spe <- logNormCounts(spe)

# choose a small number of genes for this example to run quickly
set.seed(3)
ix_random <- sample(seq_len(nrow(spe)), 10)
spe <- spe[ix_random, ]

# remove spots with zero counts
spe <- spe[, colSums(logcounts(spe)) > 0]

Step 1: generate weights

weights <- generate_weights(input = spe,
                            stabilize = TRUE,
                            BPPARAM = MulticoreParam(workers = 1,
                                                     RNGseed = 4))
## 21.8962962962963% of observations had their weight stabilized

Step 2: weighted SVG detection

spe <- weighted_nnSVG(input = spe,
                      w = weights,
                      BPPARAM = MulticoreParam(workers = 1, RNGseed = 5))
## Warning in nnSVG(spe[i, ], X = matrix(w[, i]), assay_name =
## "weighted_logcounts"): Rows (genes) and/or columns (spots) containing all zero
## counts have been found. Please see examples in tutorial for code to filter out
## zeros and/or low-expressed genes to avoid errors.
## Warning in nnSVG(spe[i, ], X = matrix(w[, i]), assay_name =
## "weighted_logcounts"): Rows (genes) and/or columns (spots) containing all zero
## counts have been found. Please see examples in tutorial for code to filter out
## zeros and/or low-expressed genes to avoid errors.
## Warning in nnSVG(spe[i, ], X = matrix(w[, i]), assay_name =
## "weighted_logcounts"): Rows (genes) and/or columns (spots) containing all zero
## counts have been found. Please see examples in tutorial for code to filter out
## zeros and/or low-expressed genes to avoid errors.
## Warning in nnSVG(spe[i, ], X = matrix(w[, i]), assay_name =
## "weighted_logcounts"): Rows (genes) and/or columns (spots) containing all zero
## counts have been found. Please see examples in tutorial for code to filter out
## zeros and/or low-expressed genes to avoid errors.
## Warning in nnSVG(spe[i, ], X = matrix(w[, i]), assay_name =
## "weighted_logcounts"): Rows (genes) and/or columns (spots) containing all zero
## counts have been found. Please see examples in tutorial for code to filter out
## zeros and/or low-expressed genes to avoid errors.
## Warning in nnSVG(spe[i, ], X = matrix(w[, i]), assay_name =
## "weighted_logcounts"): Rows (genes) and/or columns (spots) containing all zero
## counts have been found. Please see examples in tutorial for code to filter out
## zeros and/or low-expressed genes to avoid errors.
## Warning in nnSVG(spe[i, ], X = matrix(w[, i]), assay_name =
## "weighted_logcounts"): Rows (genes) and/or columns (spots) containing all zero
## counts have been found. Please see examples in tutorial for code to filter out
## zeros and/or low-expressed genes to avoid errors.
## Warning in nnSVG(spe[i, ], X = matrix(w[, i]), assay_name =
## "weighted_logcounts"): Rows (genes) and/or columns (spots) containing all zero
## counts have been found. Please see examples in tutorial for code to filter out
## zeros and/or low-expressed genes to avoid errors.
## Warning in nnSVG(spe[i, ], X = matrix(w[, i]), assay_name =
## "weighted_logcounts"): Rows (genes) and/or columns (spots) containing all zero
## counts have been found. Please see examples in tutorial for code to filter out
## zeros and/or low-expressed genes to avoid errors.
## Warning in nnSVG(spe[i, ], X = matrix(w[, i]), assay_name =
## "weighted_logcounts"): Rows (genes) and/or columns (spots) containing all zero
## counts have been found. Please see examples in tutorial for code to filter out
## zeros and/or low-expressed genes to avoid errors.

Show results

# display results
rowData(spe)
## DataFrame with 10 rows and 11 columns
##                               gene_id   gene_name    feature_type weighted_mean
##                           <character> <character>     <character>     <numeric>
## ENSMUSG00000030282 ENSMUSG00000030282        Cmas Gene Expression      2.648394
## ENSMUSG00000022601 ENSMUSG00000022601      Zbtb11 Gene Expression      0.950334
## ENSMUSG00000040220 ENSMUSG00000040220        Gas8 Gene Expression      0.403708
## ENSMUSG00000020704 ENSMUSG00000020704       Asic2 Gene Expression      1.320342
## ENSMUSG00000019173 ENSMUSG00000019173       Rab5c Gene Expression      2.830449
## ENSMUSG00000042156 ENSMUSG00000042156       Dzip1 Gene Expression      0.919617
## ENSMUSG00000050856 ENSMUSG00000050856       Atp5k Gene Expression     40.248320
## ENSMUSG00000033594 ENSMUSG00000033594     Spata2l Gene Expression      0.769744
## ENSMUSG00000037003 ENSMUSG00000037003        Tns2 Gene Expression      0.362365
## ENSMUSG00000026817 ENSMUSG00000026817         Ak1 Gene Expression      2.459034
##                    weighted_LR_stat weighted_sigma.sq weighted_tau.sq
##                           <numeric>         <numeric>       <numeric>
## ENSMUSG00000030282        502.84365         1.2603184        1.513976
## ENSMUSG00000022601        179.37314         0.7362638        1.079309
## ENSMUSG00000040220          7.10898         0.0161403        0.763034
## ENSMUSG00000020704        378.39877         1.2209175        1.177762
## ENSMUSG00000019173         38.96783         0.1380797        1.397703
## ENSMUSG00000042156         87.34275         0.1366749        1.315813
## ENSMUSG00000050856        141.25776         3.4727307       11.651505
## ENSMUSG00000033594        308.28753         0.6077145        1.038927
## ENSMUSG00000037003        228.10702         0.8156725        0.582560
## ENSMUSG00000026817        190.23942         0.3057840        1.490160
##                    weighted_prop_sv weighted_phi weighted_padj weighted_rank
##                           <numeric>    <numeric>     <numeric>     <numeric>
## ENSMUSG00000030282        0.4542842     6.153219   0.00000e+00             1
## ENSMUSG00000022601        0.4055269     2.217433   0.00000e+00             6
## ENSMUSG00000040220        0.0207147     2.117323   2.85959e-02            10
## ENSMUSG00000020704        0.5089957     1.872956   0.00000e+00             2
## ENSMUSG00000019173        0.0899084    16.532217   3.45337e-09             9
## ENSMUSG00000042156        0.0940971     4.405542   0.00000e+00             8
## ENSMUSG00000050856        0.2296136    27.166404   0.00000e+00             7
## ENSMUSG00000033594        0.3690630     2.204055   0.00000e+00             3
## ENSMUSG00000037003        0.5833598     0.179875   0.00000e+00             4
## ENSMUSG00000026817        0.1702636    11.461148   0.00000e+00             5

Other SVG detection tools

We provided a function to use the weights with nnSVG for more accurate spatially variable gene detection. The weights can also be used with other spatially variable gene detection tools using the following procedure:

assay_name <- "logcounts"
weighted_logcounts <- t(weights)*assays(spe)[[assay_name]]
assay(spe, "weighted_logcounts") <- weighted_logcounts

weighted_logcounts can be accessed from assay(spe, "weighted_logcounts"). Then, weighted_logcounts should be used as the input counts matrix and weights as the input covariate matrix in a spatially variable detection tool.

5 Adressing the mean-variance relationship

In the Tutorial section, we showed how to use the functions in spoon on a small number of genes for a faster runtime. This section will show how these methods address the mean-variance relationship in spatial transcriptomics data. The code below takes several hours to run, but can be reproduced if desired.

Simulate data

library(SpatialExperiment)
library(STexampleData)
library(MASS)
library(scuttle)

set.seed(1)

#4992 spots and 300 genes

n_genes <- 300
fraction <- 0.5
max_sigma.sq <- 1
low_range_beta <- c(0.5,1)

#check if integer
stopifnot(n_genes*fraction*0.5 == round(n_genes*fraction*0.5))

#all genes have some nonzero sigma.sq
sigma.sq <- runif(n_genes, 0.2, max_sigma.sq)
ground_truth_rank <- rank(-sigma.sq)

#all genes will have nonzero beta values
beta <- runif(n_genes, log(low_range_beta[1]), log(low_range_beta[2]))

#choose fixed length scale parameter (~medium from nnSVG paper)

scale_length <- 200

params <- data.frame(sigma.sq, beta)

plot(beta, sigma.sq)

#sampling from a poisson distribution - mean controls variance, so we don't specify tau.sq:
#step 1: use ST example distance matrix instead of creating a new one (Euclidean distance)

spe_demo <- Visium_humanDLPFC()
points_coord <- spatialCoords(spe_demo)
n_points <- nrow(points_coord)

pair.points <- cbind(
  matrix( rep(points_coord, each = n_points), ncol = 2, byrow = FALSE),
  rep(1, times = n_points) %x% points_coord # Creating the combinations using kronecker product.
) |> data.frame()
colnames(pair.points) <- c("si.x", "si.y", "sj.x", "sj.y")

#step 2: calculate gaussian process/kernel 

kernel.fun <- function(si.x, si.y, sj.x, sj.y,  l = 0.2){
  exp(-1*sqrt(((si.x-sj.x)^2+(si.y-sj.y)^2))/l)
}

C_theta <- with(pair.points, kernel.fun(si.x, si.y, sj.x, sj.y, l = scale_length)) |> 
  matrix(nrow = n_points, ncol = n_points)

counts <- matrix(NA, nrow = n_genes, ncol = n_points)
eta_list <- list()

for (i in c(1:n_genes)) {
  
  sigma.sq_i <- sigma.sq[i]
  beta_i <- beta[i]

  #step 3: simulate gaussian process per gene
  
  gp_dat <- mvrnorm(n = 1, rep(0,n_points), sigma.sq_i* C_theta) 

  #step 4: calculate lambda = exp(beta + gaussian process) per gene
  
  lambda_i <- exp(gp_dat + beta_i)

  #step 5: use rpois() to simulate 4992 values per gene

  counts_i <- rpois(n = n_points, lambda_i)
  
  #put all counts in matrix 
  #orientation: genes x spots
  
  counts[i,] <- counts_i
}

#create spe using counts and distance matrix

spe <- SpatialExperiment(
    assays = list(counts = counts),
    spatialCoords = points_coord)

rowData(spe)$ground_truth <- ground_truth
rowData(spe)$ground_truth_rank <- ground_truth_rank
rowData(spe)$ground_truth_sigma.sq <- sigma.sq
rowData(spe)$ground_truth_beta <- beta

Generate weights and SVG detection

library(SpatialExperiment)
library(nnSVG)
library(BRISC)
library(BiocParallel)
library(scuttle)

spe <- spe[, colSums(counts(spe)) > 0]
spe <- logNormCounts(spe)
weights <- generate_weights(input = spe, stabilize = TRUE)
spe_unweighted <- nnSVG(spe, assay_name = "logcounts")
spe_weighted <- weighted_nnSVG(input = spe, w = weights)

Visualize effect of weighting

library(ggplot2)
library(SpatialExperiment)
library(patchwork)
library(GGally)
library(dplyr)
library(ggridges)

#overlay unweighted and weighted ridge plots
df_unw <- data.frame(
  rank = rowData(spe_unweighted)$rank,
  mean = rowData(spe_unweighted)$mean,
  method = rep("unw", 300) 
) %>% mutate(quantile = findInterval(mean, 
                quantile(mean, probs=0:9/10))) %>%
  tibble::rownames_to_column()

df_w <- data.frame(
  rank = rowData(spe_weighted)$weighted_rank,
  mean = rowData(spe_weighted)$weighted_mean,
  method = rep("w", 300) 
) %>% mutate(quantile = findInterval(mean, 
                quantile(mean, probs=0:9/10))) %>%
  tibble::rownames_to_column()

df <- rbind(df_unw, df_w) %>% 
  mutate(quantile = as.factor(quantile))

ridge_overlay <- ggplot(df, aes(x = rank, y = quantile)) +
  geom_density_ridges2(aes(fill = method), rel_min_height = 0.02, alpha = 0.3, scale = 1.3) +
  theme_ridges(grid = TRUE) +
  labs(
    y = "decile - unw & w mean of logcounts",
    x = "rank",
    title = "Ridge plots: effect of weighting on rank"
  ) +
  scale_fill_manual(labels = c("weighted", "unweighted"), values = c("blue", "red")) +
  coord_cartesian(xlim = c(1, 300)) +
  theme_bw()

#ridge plots separated by noise and signal for unweighted and weighted
frac <- round(dim(spe_unweighted)[1]*0.1)*0.1

df_unw_signal <- df_unw %>%
  mutate(quantile = as.factor(quantile)) %>%
  group_by(quantile) %>%
  slice_min(order_by = rank, n = frac) %>%
  mutate(grp = "signal")

indices <- as.integer(df_unw_signal$rowname)

df_unw_background <- df_unw[-indices,] %>%
  mutate(quantile = as.factor(quantile)) %>%
  mutate(grp = "background")

df <- rbind(df_unw_signal, df_unw_background)

rank_separated_unw <- ggplot(df, aes(x = rank, y = quantile)) +
  geom_density_ridges2(aes(fill = grp), rel_min_height = 0.02, alpha = 0.3) +
  theme_ridges(grid = TRUE) +
  labs(
    y = "decile - unw mean of logcounts",
    x = "rank",
    title = "Signal unweighted"
  ) +
  guides(fill=guide_legend(title="group")) +
  coord_cartesian(xlim = c(1, 300)) +
  theme_bw() 
  
df_w_signal <- df_w %>%
  mutate(quantile = as.factor(quantile)) %>%
  group_by(quantile) %>%
  slice_min(order_by = rank, n = frac) %>%
  mutate(grp = "signal")

indices <- as.integer(df_w_signal$rowname)

df_w_background <- df_w[-indices,] %>%
  mutate(quantile = as.factor(quantile)) %>%
  mutate(grp = "background")

df <- rbind(df_w_signal, df_w_background)

rank_separated_w <- ggplot(df, aes(x = rank, y = quantile)) +
  geom_density_ridges2(aes(fill = grp), rel_min_height = 0.02, alpha = 0.3) +
  theme_ridges(grid = TRUE) +
  labs(
    y = "decile - w mean of logcounts",
    x = "rank",
    title = "Signal weighted"
  )  +
  guides(fill=guide_legend(title="group")) +
  coord_cartesian(xlim = c(1, 300)) +
  theme_bw() 

ridge_signal <- wrap_plots(rank_separated_unw, rank_separated_w, nrow=1, guides = "collect") 

These code chunks have been evaluated and lead to the figure below with both ridge plots. All the simulated genes are spatially varying genes, but to different degrees, and the controlled mean and spatial variance parameters are not correlated. After simulating the data, weights were generated and constrained. To rank SVGs, nnSVG was run on both the unweighted logcounts and on the weighted logcounts matrices. This figure shows the comparison between unweighted nnSVG ranks and weighted nnSVG ranks within deciles based on mean expression. The distribution of ranks moves toward zero for lower mean expression deciles after the weighting is applied, indicating that SVGs are able to be found even when they have low expression (subfigure A). Each decile has spatially variable genes that should be highly ranked, and the weighted method is able to recover these in the lowly expressed deciles. In order to separate background noise from true signal, the densities of the top three ranks from each decile are plotted separately from the remaining genes in each decile (subfigure B). The weighted method is able to find highly ranked SVGs even in lower deciles, showing that spoon addresses the mean-variance relationship.