License: Artistic-2.0
GSVA provides now specific support for single-cell data in the
algorithm that runs through the gsvaParam() parameter
constructor, and originally described in the publication by Hänzelmann, Castelo, and Guinney (2013). At the
moment, this specific support consists of the following features:
dgCMatrix, SVT_SparseArray, and
DelayedMatrix. The currently available container for
single-cell data that allows one to input additional row and column
metadata is a SingleCellExperiment object.matrix or a dense DelayedMatrix object. The
latter will be particularly used when the total number of values exceeds
2^31, which is the largest 32-bit standard integer value in R.sparse=FALSE in the call to
gsvaParam(), the classical GSVA algorithm will be used,
which for a typical single-cell data set will result in longer running
times and larger memory consumption than running it in the default
sparse regime for this type of data.In what follows, we will illustrate the use of GSVA on a publicly available single-cell transcriptomics data set of peripheral blood mononuclear cells (PBMCs) published by Zheng et al. (2017).
We import the PBMC data using the TENxPBMCData
package, as a SingleCellExperiment object, defined in the
SingleCellExperiment
package.
library(SingleCellExperiment)
library(TENxPBMCData)
sce <- TENxPBMCData(dataset="pbmc4k")
sce
class: SingleCellExperiment
dim: 33694 4340
metadata(0):
assays(1): counts
rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
ENSG00000268674
rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
colnames: NULL
colData names(11): Sample Barcode ... Individual Date_published
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):Here, we perform a quality assessment and pre-processing steps using the package scuttle (McCarthy et al. 2017). We start identifying mitochondrial genes.
library(scuttle)
is_mito <- grepl("^MT-", rowData(sce)$Symbol_TENx)
table(is_mito)
is_mito
FALSE TRUE
33681 13 Calculate quality control (QC) metrics and filter out low-quality cells.
sce <- quickPerCellQC(sce, subsets=list(Mito=is_mito),
sub.fields="subsets_Mito_percent")
dim(sce)
[1] 33694 4147Figure @ref(fig:cntxgene) below shows the empirical cumulative distribution of counts per gene in logarithmic scale.
cntxgene <- rowSums(assays(sce)$counts)+1
plot.ecdf(cntxgene, xaxt="n", panel.first=grid(), xlab="UMI counts per gene",
log="x", main="", xlim=c(1, 1e5), las=1)
axis(1, at=10^(0:5), labels=10^(0:5))
abline(v=100, lwd=2, col="red")Empirical cumulative distribution of UMI counts per gene. The red vertical bar indicates a cutoff value of 100 UMI counts per gene across all cells, below which genes will be filtered out.
We filter out lowly-expressed genes, by selecting those with at least 100 UMI counts across all cells for downstream analysis.
Calculate library size factors and normalized units of expression in logarithmic scale.
Here, we illustrate how to annotate cell types in the PBMC data using GSVA.
First, we fetch a collection of 22 leukocyte gene set signatures,
containing a total 547 genes, which should help to distinguish among 22
mature human hematopoietic cell type populations isolated from
peripheral blood or in vitro culture conditions, including
seven T cell types: naïve and memory B cells, plasma cells, NK cell, and
myeloid subsets. These gene sets have been used in the benchmarking
publication by Diaz-Mejia et al. (2019),
and were originally compiled by the CIBERSORT developers, where
they called it the LM22 signature (Newman et al.
2015). The LM22 signature is stored in the GSVAdata
experiment data package as a compressed text file in GMT
format, which can be read into R using the readGMT()
function from the GSVA
package, and will return the gene sets into a
GeneSetCollection object, defined in the GSEABase
package.
library(GSEABase)
library(GSVA)
fname <- file.path(system.file("extdata", package="GSVAdata"),
"pbmc_cell_type_gene_set_signatures.gmt.gz")
gsets <- readGMT(fname)
gsets
GeneSetCollection
names: B_CELLS_MEMORY, B_CELLS_NAIVE, ..., T_CELLS_REGULATORY_TREGS (22 total)
unique identifiers: AIM2, BANK1, ..., SKAP1 (248 total)
types in collection:
geneIdType: SymbolIdentifier (1 total)
collectionType: NullCollection (1 total)Note that while gene identifers in the sce object
correspond to Ensembl
stable identifiers (ENSG...), the gene identifiers in
the gene sets are HGNC gene
symbols. This, in principle, precludes matching directly what gene in
the single-cell data object sce corresponds to what gene
set in the GeneSetCollection object gsets.
However, the GSVA package
can do that matching as long as the appropriate metadata is present in
both objects.
In the case of a GeneSetCollection object, its
geneIdType metadata slot stores the type of gene
identifier. In the case of a SingleCellExperiment object,
such as the previous sce object, such metadata is not
present. However, using the function gsvaAnnotation() from
the GSVA
package, and the helper function ENSEMBLIdentifier() from
the GSEABase
package, we add such metadata to the sce object as
follows.
We first build a parameter object using the function
gsvaParam(). By default, the expression values in the
logocounts assay will be selected for downstream
analysis.
gsvapar <- gsvaParam(sce, gsets)
gsvapar
A GSVA::gsvaParam object
expression data:
class: SingleCellExperiment
dim: 8823 4147
metadata(1): annotation
assays(2): counts logcounts
rownames(8823): ENSG00000279457 ENSG00000228463 ... ENSG00000198727
ENSG00000273748
rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
colnames: NULL
colData names(22): Sample Barcode ... discard sizeFactor
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
using assay: logcounts
using annotation:
geneIdType: ENSEMBL (org.Hs.eg.db)
gene sets:
GeneSetCollection
names: B_CELLS_MEMORY, B_CELLS_NAIVE, ..., T_CELLS_REGULATORY_TREGS (22 total)
unique identifiers: AIM2, BANK1, ..., SKAP1 (248 total)
types in collection:
geneIdType: SymbolIdentifier (1 total)
collectionType: NullCollection (1 total)
gene set size: [1, Inf]
kcdf: auto
kcdfNoneMinSampleSize: 200
tau: 1
maxDiff: TRUE
absRanking: FALSE
sparse: TRUE
checkNA: auto
missing data: didn't check
filterRows: TRUE
ondisk: auto
nonzero values: less than 2^31 (INT_MAX)While at this point, we could already run the entire GSVA algorithm
with a call to the gsva(gsvapar) function. We show here how
to do it in two steps. First we calculate GSVA rank values using the
function gsvaRanks().
gsvaranks <- gsvaRanks(gsvapar)
gsvaranks
A GSVA::gsvaRanksParam object
expression data:
class: SingleCellExperiment
dim: 8823 4147
metadata(1): annotation
assays(3): counts logcounts gsvaranks
rownames(8823): ENSG00000279457 ENSG00000228463 ... ENSG00000198727
ENSG00000273748
rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
colnames: NULL
colData names(22): Sample Barcode ... discard sizeFactor
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
using assay: gsvaranks
using annotation:
geneIdType: ENSEMBL (org.Hs.eg.db)
gene sets:
GeneSetCollection
names: B_CELLS_MEMORY, B_CELLS_NAIVE, ..., T_CELLS_REGULATORY_TREGS (22 total)
unique identifiers: AIM2, BANK1, ..., SKAP1 (248 total)
types in collection:
geneIdType: SymbolIdentifier (1 total)
collectionType: NullCollection (1 total)
gene set size: [1, Inf]
kcdf: auto
kcdfNoneMinSampleSize: 200
tau: 1
maxDiff: TRUE
absRanking: FALSE
sparse: TRUE
checkNA: auto
missing data: didn't check
filterRows: TRUE
ondisk: auto
nonzero values: less than 2^31 (INT_MAX)Second, we calculate the GSVA scores using the output of
gsvaRanks() as input to the function
gsvaScores(). By default, this function will calculate the
scores for all gene sets specified in the input parameter object.
es <- gsvaScores(gsvaranks)
es
class: SingleCellExperiment
dim: 22 4147
metadata(0):
assays(1): es
rownames(22): B_CELLS_MEMORY B_CELLS_NAIVE ... T_CELLS_GAMMA_DELTA
T_CELLS_REGULATORY_TREGS
rowData names(1): gs
colnames: NULL
colData names(22): Sample Barcode ... discard sizeFactor
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):However, we could calculate the scores for another collection of gene
sets by updating them in the gsvaranks object as
follows.
Following Amezquita et al. (2020), and
some of the steps described in “Chapter 5 Clustering” of the first
version of the OSCA
book, we use GSVA scores to build a nearest-neighbor graph of the
cells using the function buildSNNGraph() from the scran
package (Lun, McCarthy, and Marioni 2016).
The parameter k in the call to buildSNNGraph()
specifies the number of nearest neighbors to consider during graph
construction, and here we set k=20 because it leads to a
number of clusters close to the expected number of cell types.
Second, we use the function cluster_walktrap() from the
igraph
package (Csardi and Nepusz 2006), to
cluster cells by finding densely connected subgraphs. We store the
resulting vector of cluster indicator values into the sce
object using the function colLabels().
library(igraph)
colLabels(es) <- factor(cluster_walktrap(g)$membership)
table(colLabels(es))
1 2 3 4 5 6 7 8
495 601 502 525 972 191 345 516 Similarly to Diaz-Mejia et al. (2019),
we apply a simple cell type assignment algorithm, which consists of
selecting at each cell the gene set with highest GSVA score, tallying
the selected gene sets per cluster, and assigning to the cluster the
most frequent gene set, storing that assignment into the
sce object with the function colLabels().
## whmax <- apply(assay(es), 2, which.max)
whmax <- apply(assay(es), 2, function(x) which.max(as.vector(x)))
gsxlab <- split(rownames(es)[whmax], colLabels(es))
gsxlab <- names(sapply(sapply(gsxlab, table), which.max))
colLabels(es) <- factor(gsub("[0-9]\\.", "", gsxlab))[colLabels(es)]
table(colLabels(es))
B_CELLS_NAIVE EOSINOPHILS NK_CELLS_ACTIVATED NK_CELLS_RESTING
601 1027 495 191
T_CELLS_CD4_NAIVE T_CELLS_CD8
1488 345 We can visualize the cell type assignments by projecting cells dissimilarity in two dimensions with a principal components analysis (PCA) on the GSVA scores, and coloring cells using the previously assigned clusters.
library(RColorBrewer)
res <- prcomp(assay(es))
varexp <- res$sdev^2 / sum(res$sdev^2)
nclusters <- nlevels(colLabels(es))
hmcol <- colorRampPalette(brewer.pal(nclusters, "Set1"))(nclusters)
par(mar=c(4, 5, 1, 1))
plot(res$rotation[, 1], res$rotation[, 2], col=hmcol[colLabels(es)], pch=19,
xlab=sprintf("PCA 1 (%.0f%%)", varexp[1]*100),
ylab=sprintf("PCA 2 (%.0f%%)", varexp[2]*100),
las=1, cex.axis=1.2, cex.lab=1.5)
legend("topright", gsub("_", " ", levels(colLabels(es))), fill=hmcol, inset=0.01)Cell type assignments of PBMC scRNA-seq data, based on GSVA scores.
Finally, if we want to better understand why a specific cell type is
annotated to a given cell, we can use the gsvaEnrichment()
function, which will show a GSEA enrichment plot. This function takes as
input the output of gsvaRanks(), a given column (cell) in
the input singl-cell data, and a given gene set. In Figure
@ref(fig:gsvaenrichment) below, we show such a plot for the first cell
annotated to the eosinophil cell type.
firsteosinophilcell <- which(colLabels(es) == "EOSINOPHILS")[1]
par(mar=c(4, 5, 1, 1))
gsvaEnrichment(gsvaranks, column=firsteosinophilcell, geneSet="EOSINOPHILS",
cex.axis=1.2, cex.lab=1.5, plot="ggplot")GSVA enrichment plot of the EOSINOPHILS gene set in the expression profile of the first cell annotated to that cell type.
In the previous call to gsvaEnrichment() we used the
argument plot="ggplot" to produce a plot with the ggplot2 package.
By default, if we call gsvaEnrichment() interactively, it
will produce a plot using “base R”, but either when we do it
non-interactively, or when we set plot="no" it will return
a data.frame object with the enrichment data.
These are features that we are working on and we expect to have them implemented in the near future (e.g., next release):
DelayedArray backend, such as HDF5, is not yet
available.We are still benchmarking and testing this version of GSVA for single-cell data. If you encounter problems or have suggestions, do not hesitate to contact us by opening an issue in the GSVA GitHub repo.
Here is the output of sessionInfo() on the system on
which this document was compiled running pandoc 3.6.3:
sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] igraph_2.2.1 scran_1.39.0
[3] scuttle_1.21.0 TENxPBMCData_1.29.0
[5] HDF5Array_1.39.0 h5mread_1.3.1
[7] rhdf5_2.55.12 DelayedArray_0.37.0
[9] SparseArray_1.11.10 S4Arrays_1.11.1
[11] abind_1.4-8 Matrix_1.7-4
[13] SingleCellExperiment_1.33.0 sva_3.59.0
[15] BiocParallel_1.45.0 genefilter_1.93.0
[17] mgcv_1.9-4 nlme_3.1-168
[19] RColorBrewer_1.1-3 edgeR_4.9.2
[21] limma_3.67.0 GSVAdata_1.47.1
[23] SummarizedExperiment_1.41.0 GenomicRanges_1.63.1
[25] Seqinfo_1.1.0 MatrixGenerics_1.23.0
[27] matrixStats_1.5.0 GSEABase_1.73.0
[29] graph_1.89.1 annotate_1.89.0
[31] XML_3.99-0.20 org.Hs.eg.db_3.22.0
[33] AnnotationDbi_1.73.0 IRanges_2.45.0
[35] S4Vectors_0.49.0 Biobase_2.71.0
[37] BiocGenerics_0.57.0 generics_0.1.4
[39] GSVA_2.5.15 BiocStyle_2.39.0
loaded via a namespace (and not attached):
[1] DBI_1.2.3 httr2_1.2.2 rlang_1.1.7
[4] magrittr_2.0.4 otel_0.2.0 compiler_4.5.2
[7] RSQLite_2.4.5 png_0.1-8 vctrs_0.7.1
[10] pkgconfig_2.0.3 SpatialExperiment_1.21.0 crayon_1.5.3
[13] memuse_4.2-3 fastmap_1.2.0 dbplyr_2.5.1
[16] magick_2.9.0 XVector_0.51.0 labeling_0.4.3
[19] rmarkdown_2.30 purrr_1.2.1 bit_4.6.0
[22] bluster_1.21.0 xfun_0.56 cachem_1.1.0
[25] beachmat_2.27.2 jsonlite_2.0.0 blob_1.3.0
[28] rhdf5filters_1.23.3 Rhdf5lib_1.33.0 cluster_2.1.8.1
[31] irlba_2.3.7 parallel_4.5.2 R6_2.6.1
[34] bslib_0.10.0 jquerylib_0.1.4 Rcpp_1.1.1
[37] knitr_1.51 tidyselect_1.2.1 splines_4.5.2
[40] yaml_2.3.12 codetools_0.2-20 curl_7.0.0
[43] tibble_3.3.1 lattice_0.22-7 S7_0.2.1
[46] withr_3.0.2 KEGGREST_1.51.1 evaluate_1.0.5
[49] survival_3.8-6 BiocFileCache_3.1.0 ExperimentHub_3.1.0
[52] Biostrings_2.79.4 filelock_1.0.3 pillar_1.11.1
[55] BiocManager_1.30.27 ggplot2_4.0.1 BiocVersion_3.23.1
[58] scales_1.4.0 sparseMatrixStats_1.23.0 xtable_1.8-4
[61] glue_1.8.0 metapod_1.19.1 maketools_1.3.2
[64] tools_4.5.2 AnnotationHub_4.1.0 BiocNeighbors_2.5.3
[67] sys_3.4.3 ScaledMatrix_1.19.0 locfit_1.5-9.12
[70] buildtools_1.0.0 grid_4.5.2 BiocSingular_1.27.1
[73] cli_3.6.5 rsvd_1.0.5 rappdirs_0.3.4
[76] dplyr_1.1.4 gtable_0.3.6 sass_0.4.10
[79] digest_0.6.39 dqrng_0.4.1 farver_2.1.2
[82] rjson_0.2.23 memoise_2.0.1 htmltools_0.5.9
[85] lifecycle_1.0.5 httr_1.4.7 statmod_1.5.1
[88] bit64_4.6.0-1