plaid 0.99.19
PLAID (Pathway Level Average Intensity Detection) is a novel, ultrafast and memory optimized gene set enrichment scoring algorithm. PLAID demonstrates accurate gene set scoring and outperforms all currently available gene set scoring methods in large bulk and single-cell RNA-seq datasets.
In recent years, computational methods have emerged that calculate enrichment of gene signatures within individual samples, rather than across pooled samples. These signatures offer critical insights into the coordinated activity of functionally related genes, proteins or metabolites, enabling identification of unique molecular profiles based on gene set and pathway activity in individual cells and patients. This strategy is pivotal for patient stratification and advancement of personalized medicine. However, the rise of large-scale datasets, including single-cell profiles and population biobanks, has exposed significant computational inefficiencies in existing methods. Current methods often demand excessive runtime and memory resources, becoming impractical for large datasets. Overcoming these limitations is a focus of current efforts by bioinformatics teams in academia and the pharmaceutical industry, as essential to support basic and clinical biomedical research.
For this vignette, our package includes a small subset of the the pmbc3k single-cell dataset of just 50 cells. Please install the Seurat and SeuratData packages if you want to run this vignette against the full dataset.
library("plaid")
load(system.file("extdata", "pbmc3k-50cells.rda", package = "plaid"),verbose=TRUE)
#> Loading objects:
#> X
#> celltype
dim(X)
#> [1] 7728 50
Note that X is the normalized expression matrix from the Seurat object, not the raw counts matrix. We recommend to run plaid on the log transformed expression matrix, not on the counts, as the average in the logarithmic space is more robust and is in concordance to calculating the geometric mean.
It is not necessary to normalize your expression matrix before running plaid because plaid normalizes the enrichment scores afterwards. However, again, log transformation is recommended.
It is recommended to keep the expression matrix sparse as much as possible because plaid extensively take advantage of sparse matrix computations. But even for dense matrices plaid is fast.
For convenience we have included the 50 Hallmark genesets in our package. But we encourage you to download larger geneset collections as plaid’s speed advantage will be more apparent for larger datasets and large geneset collections.
Plaid needs the gene sets as sparse matrix. If you have your collection of gene sets a a list, we need first to convert the gmt list to matrix format.
hallmarks <- system.file("extdata", "hallmarks.gmt", package = "plaid")
gmt <- read.gmt(hallmarks)
matG <- gmt2mat(gmt)
dim(matG)
#> [1] 4386 50
If you have your own gene sets stored as gmt files, you can
conveniently use the included read.gmt() function to read the gmt
file.
The main function to run plaid is plaid(). We run plaid on our
expression matrix X and gene set matrix matG.
gsetX <- plaid(X, matG, normalize=TRUE)
dim(gsetX)
#> [1] 50 50
The resulting matrix gsetX contains the single-sample enrichment
scores for the specified gene sets and samples.
Notice that by default plaid performs median normalization of the final results. That also means that it is not necessary to normalize your expression matrix before running plaid. However, generally, log transformation is recommended.
Plaid can also be run on the ranked matrix, we will see later that this corresponds to the singscore (Fouratan et al., 2018). Or plaid could be run on the (non-logarithmic) counts which can be used to calculate the scSE score (Pont et al., 2019).
Plaid is fast and memory efficient because it uses very efficient
sparse matrix computation in the back. For very large X, plaid uses
chunked computation by splitting the matrix in chunks to avoid index
overflow. Should you encounter errors, please compute your dataset by
subsetting manually the expression matrix and/or gene sets.
Although X and matG are generally very sparse, be aware that the
result matrix gsetX generally is dense and therefore can become very
large. If you would want to compute the score of 10.000 gene sets on a
million of cells this would create a large 10.000 x 1.000.000 dense
matrix which requires about 75GB of memory.
Once we have the gene sets scores we can use these scores for
statistical analysis. We could compute the differential gene set
expression between two groups using a general t-test or limma directly
on the score matrix gsetX.
Another way to test whether a gene set is statistically significant
would be to test whether the fold-change of the genes in the gene sets
are statistically different than zero. That is, we can perform a one
sample z-test on the logFC of the genes of each gene sets and test
whether they are significantly different from zero. The logFC is
computed from the original (log) expression matrix X and group
vector y.
The function dualGSEA() does both tests: the one-sample z-test on
the logFC and the two-group t-test on the gene set matrix gsetX.
Dual testing has been suggested by Bull et al. (Sci Rep., 2024)
y <- 1*(celltype == "B")
res <- dualGSEA(X, y, G=matG)
#> FC testing using rankcor
#> single-sample testing using plaid
The top significant genesets can be shown with
res <- res[order(res[,"p.dual"]),]
head(res)
#> gsetFC size p.fc p.ss
#> HALLMARK_INTERFERON_GAMMA_RESPONSE 0.09923298 155 0.01233840 1.067381e-05
#> HALLMARK_P53_PATHWAY -0.04702449 130 0.01671060 2.900893e-04
#> HALLMARK_ALLOGRAFT_REJECTION 0.08322962 134 0.29318375 1.311508e-05
#> HALLMARK_INTERFERON_ALPHA_RESPONSE 0.09705679 78 0.01858927 9.406415e-03
#> HALLMARK_PEROXISOME 0.04304130 57 0.02253562 3.818109e-02
#> HALLMARK_HYPOXIA -0.03466136 96 0.02099163 5.356212e-02
#> p.dual q.dual
#> HALLMARK_INTERFERON_GAMMA_RESPONSE 2.175353e-06 0.0001087676
#> HALLMARK_P53_PATHWAY 4.126022e-05 0.0010315054
#> HALLMARK_ALLOGRAFT_REJECTION 3.934233e-04 0.0065570555
#> HALLMARK_INTERFERON_ALPHA_RESPONSE 8.603945e-04 0.0107549311
#> HALLMARK_PEROXISOME 3.790696e-03 0.0379069558
#> HALLMARK_HYPOXIA 4.977878e-03 0.0414823173
The column gsetFC corresponds to the difference in gene set score
and also corresponds to the average foldchange of the genes in the
gene set. The column ‘p.fc corresponds to the test on the preranked
logFC, the column ’p.ss’ corresponds to the two-group t-test on the
geneset scores gsetX. The two p-values are then combined using
Stouffer’s method in the column ‘p.dual’ and adjusted for multiple
testing in column q.dual.
The left figure below plots the fold-change enrichment p.fc vs.
single-sample enrichment p.ss. The right figure shows the volcano plot
p.dual vs. gsetFC:
fc <- res[,"gsetFC"]
pv <- res[,"p.dual"]
p1 <- res[,"p.fc"]
p2 <- res[,"p.ss"]
ii <- head(order(pv))
par(mfrow=c(1,2))
plot( -log10(p1), -log10(p2),
xlab="FC enrichment (-log10p)",
ylab="single-sample enrichment (-log10p)", pch=19)
text( -log10(p1[ii]), -log10(p2[ii]), rownames(res)[ii],pos=2)
plot( fc, -log10(pv), xlab="gsetFC", ylab="-log10p", pch=19)
abline(h=0, v=0, lty=2)
text( fc[ii], -log10(pv[ii]), rownames(res)[ii],pos=2)
Plaid can be used to replicate other enrichment score such as ssGSEA, GSVA, AUCell, Singscore, scSE and UCell. But using plaid, the computation is much faster than the original code.
Computing the singscore requires to compute the ranks of the expression matrix. We have wrapped this in a single convenience function:
sing <- replaid.sing(X, matG)
We have extensively compared the results of replaid.sing and from
the original singscore R package and we showed identical result in
the score, logFC and p-values.
Plaid can also be used to compute the ssGSEA score (Barbie et al., 2009). Using plaid, we can calculate the score upto 100x faster. We have wrapped this in a single convenience function:
ssgsea <- replaid.ssgsea(X, matG, alpha=0)
We have extensively compared the results of replaid.ssgsea() and
from the original GSVA R
package. Note the rank weight parameter alpha. For alpha=0 we
obtained identical result for the score, logFC and p-values. For
non-zero values for alpha the results are close but not exactly the
same. The default value in the original publication and in GSVA is
alpha=0.25.
Single-cell Signature Explorer (scSE) is a fast enrichment algorithm implemented in GO. Computing the scSE score requires running plaid on the linear (not logarithmic) score and perform additional normalization by the total UMI per sample. We have wrapped this in a single convenience function:
scse <- replaid.scse(X, matG, removeLog2=TRUE, scoreMean=FALSE)
#> [replaid.scse] Converting data to linear scale (removing log2)...
To replicate the original “sum-of-UMI” scSE score, set
removeLog2=TRUE and scoreMean=FALSE. scSE and plaid scores become
more similar for removeLog2=FALSE and scoreMean=TRUE.
We have extensively compared the results from replaid.scse and from
the original scSE (implemented in GO lang) and we showed almost
identical results in the score, logFC and p-values.
We can compare all scores in a pairs plot:
S <- cbind(plaid=gsetX[,1], sing=sing[,1], ssgsea=ssgsea[,1], scSE=scse[,1])
pairs(S)
sessionInfo()
#> R Under development (unstable) (2025-12-22 r89219)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] plaid_0.99.19 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.51.1 SummarizedExperiment_1.41.0
#> [3] xfun_0.55 bslib_0.9.0
#> [5] sparsesvd_0.2-3 qlcMatrix_0.9.9
#> [7] Biobase_2.71.0 lattice_0.22-7
#> [9] vctrs_0.6.5 tools_4.6.0
#> [11] generics_0.1.4 parallel_4.6.0
#> [13] stats4_4.6.0 tibble_3.3.1
#> [15] AnnotationDbi_1.73.0 RSQLite_2.4.5
#> [17] blob_1.2.4 pkgconfig_2.0.3
#> [19] Matrix_1.7-4 sparseMatrixStats_1.23.0
#> [21] S4Vectors_0.49.0 RcppParallel_5.1.11-1
#> [23] lifecycle_1.0.5 compiler_4.6.0
#> [25] Biostrings_2.79.4 tinytex_0.58
#> [27] Seqinfo_1.1.0 htmltools_0.5.9
#> [29] sass_0.4.10 yaml_2.3.12
#> [31] BiocSet_1.25.0 pillar_1.11.1
#> [33] crayon_1.5.3 jquerylib_0.1.4
#> [35] tidyr_1.3.2 DelayedArray_0.37.0
#> [37] cachem_1.1.0 magick_2.9.0
#> [39] abind_1.4-8 tidyselect_1.2.1
#> [41] digest_0.6.39 slam_0.1-55
#> [43] dplyr_1.1.4 purrr_1.2.1
#> [45] bookdown_0.46 fastmap_1.2.0
#> [47] grid_4.6.0 SparseArray_1.11.10
#> [49] cli_3.6.5 magrittr_2.0.4
#> [51] S4Arrays_1.11.1 Rfast_2.1.5.2
#> [53] bit64_4.6.0-1 rmarkdown_2.30
#> [55] XVector_0.51.0 httr_1.4.7
#> [57] matrixStats_1.5.0 bit_4.6.0
#> [59] otel_0.2.0 png_0.1-8
#> [61] memoise_2.0.1 evaluate_1.0.5
#> [63] knitr_1.51 GenomicRanges_1.63.1
#> [65] IRanges_2.45.0 BiocIO_1.21.0
#> [67] rlang_1.1.7 zigg_0.0.2
#> [69] docopt_0.7.2 Rcpp_1.1.1
#> [71] ontologyIndex_2.12 glue_1.8.0
#> [73] DBI_1.2.3 BiocManager_1.30.27
#> [75] BiocGenerics_0.57.0 jsonlite_2.0.0
#> [77] R6_2.6.1 plyr_1.8.9
#> [79] MatrixGenerics_1.23.0