To illustrate the multi-level integration algorithm, we will use in this vignette two samples from the MouseGastrulationData
R
package: 1 and 3. The samples correspond to different pool and embryonic stages:
1: pool index 1 and embryonic stage E6.5 with 360 cells
3: pool index 3 and embryonic stage E7.5 with 458 cells
The samples were imported below with the function EmbryoAtlasData()
from MouseGastrulationData
package as a SingleCellExperiment
object, the object (class) required by most functions in Coralysis
(see Chapter 4 The SingleCellExperiment class - OSCA manual). The SCE
object provided comprises counts
(raw count data), cell colData
(which includes batch and cell labels, designated as pool
and celltype
, respectively), among other data layers.
Run the code below to import the R
packages and data required to reproduce this vignette.
# Packages
library("scran")
library("ggplot2")
library("Coralysis")
library("SingleCellExperiment")
library("MouseGastrulationData")
# Import data
sce <- EmbryoAtlasData(samples = c(1, 3))
Coralysis
requires log-normalised data as input. Run the code below to apply the basic log-normalisation method implemented in Seurat
(see NormalizeData).
## Normalize the data
# log1p normalization
SeuratNormalisation <- function(object) {
# 'SeuratNormalisation()' function applies the basic Seurat normalization to
# a SingleCellExperiment object with a 'counts' assay. Normalized data
# is saved in the 'logcounts' assay.
logcounts(object) <- apply(counts(object), 2, function(x) {
log1p(x / sum(x) * 10000)
}) # log1p normalization w/ 10K scaling factor
logcounts(object) <- as(logcounts(object), "sparseMatrix")
return(object)
}
sce <- SeuratNormalisation(object = sce)
# Remove 'counts' to reduce data size object
counts(sce) <- NULL
In alternative, scran
normalization can be performed, which is particularly beneficial if rare cell types exist (see the following vignette). In addition, have a look into Chapter 7 Normalization - OSCA manual.
Highly variable genes (HVG) can be selected using the R package scran
. The variable pool
from colData
is used as batch label. The SingleCellExperiment
object allows alternative experiments to the main experiment. This is important to keep a backup of all genes in the same SingleCellExperiment
object before selecting HVGs (see SingleCellExperiment vignette).
# Feature selection with 'scran' package
nhvg <- 500
batch.label <- "pool"
sce[[batch.label]] <- factor(sce[[batch.label]])
m.hvg <- scran::modelGeneVar(sce, block = sce[[batch.label]])
top.hvg <- getTopHVGs(m.hvg, n = nhvg)
# Subset object
sce <- sce[top.hvg, ]
# Rename genes: 'Ensembl ID' to SYMBOL
row.names(sce) <- rowData(sce)$SYMBOL
rowData(sce) <- NULL
The batch effect between assays can be inspected below by projecting the data onto t-distributed Stochastic Neighbor Embedding (t-SNE). This can be achieved by running sequentially the Coralysis
functions RunPCA
and RunTSNE
. Provide a seed before running each one of these functions to ensure reproducibility. The function RunPCA
runs by default the PCA method implemented in the R
package irlba
(pca.method="irlba"
), which requires a seed to ensure the same PCA result. In addition, the assay.name
argument needs to be provided, otherwise uses by default the Coralysis probabilities which are obtained only after integration (after running RunParallelDivisiveICP
). The assay logcounts
, corresponding to the log-normalized data, and number of principal components to use p
were provided. Any categorical variable available in colData(sce)
, such as pool
or celltype
, can be visualized in a low dimensional embedding stored in reducedDimNames(sce)
with the Coralysis
function PlotDimRed
.
# Compute PCA & TSNE
set.seed(123)
sce <- RunPCA(
object = sce,
assay.name = "logcounts",
p = 30, dimred.name = "unintPCA"
)
set.seed(123)
sce <- RunTSNE(sce,
dimred.type = "unintPCA",
dimred.name = "unintTSNE"
)
# Plot TSNE highlighting the batch & cell type
unint.batch.plot <- PlotDimRed(
object = sce,
color.by = "pool",
dimred = "unintTSNE",
point.size = 0.01,
legend.nrow = 1,
seed.color = 1024
)
unint.cell.plot <- PlotDimRed(
object = sce,
color.by = "celltype",
dimred = "unintTSNE",
point.size = 0.01,
legend.nrow = 14,
seed.color = 7
)
cowplot::plot_grid(unint.batch.plot, unint.cell.plot, ncol = 2, align = "vh")
Integrate assays with the multi-level integration algorithm implemented in Coralysis
by running the function RunParallelDivisiveICP
. The only arguments required by this function are object
and batch.label
. The object
requires a SingleCellExperiment
object with the assay logcounts
. The matrix in logcounts
should be sparse, i.e., is(logcounts(sce), "dgCMatrix")
is TRUE
, and it should not contain non-expressing genes. This is ensured by running PrepareData
before. The batch.label
argument requires a label column name in colData(sce)
corresponding to the batch label that should be used for integration. In the absence of a batch, the same function, RunParallelDivisiveICP
, can be run without providing batch.label
(i.e., batch.label = NULL
), in which case the data will be modeled through the algorithm to identify fine-grained populations that do not required batch correction. An higher number of threads
can be provided to speed up computing time depending on the number of cores available. For this example, the algorithm was run 10 times (L = 10
), but generally, this number should be higher (with the default being L = 50
). A train set can be built to improve the modeling step in Coralysis
. In this case, the option was set to FALSE
(build.train.set = FALSE
) as it is faster and requires less memory. Otherwise, using this option is recommended as it can make the modeling step faster and consume less memory for larger datasets.
# Prepare data for integration:
# remove non-expressing genes & logcounts is from `dgCMatrix` class
sce <- PrepareData(object = sce)
## Data in `logcounts` slot already of `dgCMatrix` class...
## 500/500 features remain after filtering features with only zero values.
# Perform integration with Coralysis
set.seed(1024)
sce <- RunParallelDivisiveICP(
object = sce,
batch.label = "pool",
build.train.set = FALSE,
L = 10, threads = 2
)
##
## Computing cluster seed.
##
## Initializing divisive ICP clustering...
##
|
| | 0%
|
|======= | 10%
|
|============== | 20%
|
|===================== | 30%
|
|============================ | 40%
|
|=================================== | 50%
|
|========================================== | 60%
|
|================================================= | 70%
|
|======================================================== | 80%
|
|=============================================================== | 90%
|
|======================================================================| 100%
##
## Divisive ICP clustering completed successfully.
##
## Predicting cell cluster probabilities using ICP models...
## Prediction of cell cluster probabilities completed successfully.
##
## Multi-level integration completed successfully.
The integration result can be visually inspected by running sequentially the functions RunPCA
and RunTSNE
. The assay.name
provided to RunPCA
must be joint.probability
(the default), the primary output of integration with Coralysis
. The probability matrices from Coralysis
(i.e., joint.probability
) can be used to obtain an integrated embedding by running RunPCA(..., assay.name = "joint.probability")
. This integrated PCA can, in turn, be used downstream for clustering or non-linear dimensional reduction techniques, such as RunTSNE
. Below, the integrated PCA was named intPCA
.
# Compute PCA with joint cluster probabilities & TSNE
set.seed(123)
sce <- RunPCA(sce,
assay.name = "joint.probability",
dimred.name = "intPCA"
)
## Divisive ICP: selecting ICP tables multiple of 4
set.seed(123)
sce <- RunTSNE(sce,
dimred.type = "intPCA",
dimred.name = "intTSNE"
)
# Plot TSNE highlighting the batch & cell type
int.batch.plot <- PlotDimRed(
object = sce,
color.by = "pool",
dimred = "intTSNE",
point.size = 0.01,
legend.nrow = 1,
seed.color = 1024
)
int.cell.plot <- PlotDimRed(
object = sce,
color.by = "celltype",
dimred = "intTSNE",
point.size = 0.01,
legend.nrow = 14,
seed.color = 7
)
cowplot::plot_grid(int.batch.plot, int.cell.plot,
ncol = 2, align = "vh"
)
Run graph-based clustering with the scran
function clusterCells
(see Chapter 5 Clustering - OSCA manual).
# Graph-based clustering on the integrated PCA w/ 'scran' package
blusparams <- bluster::SNNGraphParam(k = 15, cluster.fun = "louvain")
set.seed(123)
sce$cluster <- scran::clusterCells(sce,
use.dimred = "intPCA",
BLUSPARAM = blusparams
)
# Plot clustering
clt.plot <- PlotDimRed(
object = sce,
color.by = "cluster",
dimred = "intTSNE",
point.size = 0.01,
legend.nrow = 3,
seed.color = 65
)
cowplot::plot_grid(int.batch.plot, int.cell.plot,
clt.plot,
ncol = 3, align = "h"
)
Identify the cluster markers by running the Coralysis
function FindAllClusterMarkers
. Provide the clustering.label
, in this case, the label used above, i.e., cluster
. The top three positive markers per cluster were retrieved and plotted below using the Coralysis
function HeatmapFeatures
.
# Cluster markers
cluster.markers <- FindAllClusterMarkers(object = sce, clustering.label = "cluster")
## -----------------------------------
## testing cluster 1
## 496 features left after min.pct filtering
## 496 features left after min.diff.pct filtering
## 191 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 2
## 497 features left after min.pct filtering
## 497 features left after min.diff.pct filtering
## 183 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 3
## 499 features left after min.pct filtering
## 499 features left after min.diff.pct filtering
## 189 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 4
## 498 features left after min.pct filtering
## 498 features left after min.diff.pct filtering
## 263 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 5
## 498 features left after min.pct filtering
## 498 features left after min.diff.pct filtering
## 277 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 6
## 498 features left after min.pct filtering
## 498 features left after min.diff.pct filtering
## 399 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 7
## 499 features left after min.pct filtering
## 499 features left after min.diff.pct filtering
## 226 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 8
## 497 features left after min.pct filtering
## 497 features left after min.diff.pct filtering
## 210 features left after log2fc.threshold filtering
## -----------------------------------
# Select the top 3 positive markers per cluster
top3.markers <- lapply(X = split(x = cluster.markers, f = cluster.markers$cluster), FUN = function(x) {
head(x[order(x$log2FC, decreasing = TRUE), ], n = 3)
})
top3.markers <- do.call(rbind, top3.markers)
top3.markers <- top3.markers[order(as.numeric(top3.markers$cluster)), ]
# Heatmap of the top 3 positive markers per cluster
HeatmapFeatures(
object = sce,
clustering.label = "cluster",
features = top3.markers$marker,
seed.color = 65
)
Differential gene expression (DGE) between cluster 3 and 6 corresponding roughly to Visceral endoderm
and ExE endoderm
, respectively. The genes Ttr, Rbp4, Apoa1, Apom and Ctsl are upregulated in cluster 6, while Tmsb10 is upregulated in cluster 3.
# DGE analysis: cluster 3 vs 6
dge.clt3vs6 <- FindClusterMarkers(sce,
clustering.label = "cluster",
clusters.1 = "3",
clusters.2 = "6"
)
## testing cluster group.1
## 473 features left after min.pct filtering
## 473 features left after min.diff.pct filtering
## 356 features left after log2fc.threshold filtering
head(dge.clt3vs6[order(abs(dge.clt3vs6$log2FC), decreasing = TRUE), ])
## p.value adj.p.value log2FC pct.1 pct.2 diff.pct marker
## Ttr 1.312605e-26 6.563025e-24 -3.066542 0.5500000 1.0000000 0.4500000 Ttr
## Apoa1 2.699740e-25 1.349870e-22 -2.963640 0.5833333 1.0000000 0.4166667 Apoa1
## Rbp4 7.497694e-27 3.748847e-24 -2.814879 0.5333333 1.0000000 0.4666667 Rbp4
## Apom 8.502979e-27 4.251489e-24 -2.400533 0.5666667 1.0000000 0.4333333 Apom
## Tmsb10 9.995506e-26 4.997753e-23 2.137930 1.0000000 0.6315789 0.3684211 Tmsb10
## Ctsl 4.817850e-23 2.408925e-20 -2.052262 0.8666667 1.0000000 0.1333333 Ctsl
top6.degs <- head(dge.clt3vs6[order(abs(dge.clt3vs6$log2FC),
decreasing = TRUE
), "marker"])
exp.plots <- lapply(X = top6.degs, FUN = function(x) {
PlotExpression(
object = sce, color.by = x,
scale.values = TRUE, point.size = 0.5, point.stroke = 0.5
)
})
cowplot::plot_grid(plotlist = exp.plots, align = "vh", ncol = 3)
# R session
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] MouseGastrulationData_1.23.0 SpatialExperiment_1.19.0
## [3] scran_1.37.0 ensembldb_2.33.0
## [5] AnnotationFilter_1.33.0 GenomicFeatures_1.61.0
## [7] AnnotationDbi_1.71.0 ComplexHeatmap_2.25.0
## [9] Coralysis_0.99.6 scRNAseq_2.23.0
## [11] scater_1.37.0 ggplot2_3.5.2
## [13] scuttle_1.19.0 SingleCellExperiment_1.31.0
## [15] SummarizedExperiment_1.39.0 Biobase_2.69.0
## [17] GenomicRanges_1.61.0 GenomeInfoDb_1.45.3
## [19] IRanges_2.43.0 S4Vectors_0.47.0
## [21] BiocGenerics_0.55.0 generics_0.1.3
## [23] MatrixGenerics_1.21.0 matrixStats_1.5.0
## [25] dplyr_1.1.4 BiocStyle_2.37.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 BiocIO_1.19.0 bitops_1.0-9
## [4] filelock_1.0.3 tibble_3.2.1 XML_3.99-0.18
## [7] lifecycle_1.0.4 httr2_1.1.2 aricode_1.0.3
## [10] edgeR_4.7.2 doParallel_1.0.17 lattice_0.22-7
## [13] alabaster.base_1.9.0 magrittr_2.0.3 limma_3.65.0
## [16] sass_0.4.10 rmarkdown_2.29 jquerylib_0.1.4
## [19] yaml_2.3.10 metapod_1.17.0 askpass_1.2.1
## [22] reticulate_1.42.0 cowplot_1.1.3 LiblineaR_2.10-24
## [25] DBI_1.2.3 RColorBrewer_1.1-3 abind_1.4-8
## [28] Rtsne_0.17 purrr_1.0.4 BumpyMatrix_1.17.0
## [31] RCurl_1.98-1.17 rappdirs_0.3.3 circlize_0.4.16
## [34] ggrepel_0.9.6 irlba_2.3.5.1 alabaster.sce_1.9.0
## [37] pheatmap_1.0.12 umap_0.2.10.0 RSpectra_0.16-2
## [40] dqrng_0.4.1 codetools_0.2-20 DelayedArray_0.35.1
## [43] shape_1.4.6.1 tidyselect_1.2.1 UCSC.utils_1.5.0
## [46] farver_2.1.2 ScaledMatrix_1.17.0 viridis_0.6.5
## [49] BiocFileCache_2.99.0 GenomicAlignments_1.45.0 jsonlite_2.0.0
## [52] GetoptLong_1.0.5 BiocNeighbors_2.3.0 iterators_1.0.14
## [55] foreach_1.5.2 tools_4.5.0 Rcpp_1.0.14
## [58] glue_1.8.0 gridExtra_2.3 SparseArray_1.9.0
## [61] xfun_0.52 flexclust_1.5.0 HDF5Array_1.37.0
## [64] gypsum_1.5.0 withr_3.0.2 BiocManager_1.30.25
## [67] fastmap_1.2.0 rhdf5filters_1.21.0 bluster_1.19.0
## [70] openssl_2.3.2 SparseM_1.84-2 digest_0.6.37
## [73] rsvd_1.0.5 R6_2.6.1 colorspace_2.1-1
## [76] Cairo_1.6-2 dichromat_2.0-0.1 RSQLite_2.3.11
## [79] h5mread_1.1.0 rtracklayer_1.69.0 class_7.3-23
## [82] httr_1.4.7 S4Arrays_1.9.0 uwot_0.2.3
## [85] pkgconfig_2.0.3 gtable_0.3.6 modeltools_0.2-24
## [88] blob_1.2.4 XVector_0.49.0 htmltools_0.5.8.1
## [91] bookdown_0.43 clue_0.3-66 ProtGenerics_1.41.0
## [94] scales_1.4.0 alabaster.matrix_1.9.0 png_0.1-8
## [97] knitr_1.50 reshape2_1.4.4 rjson_0.2.23
## [100] curl_6.2.2 GlobalOptions_0.1.2 cachem_1.1.0
## [103] rhdf5_2.53.0 stringr_1.5.1 BiocVersion_3.22.0
## [106] parallel_4.5.0 vipor_0.4.7 ggrastr_1.0.2
## [109] restfulr_0.0.15 pillar_1.10.2 alabaster.schemas_1.9.0
## [112] vctrs_0.6.5 RANN_2.6.2 BiocSingular_1.25.0
## [115] dbplyr_2.5.0 beachmat_2.25.0 cluster_2.1.8.1
## [118] beeswarm_0.4.0 evaluate_1.0.3 magick_2.8.6
## [121] tinytex_0.57 cli_3.6.5 locfit_1.5-9.12
## [124] compiler_4.5.0 Rsamtools_2.25.0 rlang_1.1.6
## [127] crayon_1.5.3 labeling_0.4.3 plyr_1.8.9
## [130] ggbeeswarm_0.7.2 stringi_1.8.7 viridisLite_0.4.2
## [133] alabaster.se_1.9.0 BiocParallel_1.43.0 Biostrings_2.77.0
## [136] lazyeval_0.2.2 Matrix_1.7-3 ExperimentHub_2.99.0
## [139] sparseMatrixStats_1.21.0 bit64_4.6.0-1 Rhdf5lib_1.31.0
## [142] KEGGREST_1.49.0 statmod_1.5.0 alabaster.ranges_1.9.0
## [145] AnnotationHub_3.99.1 igraph_2.1.4 memoise_2.0.1
## [148] bslib_0.9.0 bit_4.6.0
Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. https://www.nature.com/articles/s41592-019-0654-x.
Griffiths J, Lun A (2024). “MouseGastrulationData: Single-Cell-omics Data across Mouse Gastrulation and Early Organogenesis”. 10.18129/B9.bioc.MouseGastrulationData. R package version 1.18.0.
Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res., 5, 2122. doi:10.12688/f1000research.9501.2.
Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023
Wickham H (2016). “ggplot2: Elegant Graphics for Data Analysis.” Springer-Verlag New York.