Coralysis
is an R package for sensitive integration, reference-mapping and cell-state identification of single-cell data (Sousa et al., 2025). It relies on an adapted version of our previously introduced Iterative Clustering Projection (ICP) algorithm (Smolander et al., 2021) to identify shared cell clusters across heterogeneous datasets by leveraging multiple rounds of divisive clustering.
Inspired by the process of assembling a puzzle - where one begins by grouping pieces based on low-to high-level features, such as color and shading, before looking into shape and patterns - this multi-level integration algorithm progressively blends the batch effects while separating cell types across multiple runs of divisive clustering. The trained ICP models can then be used for various purposes, including prediction of cluster identities of related, unannotated single-cell datasets through reference-mapping, and inference of cell states and their differential expression programs using the cell cluster probabilities that represent the likelihood of each cell belonging to each cluster.
While state-of-the-art single-cell integration methods often struggle with imbalanced cell types across heterogeneous datasets, Coralysis
effectively differentiates similar yet unshared cell types across batches (Sousa et al., 2025).
Coralysis
can be installed from Bioconductor
by running the following commands:
# Installation
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Coralysis")
To quickly demonstrate the ability of Coralysis
’s multi-level integration algorithm to integrate data that is unevenly distributed across batches or completely absent from some datasets, we selected two brain datasets — zeisel (Zeisel et al., 2018) and romanov (Romanov et al., 2017). These datasets share only a few cell types, with some similar yet distinct cell types across datasets, such as interneurons in zeisel and neurons in romanov. This makes them an ideal example to highlight the strengths of Coralysis
.
The zeisel (Zeisel et al., 2018) and romanov datasets can be imported as SingleCellExperiment
objects via the scRNAseq
R
package using the functions ZeiselBrainData()
and RomanovBrainData()
. Run the R code below to load the packages Coralysis
, scRNAseq
and SingleCellExperiment
as well as the datasets zeisel and romanov.
# Import packages
library("scRNAseq")
library("Coralysis")
library("SingleCellExperiment")
# Import data
zeisel <- ZeiselBrainData()
romanov <- RomanovBrainData()
The datasets were normalized with the scater
function logNormCounts()
below. Additionaly they were merged based on shared genes and later subset to 2,000 highly variable genes with the scran
function getTopHVGs()
. Cell type labels correspond to level 1 class annotations provided in both SingleCellExperiment
objects. They were stored under cell_type
in colData(sce)
, while batch labels (i.e., Zeisel et al.
or Romanov et al.
) were saved under batch
in colData(sce)
.
# Normalize
zeisel <- scater::logNormCounts(zeisel)
counts(zeisel) <- NULL
altExps(zeisel) <- list()
romanov <- scater::logNormCounts(romanov)
counts(romanov) <- NULL
# Intersected genes
genes <- intersect(row.names(zeisel), row.names(romanov))
zeisel <- zeisel[genes, ]
romanov <- romanov[genes, ]
# Remove colData
colData(zeisel) <- colData(zeisel)[, "level1class", drop = FALSE]
colnames(colData(zeisel)) <- "cell_type"
colData(romanov) <- colData(romanov)[, "level1 class", drop = FALSE]
colnames(colData(romanov)) <- "cell_type"
# Batch
zeisel[["batch"]] <- "Zeisel et al."
romanov[["batch"]] <- "Romanov et al."
# Merge cells
sce <- cbind(zeisel, romanov)
rm(list = c("zeisel", "romanov"))
# HVG
hvg <- scran::getTopHVGs(sce, n = 2000)
sce <- sce[hvg, ]
The PrepareData()
function checks whether a sparse matrix is available in the logcounts
assay (which corresponds to log-normalized data) and removes non-expressed features preparing the object to run RunParallelDivisiveICP()
next.
# Prepare data:
# checks 'logcounts' format & removes non-expressed genes
sce <- PrepareData(object = sce)
## Data in `logcounts` slot already of `dgCMatrix` class...
## 2000/2000 features remain after filtering features with only zero values.
The multi-level integration algorithm is implemented in the RunParallelDivisiveICP()
function, the main function in Coralysis
. It only requires a SingleCellExperiment
object, which in this case is sce
.
To perform integration across a batch, a batch.label
available in colData(sce)
must be provided. In this case, it is "batch"
, created above to hold the information about the brain dataset origin, i.e., Zeisel et al. or Romanov et al.. The ensemble algorithm runs 50 times by default, but for illustrative purposes, this has been reduced to 10 (L=10
).
Two threads are allocated to speed up the process (threads=2
), though by default, the function uses all available system threads. Specify one thread if you prefer not to use any additional threads. In addition, the cells are downsampled to 1,000 cells per batch to speed up the process (icp.batch.size = 1000
). The seed provided in set.seed
is required to ensure the reproducibility of the sampling taken before parallelization. To ensure reproducibility during parallelization, the RNGseed
parameter is set to 123
by default.
The result consists of a set of models and their respective probability matrices (n = 40; log2(k
) * L
), stored in metadata(sce)$coralysis$models
and metadata(sce)$coralysis$joint.probability
, respectively.
# Multi-level integration
set.seed(4591) # reproducibility of downsampling - 'icp.batch.size'
sce <- RunParallelDivisiveICP(
object = sce,
batch.label = "batch",
icp.batch.size = 1000,
L = 10, threads = 2,
RNGseed = 123
)
##
## Downsampling dataset by batch using random sampling.
##
## Building training set...
## Training set successfully built.
##
## Computing cluster seed.
##
## Initializing divisive ICP clustering...
##
|
| | 0%
|
|======= | 10%
|
|============== | 20%
|
|===================== | 30%
|
|============================ | 40%
|
|=================================== | 50%
|
|========================================== | 60%
|
|================================================= | 70%
|
|======================================================== | 80%
|
|=============================================================== | 90%
|
|======================================================================| 100%
##
## Divisive ICP clustering completed successfully.
##
## Predicting cell cluster probabilities using ICP models...
## Prediction of cell cluster probabilities completed successfully.
##
## Multi-level integration completed successfully.
The integrated result of Coralysis
consist of an integrated embedding which can be obtained by running the function RunPCA
. This integrated PCA can, in turn, be used downstream for clustering or non-linear dimensional reduction techniques, such as RunTSNE
or RunUMAP
. The function RunPCA
runs by default the PCA method implemented the R
package irlba
(pca.method="irlba"
), which requires a seed to ensure the same PCA result.
# Integrated embedding
set.seed(125)
sce <- RunPCA(object = sce)
## Divisive ICP: selecting ICP tables multiple of 4
Compute UMAP by running the function RunUMAP()
.
# UMAP
set.seed(1204)
sce <- RunUMAP(object = sce, min_dist = 0.5, n_neighbors = 15)
Finally, the integration can be visually inspected by highlighting the batch and cell type labels into the UMAP projection. Coralysis
is sensitive enough to integrate cell types shared across datasets such as oligodendrocytes, microglia and endothelial cells, but not ependymal or pyramidal cells. This difference arises because Romanov et al., 2017 focused specifically on the hypothalamus, whereas Zeisel et al., 2018 provided a broader characterization of the entire mouse nervous system, including multiple brain regions.
# Visualize categorical variables integrated emb.
vars <- c("batch", "cell_type")
plots <- lapply(X = vars, FUN = function(x) {
PlotDimRed(
object = sce, color.by = x,
point.size = 0.25, point.stroke = 0.5,
legend.nrow = 4
)
})
cowplot::plot_grid(plotlist = plots, ncol = 2, align = "vh") # join plots together
To allow a better distinction between batches, plot cell types by batch below.
plots[[2]] + ggplot2::facet_grid(~batch)
Interestingly, Coralysis
integrated VSM (Vascular Smooth Muscle) cells from Romanov et al., 2017 with a subset of endothelial mural cells from Zeisel et al., 2018. Gene expression visualization of cell type markers for VSM — Acta2, Myh11, Tagln — and endothelial — Cdh5, Pecam1 — cells support this integration.
# Visualization of gene expression markers
markers <- c("Acta2", "Myh11", "Tagln", "Cdh5", "Pecam1")
plot.markers <- lapply(markers, function(x) {
PlotExpression(sce, color.by = x, point.size = 0.5, point.stroke = 0.5) +
ggplot2::facet_grid(~batch)
})
cowplot::plot_grid(plotlist = plot.markers, ncol = 2)
Graph-based clustering can be obtained by running the scran
function clusterCells()
using the Coralysis
integrated embedding. The clustering result can be saved directly into the SingleCellExperiment
object (sce$cluster
).
# Graph-based clustering
set.seed(123)
sce$cluster <- scran::clusterCells(sce,
use.dimred = "PCA",
BLUSPARAM = bluster::SNNGraphParam(k = 30, cluster.fun = "louvain")
)
# Plot the clustering result
plots[[3]] <- PlotDimRed(
object = sce, color.by = "cluster",
point.size = 0.25, point.stroke = 0.5,
legend.nrow = 4
)
plots[[3]]
The comparison below highlights the original cell type annotations and Louvain clustering obtained with the Coralysis
integrated embedding, allowing us to assess the extent of cell population overlap between datasets. For instance, Coralysis
integrated shared populations (e.g., clusters 11, 12 and 14), identified similar but unshared cell populations (e.g., clusters 10 versus 12), and preserved unique populations specific to each dataset (e.g., clusters 7 and 13).
cowplot::plot_grid(
plots[[2]] + ggplot2::facet_grid(~batch),
plots[[3]] + ggplot2::facet_grid(~batch),
ncol = 1
)
Next, we can examine the gene coefficients from the ICP models that best explain the clusters identified above, allowing us to explore the expression of genes that most strongly support these clusters. Below we will look into the identified similar but unshared cell clusters 10 and 12.
Gene coefficients can be obtained for the cluster
label through majority voting with the function MajorityVotingFeatures()
. The majority voting is performed by searching for the most representative cluster for a given cell label across all possible clusters (i.e., across all icp runs and rounds). The most representative cluster corresponds to the cluster with the highest majority voting score. This corresponds to the geometric mean between the proportion of cells from the given label intersected with a cluster and the proportion of cells from that same cluster that intersected with the cells from the given label. The higher the score the better. A cluster that scores 1 indicates that all its cells correspond to the assigned label, and vice versa; i.e., all the cells with the assigned label belong to this cluster. For example, MajorityVotingFeatures()
by providing the colData
variable "cluster"
(i.e., label="cluster"
).
# Get label gene coefficients by majority voting
clts.gene.coeff <- MajorityVotingFeatures(object = sce, label = "cluster")
# Print summary
clts.gene.coeff$summary
## label icp_run icp_round cluster score
## 1 1 7 4 15 0.7321750
## 2 2 5 4 11 0.9278696
## 3 3 1 3 6 0.7120689
## 4 4 5 4 16 0.6855573
## 5 5 8 3 4 0.9170213
## 6 6 5 4 4 0.9449112
## 7 7 4 4 14 0.7893239
## 8 8 8 3 8 0.8476514
## 9 9 4 4 13 0.8389856
## 10 10 10 4 10 0.9493106
## 11 11 9 4 4 0.8929196
## 12 12 4 4 11 0.8897075
## 13 13 9 4 1 0.8881332
## 14 14 1 3 1 0.9253346
Visualize below the expression for the top 6 positive coefficients for the identified similar but unshared cell populations: clusters 10 and 12.
## Visualize top coefficients for cluster 10 and 12
clts <- c("10", "12")
plot.coeff.markers <- list()
for (i in clts) {
top.markers <- clts.gene.coeff$feature_coeff[[i]]
top.markers <- top.markers[order(top.markers[,2], decreasing = TRUE), ]
top.markers <- top.markers[1:6, "feature"]
plot.coeff.markers[[i]] <- lapply(top.markers, function(x) {
PlotExpression(sce, color.by = x, point.size = 0.3, point.stroke = 0.5) +
ggplot2::facet_grid(~batch)
})
}
Visualize the expression of the top 6 positive coefficients for cluster 10. The top 6 positive coefficients identified are highly specific to cluster 10.
# Cluster 10
cowplot::plot_grid(plotlist = plot.coeff.markers[[1]], ncol = 3)
Visualize the expression of the top six positive coefficients for cluster 12. Among the highly specific positive coefficients, myelination-related genes Ptgds and Mal are enriched in cluster 12 compared to cluster 10, where they are practically absent. This suggests that cluster 12 represents a mature, myelinating oligodendrocyte population, in contrast to the more immature cluster 10.
# Cluster 12
cowplot::plot_grid(plotlist = plot.coeff.markers[[2]], ncol = 3)
Overall, ICP gene coefficients supported, to some extent, the identification of similar but unshared cell populations (clusters 10 and 12) with Coralysis
.
# R session
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ensembldb_2.33.0 AnnotationFilter_1.33.0
## [3] GenomicFeatures_1.61.0 AnnotationDbi_1.71.0
## [5] ComplexHeatmap_2.25.0 Coralysis_0.99.6
## [7] scRNAseq_2.23.0 scater_1.37.0
## [9] ggplot2_3.5.2 scuttle_1.19.0
## [11] SingleCellExperiment_1.31.0 SummarizedExperiment_1.39.0
## [13] Biobase_2.69.0 GenomicRanges_1.61.0
## [15] GenomeInfoDb_1.45.3 IRanges_2.43.0
## [17] S4Vectors_0.47.0 BiocGenerics_0.55.0
## [19] generics_0.1.3 MatrixGenerics_1.21.0
## [21] matrixStats_1.5.0 dplyr_1.1.4
## [23] BiocStyle_2.37.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 BiocIO_1.19.0 bitops_1.0-9
## [4] filelock_1.0.3 tibble_3.2.1 XML_3.99-0.18
## [7] lifecycle_1.0.4 httr2_1.1.2 aricode_1.0.3
## [10] edgeR_4.7.2 doParallel_1.0.17 lattice_0.22-7
## [13] alabaster.base_1.9.0 magrittr_2.0.3 limma_3.65.0
## [16] sass_0.4.10 rmarkdown_2.29 jquerylib_0.1.4
## [19] yaml_2.3.10 metapod_1.17.0 askpass_1.2.1
## [22] reticulate_1.42.0 cowplot_1.1.3 LiblineaR_2.10-24
## [25] DBI_1.2.3 RColorBrewer_1.1-3 abind_1.4-8
## [28] purrr_1.0.4 RCurl_1.98-1.17 rappdirs_0.3.3
## [31] circlize_0.4.16 ggrepel_0.9.6 irlba_2.3.5.1
## [34] alabaster.sce_1.9.0 pheatmap_1.0.12 umap_0.2.10.0
## [37] RSpectra_0.16-2 dqrng_0.4.1 codetools_0.2-20
## [40] DelayedArray_0.35.1 shape_1.4.6.1 tidyselect_1.2.1
## [43] UCSC.utils_1.5.0 farver_2.1.2 ScaledMatrix_1.17.0
## [46] viridis_0.6.5 BiocFileCache_2.99.0 GenomicAlignments_1.45.0
## [49] jsonlite_2.0.0 GetoptLong_1.0.5 BiocNeighbors_2.3.0
## [52] iterators_1.0.14 foreach_1.5.2 tools_4.5.0
## [55] Rcpp_1.0.14 glue_1.8.0 gridExtra_2.3
## [58] SparseArray_1.9.0 xfun_0.52 flexclust_1.5.0
## [61] HDF5Array_1.37.0 gypsum_1.5.0 withr_3.0.2
## [64] BiocManager_1.30.25 fastmap_1.2.0 rhdf5filters_1.21.0
## [67] bluster_1.19.0 openssl_2.3.2 SparseM_1.84-2
## [70] digest_0.6.37 rsvd_1.0.5 R6_2.6.1
## [73] colorspace_2.1-1 Cairo_1.6-2 dichromat_2.0-0.1
## [76] RSQLite_2.3.11 h5mread_1.1.0 rtracklayer_1.69.0
## [79] class_7.3-23 httr_1.4.7 S4Arrays_1.9.0
## [82] uwot_0.2.3 pkgconfig_2.0.3 gtable_0.3.6
## [85] modeltools_0.2-24 blob_1.2.4 XVector_0.49.0
## [88] htmltools_0.5.8.1 bookdown_0.43 clue_0.3-66
## [91] ProtGenerics_1.41.0 scales_1.4.0 alabaster.matrix_1.9.0
## [94] png_0.1-8 scran_1.37.0 knitr_1.50
## [97] reshape2_1.4.4 rjson_0.2.23 curl_6.2.2
## [100] GlobalOptions_0.1.2 cachem_1.1.0 rhdf5_2.53.0
## [103] stringr_1.5.1 BiocVersion_3.22.0 parallel_4.5.0
## [106] vipor_0.4.7 ggrastr_1.0.2 restfulr_0.0.15
## [109] pillar_1.10.2 alabaster.schemas_1.9.0 vctrs_0.6.5
## [112] RANN_2.6.2 BiocSingular_1.25.0 dbplyr_2.5.0
## [115] beachmat_2.25.0 cluster_2.1.8.1 beeswarm_0.4.0
## [118] evaluate_1.0.3 magick_2.8.6 tinytex_0.57
## [121] cli_3.6.5 locfit_1.5-9.12 compiler_4.5.0
## [124] Rsamtools_2.25.0 rlang_1.1.6 crayon_1.5.3
## [127] labeling_0.4.3 plyr_1.8.9 ggbeeswarm_0.7.2
## [130] stringi_1.8.7 viridisLite_0.4.2 alabaster.se_1.9.0
## [133] BiocParallel_1.43.0 Biostrings_2.77.0 lazyeval_0.2.2
## [136] Matrix_1.7-3 ExperimentHub_2.99.0 sparseMatrixStats_1.21.0
## [139] bit64_4.6.0-1 Rhdf5lib_1.31.0 KEGGREST_1.49.0
## [142] statmod_1.5.0 alabaster.ranges_1.9.0 AnnotationHub_3.99.1
## [145] igraph_2.1.4 memoise_2.0.1 bslib_0.9.0
## [148] bit_4.6.0
Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. https://www.nature.com/articles/s41592-019-0654-x.
Smolander J, Junttila S, Venäläinen MS, Elo LL (2021) “ILoReg: a tool for high-resolution cell population identification from single-cell RNA-seq data.” Bioinformatics, 37(8), 1107–1114. https://doi.org/10.1093/bioinformatics/btaa919
Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res., 5, 2122. doi:10.12688/f1000research.9501.2.
McCarthy DJ, Campbell KR, Lun ATL, Willis QF (2017). “Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R.” Bioinformatics, 33, 1179-1186. doi:10.1093/bioinformatics/btw777.
Romanov RA, Zeisel A, Bakker J, Girach F, Hellysaz A, Tomer R, Alpár A, Mulder J, Clotman F, Keimpema E, Hsueh B, Crow AK, Martens H, Schwindling C, Calvigioni D, Bains JS, Máté Z, Szabó G, Yanagawa Y, Zhang M, Rendeiro A, Farlik M, Uhlén M, Wulff P, Bock C, Broberger C, Deisseroth K, Hökfelt T, Linnarsson S, Horvath TL, Harkany T (2017). “A novel organizing principle of the hypothalamus reveals molecularly segregated periventricular dopamine neurons.” Nature neuroscience, 20(2), 176. https://doi.org/10.1038/nn.4462
Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023
Wickham H (2016). “ggplot2: Elegant Graphics for Data Analysis.” Springer-Verlag New York.
Zeisel A, Hochgerner H, Lönnerberg P, Johnsson A, Memic F, Van Der Zwan J, Häring M, Braun E, Borm LE, La Manno G, Codeluppi S, Furlan A, Lee K, Skene N, Harris KD, Hjerling-Leffler J, Arenas E, Ernfors P, Marklund U, Linnarsson S (2018). “Molecular architecture of the mouse nervous system.” Cell, 174(4), 999-1014. https://doi.org/10.1016/j.cell.2018.06.021