Coralysis: sensitive integration of single-cell data

0.1 Introduction
0.2 Installation
0.3 Dataset
0.4 Preprocess
0.5 Multi-level integration
0.6 Integrated embedding
0.7 Nonlinear dimensional reduction
0.8 R session
0.9 References

0.1 Introduction

Coralysis is an R package for sensitive integration, reference-mapping and cell-state identification of single-cell data (Sousa et al., 2025). It relies on an adapted version of our previously introduced Iterative Clustering Projection (ICP) algorithm (Smolander et al., 2021) to identify shared cell clusters across heterogeneous datasets by leveraging multiple rounds of divisive clustering.

Inspired by the process of assembling a puzzle - where one begins by grouping pieces based on low-to high-level features, such as color and shading, before looking into shape and patterns - this multi-level integration algorithm progressively blends the batch effects while separating cell types across multiple runs of divisive clustering. The trained ICP models can then be used for various purposes, including prediction of cluster identities of related, unannotated single-cell datasets through reference-mapping, and inference of cell states and their differential expression programs using the cell cluster probabilities that represent the likelihood of each cell belonging to each cluster.

While state-of-the-art single-cell integration methods often struggle with imbalanced cell types across heterogeneous datasets, Coralysis effectively differentiates similar yet unshared cell types across batches (Sousa et al., 2025).

0.2 Installation

Coralysis can be installed from Bioconductor by running the following commands:

# Installation
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Coralysis")

0.3 Dataset

To quickly demonstrate the ability of Coralysis’s multi-level integration algorithm to integrate data that is unevenly distributed across batches or completely absent from some datasets, we selected two brain datasets — zeisel (Zeisel et al., 2018) and romanov (Romanov et al., 2017). These datasets share only a few cell types, with some similar yet distinct cell types across datasets, such as interneurons in zeisel and neurons in romanov. This makes them an ideal example to highlight the strengths of Coralysis.

The zeisel (Zeisel et al., 2018) and romanov datasets can be imported as SingleCellExperiment objects via the scRNAseq R package using the functions ZeiselBrainData() and RomanovBrainData(). Run the R code below to load the packages Coralysis, scRNAseq and SingleCellExperiment as well as the datasets zeisel and romanov.

# Import packages
library("scRNAseq")
library("Coralysis")
library("SingleCellExperiment")

# Import data
zeisel <- ZeiselBrainData()
romanov <- RomanovBrainData()

0.4 Preprocess

The datasets were normalized with the scater function logNormCounts() below. Additionaly they were merged based on shared genes and later subset to 2,000 highly variable genes with the scran function getTopHVGs(). Cell type labels correspond to level 1 class annotations provided in both SingleCellExperiment objects. They were stored under cell_type in colData(sce), while batch labels (i.e., Zeisel et al. or Romanov et al.) were saved under batch in colData(sce).

# Normalize
zeisel <- scater::logNormCounts(zeisel)
counts(zeisel) <- NULL
altExps(zeisel) <- list()
romanov <- scater::logNormCounts(romanov)
counts(romanov) <- NULL

# Intersected genes
genes <- intersect(row.names(zeisel), row.names(romanov))
zeisel <- zeisel[genes, ]
romanov <- romanov[genes, ]

# Remove colData
colData(zeisel) <- colData(zeisel)[, "level1class", drop = FALSE]
colnames(colData(zeisel)) <- "cell_type"
colData(romanov) <- colData(romanov)[, "level1 class", drop = FALSE]
colnames(colData(romanov)) <- "cell_type"

# Batch
zeisel[["batch"]] <- "Zeisel et al."
romanov[["batch"]] <- "Romanov et al."

# Merge cells
sce <- cbind(zeisel, romanov)
rm(list = c("zeisel", "romanov"))

# HVG
hvg <- scran::getTopHVGs(sce, n = 2000)
sce <- sce[hvg, ]

The PrepareData() function checks whether a sparse matrix is available in the logcounts assay (which corresponds to log-normalized data) and removes non-expressed features preparing the object to run RunParallelDivisiveICP() next.

# Prepare data:
# checks 'logcounts' format & removes non-expressed genes
sce <- PrepareData(object = sce)

## Data in `logcounts` slot already of `dgCMatrix` class...

## 2000/2000 features remain after filtering features with only zero values.

0.5 Multi-level integration

The multi-level integration algorithm is implemented in the RunParallelDivisiveICP() function, the main function in Coralysis. It only requires a SingleCellExperiment object, which in this case is sce.

To perform integration across a batch, a batch.label available in colData(sce) must be provided. In this case, it is "batch", created above to hold the information about the brain dataset origin, i.e., Zeisel et al. or Romanov et al.. The ensemble algorithm runs 50 times by default, but for illustrative purposes, this has been reduced to 10 (L=10).

Two threads are allocated to speed up the process (threads=2), though by default, the function uses all available system threads. Specify one thread if you prefer not to use any additional threads. In addition, the cells are downsampled to 1,000 cells per batch to speed up the process (icp.batch.size = 1000). The seed provided in set.seed is required to ensure the reproducibility of the sampling taken before parallelization. To ensure reproducibility during parallelization, the RNGseed parameter is set to 123 by default.

The result consists of a set of models and their respective probability matrices (n = 40; log2(k) * L), stored in metadata(sce)$coralysis$models and metadata(sce)$coralysis$joint.probability, respectively.

# Multi-level integration
set.seed(4591) # reproducibility of downsampling - 'icp.batch.size'
sce <- RunParallelDivisiveICP(
    object = sce,
    batch.label = "batch",
    icp.batch.size = 1000,
    L = 10, threads = 2,  
    RNGseed = 123
)

## 
## Downsampling dataset by batch using random sampling.

## 
## Building training set...

## Training set successfully built.

## 
## Computing cluster seed.

## 
## Initializing divisive ICP clustering...

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |======================================================================| 100%

## 
## Divisive ICP clustering completed successfully.

## 
## Predicting cell cluster probabilities using ICP models...

## Prediction of cell cluster probabilities completed successfully.

## 
## Multi-level integration completed successfully.

0.6 Integrated embedding

The integrated result of Coralysis consist of an integrated embedding which can be obtained by running the function RunPCA. This integrated PCA can, in turn, be used downstream for clustering or non-linear dimensional reduction techniques, such as RunTSNE or RunUMAP. The function RunPCA runs by default the PCA method implemented the R package irlba (pca.method="irlba"), which requires a seed to ensure the same PCA result.

# Integrated embedding
set.seed(125)
sce <- RunPCA(object = sce)

## Divisive ICP: selecting ICP tables multiple of 4

0.7 Nonlinear dimensional reduction

Compute UMAP by running the function RunUMAP().

# UMAP
set.seed(1204)
sce <- RunUMAP(object = sce, min_dist = 0.5, n_neighbors = 15)

0.7.1 Visualize batch & cell types

Finally, the integration can be visually inspected by highlighting the batch and cell type labels into the UMAP projection. Coralysis is sensitive enough to integrate cell types shared across datasets such as oligodendrocytes, microglia and endothelial cells, but not ependymal or pyramidal cells. This difference arises because Romanov et al., 2017 focused specifically on the hypothalamus, whereas Zeisel et al., 2018 provided a broader characterization of the entire mouse nervous system, including multiple brain regions.

# Visualize categorical variables integrated emb.
vars <- c("batch", "cell_type")
plots <- lapply(X = vars, FUN = function(x) {
    PlotDimRed(
        object = sce, color.by = x,
        point.size = 0.25, point.stroke = 0.5,
        legend.nrow = 4
    )
})

## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
## Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?

cowplot::plot_grid(plotlist = plots, ncol = 2, align = "vh") # join plots together

To allow a better distinction between batches, plot cell types by batch below.

plots[[2]] + ggplot2::facet_grid(~batch)

Interestingly, Coralysis integrated VSM (Vascular Smooth Muscle) cells from Romanov et al., 2017 with a subset of endothelial mural cells from Zeisel et al., 2018. Gene expression visualization of cell type markers for VSM — Acta2, Myh11, Tagln — and endothelial — Cdh5, Pecam1 — cells support this integration.

# Visualization of gene expression markers
markers <- c("Acta2", "Myh11", "Tagln", "Cdh5", "Pecam1")
plot.markers <- lapply(markers, function(x) {
    PlotExpression(sce, color.by = x, point.size = 0.5, point.stroke = 0.5) +
        ggplot2::facet_grid(~batch)
})
cowplot::plot_grid(plotlist = plot.markers, ncol = 2)

0.7.2 Graph-based clustering

Graph-based clustering can be obtained by running the scran function clusterCells() using the Coralysis integrated embedding. The clustering result can be saved directly into the SingleCellExperiment object (sce$cluster).

# Graph-based clustering
set.seed(123)
sce$cluster <- scran::clusterCells(sce,
    use.dimred = "PCA",
    BLUSPARAM = bluster::SNNGraphParam(k = 30, cluster.fun = "louvain")
)

# Plot the clustering result
plots[[3]] <- PlotDimRed(
    object = sce, color.by = "cluster",
    point.size = 0.25, point.stroke = 0.5,
    legend.nrow = 4
)

## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?

plots[[3]]

The comparison below highlights the original cell type annotations and Louvain clustering obtained with the Coralysis integrated embedding, allowing us to assess the extent of cell population overlap between datasets. For instance, Coralysis integrated shared populations (e.g., clusters 11, 12 and 14), identified similar but unshared cell populations (e.g., clusters 10 versus 12), and preserved unique populations specific to each dataset (e.g., clusters 7 and 13).

cowplot::plot_grid(
    plots[[2]] + ggplot2::facet_grid(~batch),
    plots[[3]] + ggplot2::facet_grid(~batch),
    ncol = 1
)

0.7.3 Gene coefficients

Next, we can examine the gene coefficients from the ICP models that best explain the clusters identified above, allowing us to explore the expression of genes that most strongly support these clusters. Below we will look into the identified similar but unshared cell clusters 10 and 12.

Gene coefficients can be obtained for the cluster label through majority voting with the function MajorityVotingFeatures(). The majority voting is performed by searching for the most representative cluster for a given cell label across all possible clusters (i.e., across all icp runs and rounds). The most representative cluster corresponds to the cluster with the highest majority voting score. This corresponds to the geometric mean between the proportion of cells from the given label intersected with a cluster and the proportion of cells from that same cluster that intersected with the cells from the given label. The higher the score the better. A cluster that scores 1 indicates that all its cells correspond to the assigned label, and vice versa; i.e., all the cells with the assigned label belong to this cluster. For example, MajorityVotingFeatures() by providing the colData variable "cluster" (i.e., label="cluster").

# Get label gene coefficients by majority voting
clts.gene.coeff <- MajorityVotingFeatures(object = sce, label = "cluster")

# Print summary
clts.gene.coeff$summary

##    label icp_run icp_round cluster     score
## 1      1       7         4      15 0.7321750
## 2      2       5         4      11 0.9278696
## 3      3       1         3       6 0.7120689
## 4      4       5         4      16 0.6855573
## 5      5       8         3       4 0.9170213
## 6      6       5         4       4 0.9449112
## 7      7       4         4      14 0.7893239
## 8      8       8         3       8 0.8476514
## 9      9       4         4      13 0.8389856
## 10    10      10         4      10 0.9493106
## 11    11       9         4       4 0.8929196
## 12    12       4         4      11 0.8897075
## 13    13       9         4       1 0.8881332
## 14    14       1         3       1 0.9253346

Visualize below the expression for the top 6 positive coefficients for the identified similar but unshared cell populations: clusters 10 and 12.

## Visualize top coefficients for cluster 10 and 12
clts <- c("10", "12")
plot.coeff.markers <- list()
for (i in clts) {
  top.markers <- clts.gene.coeff$feature_coeff[[i]]
  top.markers <- top.markers[order(top.markers[,2], decreasing = TRUE), ]
  top.markers <- top.markers[1:6, "feature"]
  plot.coeff.markers[[i]] <- lapply(top.markers, function(x) {
      PlotExpression(sce, color.by = x, point.size = 0.3, point.stroke = 0.5) +
          ggplot2::facet_grid(~batch)
  })
}

Visualize the expression of the top 6 positive coefficients for cluster 10. The top 6 positive coefficients identified are highly specific to cluster 10.

# Cluster 10
cowplot::plot_grid(plotlist = plot.coeff.markers[[1]], ncol = 3)

Visualize the expression of the top six positive coefficients for cluster 12. Among the highly specific positive coefficients, myelination-related genes Ptgds and Mal are enriched in cluster 12 compared to cluster 10, where they are practically absent. This suggests that cluster 12 represents a mature, myelinating oligodendrocyte population, in contrast to the more immature cluster 10.

# Cluster 12
cowplot::plot_grid(plotlist = plot.coeff.markers[[2]], ncol = 3)

Overall, ICP gene coefficients supported, to some extent, the identification of similar but unshared cell populations (clusters 10 and 12) with Coralysis.

0.8 R session

# R session
sessionInfo()

## R Under development (unstable) (2025-10-20 r88955)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ensembldb_2.35.0            AnnotationFilter_1.35.0    
##  [3] GenomicFeatures_1.63.1      AnnotationDbi_1.73.0       
##  [5] ComplexHeatmap_2.27.0       Coralysis_1.1.0            
##  [7] scRNAseq_2.25.0             scater_1.39.0              
##  [9] ggplot2_4.0.0               scuttle_1.21.0             
## [11] SingleCellExperiment_1.33.0 SummarizedExperiment_1.41.0
## [13] Biobase_2.71.0              GenomicRanges_1.63.0       
## [15] Seqinfo_1.1.0               IRanges_2.45.0             
## [17] S4Vectors_0.49.0            BiocGenerics_0.57.0        
## [19] generics_0.1.4              MatrixGenerics_1.23.0      
## [21] matrixStats_1.5.0           dplyr_1.1.4                
## [23] BiocStyle_2.39.0           
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22         BiocIO_1.21.0            bitops_1.0-9            
##   [4] filelock_1.0.3           tibble_3.3.0             XML_3.99-0.19           
##   [7] lifecycle_1.0.4          httr2_1.2.1              aricode_1.0.3           
##  [10] edgeR_4.9.0              doParallel_1.0.17        lattice_0.22-7          
##  [13] alabaster.base_1.11.1    magrittr_2.0.4           limma_3.67.0            
##  [16] sass_0.4.10              rmarkdown_2.30           jquerylib_0.1.4         
##  [19] yaml_2.3.10              metapod_1.19.0           askpass_1.2.1           
##  [22] reticulate_1.44.0        cowplot_1.2.0            LiblineaR_2.10-24       
##  [25] DBI_1.2.3                RColorBrewer_1.1-3       abind_1.4-8             
##  [28] purrr_1.1.0              RCurl_1.98-1.17          rappdirs_0.3.3          
##  [31] circlize_0.4.16          ggrepel_0.9.6            irlba_2.3.5.1           
##  [34] alabaster.sce_1.11.0     pheatmap_1.0.13          umap_0.2.10.0           
##  [37] RSpectra_0.16-2          dqrng_0.4.1              codetools_0.2-20        
##  [40] DelayedArray_0.37.0      shape_1.4.6.1            tidyselect_1.2.1        
##  [43] UCSC.utils_1.7.0         farver_2.1.2             ScaledMatrix_1.19.0     
##  [46] viridis_0.6.5            BiocFileCache_3.1.0      GenomicAlignments_1.47.0
##  [49] jsonlite_2.0.0           GetoptLong_1.0.5         BiocNeighbors_2.5.0     
##  [52] iterators_1.0.14         foreach_1.5.2            tools_4.6.0             
##  [55] Rcpp_1.1.0               glue_1.8.0               gridExtra_2.3           
##  [58] SparseArray_1.11.1       xfun_0.54                flexclust_1.5.0         
##  [61] GenomeInfoDb_1.47.0      HDF5Array_1.39.0         gypsum_1.7.0            
##  [64] withr_3.0.2              BiocManager_1.30.26      fastmap_1.2.0           
##  [67] rhdf5filters_1.23.0      bluster_1.21.0           openssl_2.3.4           
##  [70] SparseM_1.84-2           digest_0.6.37            rsvd_1.0.5              
##  [73] R6_2.6.1                 colorspace_2.1-2         Cairo_1.7-0             
##  [76] dichromat_2.0-0.1        RSQLite_2.4.3            cigarillo_1.1.0         
##  [79] h5mread_1.3.0            rtracklayer_1.71.0       class_7.3-23            
##  [82] httr_1.4.7               S4Arrays_1.11.0          uwot_0.2.3              
##  [85] pkgconfig_2.0.3          gtable_0.3.6             modeltools_0.2-24       
##  [88] blob_1.2.4               S7_0.2.0                 XVector_0.51.0          
##  [91] htmltools_0.5.8.1        bookdown_0.45            clue_0.3-66             
##  [94] ProtGenerics_1.43.0      scales_1.4.0             alabaster.matrix_1.11.0 
##  [97] png_0.1-8                scran_1.39.0             knitr_1.50              
## [100] reshape2_1.4.4           rjson_0.2.23             curl_7.0.0              
## [103] GlobalOptions_0.1.2      cachem_1.1.0             rhdf5_2.55.4            
## [106] stringr_1.5.2            BiocVersion_3.23.1       parallel_4.6.0          
## [109] vipor_0.4.7              ggrastr_1.0.2            restfulr_0.0.16         
## [112] pillar_1.11.1            alabaster.schemas_1.11.0 vctrs_0.6.5             
## [115] RANN_2.6.2               BiocSingular_1.27.0      dbplyr_2.5.1            
## [118] beachmat_2.27.0          cluster_2.1.8.1          beeswarm_0.4.0          
## [121] evaluate_1.0.5           magick_2.9.0             tinytex_0.57            
## [124] cli_3.6.5                locfit_1.5-9.12          compiler_4.6.0          
## [127] Rsamtools_2.27.0         rlang_1.1.6              crayon_1.5.3            
## [130] labeling_0.4.3           plyr_1.8.9               ggbeeswarm_0.7.2        
## [133] stringi_1.8.7            viridisLite_0.4.2        alabaster.se_1.11.0     
## [136] BiocParallel_1.45.0      Biostrings_2.79.1        lazyeval_0.2.2          
## [139] Matrix_1.7-4             ExperimentHub_3.1.0      sparseMatrixStats_1.23.0
## [142] bit64_4.6.0-1            Rhdf5lib_1.33.0          KEGGREST_1.51.0         
## [145] statmod_1.5.1            alabaster.ranges_1.11.0  AnnotationHub_4.1.0     
## [148] igraph_2.2.1             memoise_2.0.1            bslib_0.9.0             
## [151] bit_4.6.0

0.9 References

Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. https://www.nature.com/articles/s41592-019-0654-x.

Smolander J, Junttila S, Venäläinen MS, Elo LL (2021) “ILoReg: a tool for high-resolution cell population identification from single-cell RNA-seq data.” Bioinformatics, 37(8), 1107–1114. https://doi.org/10.1093/bioinformatics/btaa919

Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res., 5, 2122. doi:10.12688/f1000research.9501.2.

McCarthy DJ, Campbell KR, Lun ATL, Willis QF (2017). “Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R.” Bioinformatics, 33, 1179-1186. doi:10.1093/bioinformatics/btw777.

Romanov RA, Zeisel A, Bakker J, Girach F, Hellysaz A, Tomer R, Alpár A, Mulder J, Clotman F, Keimpema E, Hsueh B, Crow AK, Martens H, Schwindling C, Calvigioni D, Bains JS, Máté Z, Szabó G, Yanagawa Y, Zhang M, Rendeiro A, Farlik M, Uhlén M, Wulff P, Bock C, Broberger C, Deisseroth K, Hökfelt T, Linnarsson S, Horvath TL, Harkany T (2017). “A novel organizing principle of the hypothalamus reveals molecularly segregated periventricular dopamine neurons.” Nature neuroscience, 20(2), 176. https://doi.org/10.1038/nn.4462

Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023

Wickham H (2016). “ggplot2: Elegant Graphics for Data Analysis.” Springer-Verlag New York.

Zeisel A, Hochgerner H, Lönnerberg P, Johnsson A, Memic F, Van Der Zwan J, Häring M, Braun E, Borm LE, La Manno G, Codeluppi S, Furlan A, Lee K, Skene N, Harris KD, Hjerling-Leffler J, Arenas E, Ernfors P, Marklund U, Linnarsson S (2018). “Molecular architecture of the mouse nervous system.” Cell, 174(4), 999-1014. https://doi.org/10.1016/j.cell.2018.06.021