Cell states

0.1 Dataset
0.2 Normalization
0.3 HVG selection
0.4 Multi-level integration
0.5 Dimensional reduction
0.6 Cell state identification
0.7 ICP clusters
0.8 Graph-based clustering
0.9 Cell cluster probability bins
0.10 Differential expression programs
0.11 R session
0.12 References

0.1 Dataset

The single-cell RNA sequencing dataset of mammary epithelial cells published by Bach et al., (2017) will be used to demonstrated how cell cluster probability can be used to identify cell states and associated differential gene expression programs.

This dataset was imported as a SingleCellExperiment object via the scRNAseq R package using the function BachMammaryData(). Additionally, the dataset was normalized with the scater function logNormCounts(), and then subset to 2,000 highly variable genes with the scran function getTopHVGs().

# Import packages
library("dplyr")
library("scater")
library("ggplot2")
library("scRNAseq")
library("Coralysis")
library("ComplexHeatmap")
library("SingleCellExperiment")

# Import object
sce <- BachMammaryData()
colnames(sce) <- paste0("cell", 1:ncol(sce)) # create cell names

0.2 Normalization

Coralysis requires log-normalised data as input. scran normalization can be performed, which is particularly beneficial if rare cell types exist (see the following vignette).

## Normalize the data
set.seed(123)
sce <- scater::logNormCounts(sce)
counts(sce) <- NULL

0.3 HVG selection

Highly variable genes (HVG) can be selected using the R package scran.

# Feature selection with 'scran' package
nhvg <- 500
hvg <- scran::getTopHVGs(sce, n = nhvg)
hvg.idx <- which(row.names(sce) %in% hvg)
sce <- sce[hvg.idx, ]
row.names(sce) <- rowData(sce)$Symbol

0.4 Multi-level integration

Multi-level integration can be performed with Coralysis by running the function RunParallelDivisiveICP(). Since there is not a batch, the function RunParallelDivisiveICP() is run with batch.label=NULL. The RunParallelDivisiveICP() function can be run in parallel by providing the number of threads to reduce the run time. Consult the full documentation of this function with ?RunParallelDivisiveICP. For this example, the algorithm was run 10 times (L = 25), but generally, this number should be higher (with the default being L = 50). A train set can be built to improve the modeling step in Coralysis (by default, build.train.set=TRUE), with the respective named list of parameters provided to build.train.params. In this case, no training set was built; instead, the entire dataset was downsampled to 1,000 cells before each ICP run (icp.batch.size=1000).

# Perform multi-level integration
set.seed(123)
sce <- RunParallelDivisiveICP(object = sce, L = 25, 
                              icp.batch.size = 1000,
                              build.train.set = FALSE,
                              threads = 2)

## WARNING: Setting 'divisive.method' to 'cluster' as 'batch.label=NULL'. 
## If 'batch.label=NULL', 'divisive.method' can be one of: 'cluster', 'random'.

## 
## Computing cluster seed.

## 
## Initializing divisive ICP clustering...

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |======================================================================| 100%

## 
## Divisive ICP clustering completed successfully.

## 
## Predicting cell cluster probabilities using ICP models...

## Prediction of cell cluster probabilities completed successfully.

## 
## Multi-level integration completed successfully.

0.5 Dimensional reduction

Coralysis returns a list of cell cluster probabilities saved at metadata(sce)$coralysis$joint.probability of length equal to the number of icp runs, i.e., L (by default L=20), times the number of icp rounds, i.e., log2(k) (by default k=16, thus 4 rounds of divisive icp). The cell cluster probability can be concatenated to compute a PCA in order to obtain an integrated embedding. By default only the cell cluster probability corresponding to the last icp round is used. The Coralysis integrated embedding can be used downstream for non-linear dimensional reduction, t-SNE or UMAP, and clustering. Set to TRUE the return.model argument to allow reference mapping later.

# Dimensional reduction - unintegrated
set.seed(123)
sce <- RunPCA(
    object = sce, assay.name = "joint.probability",
    return.model = TRUE
)

## Divisive ICP: selecting ICP tables multiple of 4

## Setting 'pca.method' to 'stats' as 'return.model' is TRUE.
## Set 'return.model' to FALSE to use 'method' 'irlba'.

# UMAP
set.seed(123)
sce <- RunUMAP(
    object = sce, umap.method = "uwot",
    dims = 1:30, n_neighbors = 15,
    min_dist = 0.5, return.model = TRUE
)

Check the difference between basal (expressing Krt5) and luminal (expressing Krt18) epithelial cells.

# Detect basal versus luminal cells
markers <- c("Krt5", "Krt18")
plots <- lapply(markers, function(x) {
    PlotExpression(sce,
        color.by = x, point.size = 0.2,
        point.stroke = 0.2, scale.values = TRUE
    )
})
cowplot::plot_grid(plotlist = plots, ncol = 2)

0.6 Cell state identification

The cell cluster probability aggregated by mean or median across the icp runs can be obtained with the function SummariseCellClusterProbability() and plotted to infer transient and steady cell states.

# Summarise cell cluster probability
sce <- SummariseCellClusterProbability(object = sce, icp.round = 4) # save result in 'colData'
# colData(sce) # check the colData

It is clear that the largest basal cell population has the lowest probability compared with the luminal cell populations.

# Plot cell cluster probabilities - mean
# possible options: "mean_probs", "median_probs", "scaled_median_probs"
PlotExpression(
    object = sce, color.by = "scaled_mean_probs",
    color.scale = "viridis", point.size = 0.2,
    point.stroke = 0.1, legend.title = "Mean prob.\n(min-max)"
)

0.7 ICP clusters

The ICP clusters corresponding to the ICP probability table with the highest standard deviation (i.e., 14) for the last clustering round (i.e., 4) was obtained below ("icp_run_round_14_4_clusters") and the respective clusters highlighted in UMAP plot.

# ICP clusters: identify the ICP probability table with the highest standard deviation
probs <- GetCellClusterProbability(object = sce, icp.round = 4, concatenate = FALSE)
probs.sd <- lapply(X = probs, FUN = function(x) {
    sd(x)
})
icp.run <- which.max(probs.sd) # 7
clt <- paste0("icp_run_round_", icp.run, "_4_clusters")
sce[[clt]] <- factor(sce[[clt]], levels = as.character(1:16))
PlotDimRed(sce,
    color.by = clt, point.size = 0.1,
    point.stroke = 0.1, legend.nrow = 5
)

The gene coefficients for the above ICP clusters corresponding to the ICP run no. 14 and clustering round 4 can be retrieved with the function GetFeatureCoefficients(). The positive coefficients at least for one of the clusters were represented in the heatmap plot below.

# Get gene coefficients
gene.coeff <- GetFeatureCoefficients(sce, icp.run = icp.run, icp.round = 4)
row.names(gene.coeff$icp_56) <- gene.coeff$icp_56$feature
gene.coeff$icp_56 <- gene.coeff$icp_56[, -1]

# Plot top positive coefficients in heatmap
positive.coeff <- (rowSums(gene.coeff$icp_56 > 0) > 0)
heat <- ComplexHeatmap::Heatmap(
    matrix = as.matrix(gene.coeff$icp_56)[positive.coeff, ],
    name = "Gene coef.",
    cluster_columns = FALSE, show_row_dend = FALSE, show_row_names = TRUE,
    row_names_gp = grid::gpar(fontsize = 7)
)
print(heat)

One of the largest population of luminal cells is constituted by three ICP clusters (3, 11, 12), one could check below the main coefficients for each one of these clusters.

# Check top gene coefficients per basal cluster: 3, 11, 12
gene.coeff.basal <- gene.coeff$icp_56[positive.coeff, ] %>%
    mutate("gene" = row.names(.)) 
top5.luminal <- top5.luminal.plts <- list()
for (i in as.character(c(3, 11, 12))) {
    clt.no <- paste0("clt", i)
    top5.luminal[[clt.no]] <- gene.coeff.basal %>%
        dplyr::select(all_of(c("gene", paste0("coeff_", clt.no)))) %>%
        arrange(desc(.data[[paste0("coeff_", clt.no)]])) %>%
        head(5) %>%
        pull(gene)
    top5.luminal.plts[[clt.no]] <- cowplot::plot_grid(plotlist = lapply(top5.luminal[[clt.no]], function(x) {
        PlotExpression(sce,
            color.by = x, point.size = 0.1, point.stroke = 0.1,
            scale.values = TRUE
        ) + ggplot2::ggtitle(gsub("clt", "cluster: ", clt.no))
    }), ncol = 5, align = "vh")
}
cowplot::plot_grid(plotlist = top5.luminal.plts, nrow = 3, align = "vh")

0.8 Graph-based clustering

Graph-based clustering can be obtained by running the scran function clusterCells() using the Coralysis integrated embedding. The community detection algorithm used was Louvain with a resolution value of 0.8.

### Graph-based clustering with scran
## Coralysis integrated PCA embedding
bluster.params <- bluster::SNNGraphParam(
    k = 30, cluster.fun = "louvain",
    cluster.args = list(resolution = 0.8)
)
set.seed(1024)
sce$Cluster <- scran::clusterCells(sce,
    use.dimred = "PCA",
    BLUSPARAM = bluster.params
)
PlotDimRed(sce,
    color.by = "Cluster", point.size = 0.2,
    point.stroke = 0.2, legend.nrow = 4
)

0.9 Cell cluster probability bins

Cell cluster probability bins per cluster label can be obtained by running BinCellClusterProbability(), which returns a SingleCellExperiment object with logcounts average per label per cell probability bin. The number of probability bins can be provided to the parameter bins. Below it was used as label the graph-based clustering obtained above and bins set to 30. In addition, the Coralysis probability bins were projected onto single cell UMAP. See the plots below.

# cell states SCE object
sce.bins <- BinCellClusterProbability(sce, label = "Cluster", bins = 30)

# Project Coralysis bins onto single-cell UMAP
sce.bins <- ReferenceMapping(sce, sce.bins, ref.label = "Cluster", project.umap = TRUE)
umap.bins.labels <- PlotDimRed(sce.bins, color.by = "coral_labels")
umap.bins.probs <- PlotExpression(sce.bins,
    color.by = "aggregated_probability_bins",
    color.scale = "viridis",
    legend.title = "Prob. bins"
)
cowplot::plot_grid(umap.bins.labels, umap.bins.probs, ncol = 2, align = "vh")

# Coralysis labels: single cells vs bins
cowplot::plot_grid(
    PlotDimRed(sce,
        color.by = "Cluster", point.size = 0.2,
        point.stroke = 0.2, legend.nrow = 2
    ),
    umap.bins.labels
)

# Coralysis probability: single cells vs bins
cowplot::plot_grid(
    PlotExpression(
        object = sce, color.by = "scaled_mean_probs",
        color.scale = "viridis", point.size = 0.2,
        point.stroke = 0.1, legend.title = "Mean prob.\n(min-max)"
    ),
    umap.bins.probs
)

0.10 Differential expression programs

The correlation between cell cluster probability bins from cluster 10 and averaged gene expression can be obtained with the function CellBinsFeatureCorrelation(). The 30 genes with the highest absolute Pearson correlation were represented in the heatmap below.

# Differential expression programs
corr.features <- CellBinsFeatureCorrelation(object = sce.bins, labels = "10")
top30.corr.clt10 <- corr.features %>%
    arrange(desc(abs(`10`))) %>%
    head(30)

# Heatmap of top 30 genes with the highest absolute Pearson correlation
pick.genes <- row.names(top30.corr.clt10)
mtx <- as.matrix(logcounts(sce.bins[pick.genes, paste0("10_bin", 1:30)]))
mtx <- t(scale(t(mtx)))
col_fun2 <- circlize::colorRamp2(c(0.8, 0.9, 1.0), scales::viridis_pal()(3))
heat.diff.exp <- Heatmap(
    matrix = mtx, name = "Row Z-score\n(normalized data)",
    cluster_rows = TRUE, cluster_columns = FALSE,
    show_column_dend = FALSE, show_row_dend = FALSE,
    use_raster = TRUE, show_column_names = TRUE,
    top_annotation = HeatmapAnnotation(
        "Probability" = colData(sce.bins[, paste0("10_bin", 1:30)])$aggregated_probability_bins,
        simple_anno_size = unit(0.25, "cm"),
        col = list("Probability" = col_fun2), show_annotation_name = FALSE,
        show_legend = TRUE,
        annotation_legend_param = list(Probability = list(
            direction = "horizontal",
            title_position = "topcenter"
        ))
    )
)
print(heat.diff.exp)

For example, the expression of the gene Glycam1 (glycosylation-dependent cell adhesion molecule 1) or Fabp3 (fatty acid binding protein 3) are almost restricted to cluster 10, and they exhibit a gradient that correlates with both transient and steady-state cells.

# Look into Cited1 expression
genes <- c("Glycam1", "Fabp3")
cowplot::plot_grid(
    PlotExpression(sce,
        color.by = genes[1], point.size = 0.5,
        point.stroke = 0.5, scale.values = TRUE
    ),
    PlotExpression(sce.bins,
        color.by = genes[1], point.size = 1,
        point.stroke = 1, scale.values = TRUE
    ),
    PlotExpression(sce,
        color.by = genes[2], point.size = 0.5,
        point.stroke = 0.5, scale.values = TRUE
    ),
    PlotExpression(sce.bins,
        color.by = genes[2], point.size = 1,
        point.stroke = 1, scale.values = TRUE
    ),
    
    ncol = 2
)

0.11 R session

# R session
sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ensembldb_2.33.1            AnnotationFilter_1.33.0    
##  [3] GenomicFeatures_1.61.4      AnnotationDbi_1.71.0       
##  [5] ComplexHeatmap_2.25.2       Coralysis_0.99.10          
##  [7] scRNAseq_2.23.0             scater_1.37.0              
##  [9] ggplot2_3.5.2               scuttle_1.19.0             
## [11] SingleCellExperiment_1.31.1 SummarizedExperiment_1.39.1
## [13] Biobase_2.69.0              GenomicRanges_1.61.1       
## [15] Seqinfo_0.99.1              IRanges_2.43.0             
## [17] S4Vectors_0.47.0            BiocGenerics_0.55.0        
## [19] generics_0.1.4              MatrixGenerics_1.21.0      
## [21] matrixStats_1.5.0           dplyr_1.1.4                
## [23] BiocStyle_2.37.0           
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22         BiocIO_1.19.0            bitops_1.0-9            
##   [4] filelock_1.0.3           tibble_3.3.0             XML_3.99-0.18           
##   [7] lifecycle_1.0.4          httr2_1.2.0              aricode_1.0.3           
##  [10] edgeR_4.7.3              doParallel_1.0.17        lattice_0.22-7          
##  [13] alabaster.base_1.9.4     magrittr_2.0.3           limma_3.65.1            
##  [16] sass_0.4.10              rmarkdown_2.29           jquerylib_0.1.4         
##  [19] yaml_2.3.10              metapod_1.17.0           cowplot_1.2.0           
##  [22] LiblineaR_2.10-24        DBI_1.2.3                RColorBrewer_1.1-3      
##  [25] abind_1.4-8              purrr_1.1.0              RCurl_1.98-1.17         
##  [28] rappdirs_0.3.3           circlize_0.4.16          ggrepel_0.9.6           
##  [31] irlba_2.3.5.1            alabaster.sce_1.9.0      pheatmap_1.0.13         
##  [34] RSpectra_0.16-2          dqrng_0.4.1              codetools_0.2-20        
##  [37] DelayedArray_0.35.2      shape_1.4.6.1            tidyselect_1.2.1        
##  [40] UCSC.utils_1.5.0         farver_2.1.2             ScaledMatrix_1.17.0     
##  [43] viridis_0.6.5            BiocFileCache_2.99.5     GenomicAlignments_1.45.1
##  [46] jsonlite_2.0.0           GetoptLong_1.0.5         BiocNeighbors_2.3.1     
##  [49] iterators_1.0.14         foreach_1.5.2            tools_4.5.1             
##  [52] Rcpp_1.1.0               glue_1.8.0               gridExtra_2.3           
##  [55] SparseArray_1.9.0        xfun_0.52                flexclust_1.5.0         
##  [58] GenomeInfoDb_1.45.7      HDF5Array_1.37.0         gypsum_1.5.0            
##  [61] withr_3.0.2              BiocManager_1.30.26      fastmap_1.2.0           
##  [64] rhdf5filters_1.21.0      bluster_1.19.0           SparseM_1.84-2          
##  [67] digest_0.6.37            rsvd_1.0.5               R6_2.6.1                
##  [70] colorspace_2.1-1         Cairo_1.6-2              dichromat_2.0-0.1       
##  [73] RSQLite_2.4.1            h5mread_1.1.1            rtracklayer_1.69.1      
##  [76] class_7.3-23             httr_1.4.7               S4Arrays_1.9.1          
##  [79] uwot_0.2.3               pkgconfig_2.0.3          gtable_0.3.6            
##  [82] modeltools_0.2-24        blob_1.2.4               XVector_0.49.0          
##  [85] htmltools_0.5.8.1        bookdown_0.43            clue_0.3-66             
##  [88] ProtGenerics_1.41.0      scales_1.4.0             alabaster.matrix_1.9.0  
##  [91] png_0.1-8                scran_1.37.0             knitr_1.50              
##  [94] reshape2_1.4.4           rjson_0.2.23             curl_6.4.0              
##  [97] GlobalOptions_0.1.2      cachem_1.1.0             rhdf5_2.53.1            
## [100] stringr_1.5.1            BiocVersion_3.22.0       parallel_4.5.1          
## [103] vipor_0.4.7              ggrastr_1.0.2            restfulr_0.0.16         
## [106] pillar_1.11.0            alabaster.schemas_1.9.0  vctrs_0.6.5             
## [109] RANN_2.6.2               BiocSingular_1.25.0      dbplyr_2.5.0            
## [112] beachmat_2.25.2          cluster_2.1.8.1          beeswarm_0.4.0          
## [115] evaluate_1.0.4           magick_2.8.7             tinytex_0.57            
## [118] cli_3.6.5                locfit_1.5-9.12          compiler_4.5.1          
## [121] Rsamtools_2.25.1         rlang_1.1.6              crayon_1.5.3            
## [124] labeling_0.4.3           plyr_1.8.9               ggbeeswarm_0.7.2        
## [127] stringi_1.8.7            viridisLite_0.4.2        alabaster.se_1.9.0      
## [130] BiocParallel_1.43.4      Biostrings_2.77.2        lazyeval_0.2.2          
## [133] Matrix_1.7-3             ExperimentHub_2.99.5     sparseMatrixStats_1.21.0
## [136] bit64_4.6.0-1            Rhdf5lib_1.31.0          KEGGREST_1.49.1         
## [139] statmod_1.5.0            alabaster.ranges_1.9.0   AnnotationHub_3.99.6    
## [142] igraph_2.1.4             memoise_2.0.1            bslib_0.9.0             
## [145] bit_4.6.0

0.12 References

Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. https://www.nature.com/articles/s41592-019-0654-x.

Bach K, Pensa S, Grzelak M, Hadfield J, Adams DJ, Marioni JC, Khaled WT (2017). “Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing.” Nature communications, 8, 1:1-11. doi:10.1038/s41467-017-02001-5

Gu, Z. (2022) “Complex heatmap visualization.” iMeta, 1(3):e43. doi: 10.1002/imt2.43.

Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res., 5, 2122. doi:10.12688/f1000research.9501.2.

McCarthy DJ, Campbell KR, Lun ATL, Willis QF (2017). “Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R.” Bioinformatics, 33, 1179-1186. doi:10.1093/bioinformatics/btw777.

Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023

Wickham H (2016). “ggplot2: Elegant Graphics for Data Analysis.” Springer-Verlag New York.