The single-cell RNA sequencing dataset of mammary epithelial cells published by Bach et al., (2017) will be used to demonstrated how cell cluster probability can be used to identify cell states and associated differential gene expression programs.
This dataset was imported as a SingleCellExperiment
object via the scRNAseq
R
package using the function BachMammaryData()
. Additionally, the dataset was normalized with the scater
function logNormCounts()
, and then subset to 2,000 highly variable genes with the scran
function getTopHVGs()
.
# Import packages
library("dplyr")
library("scater")
library("ggplot2")
library("scRNAseq")
library("Coralysis")
library("ComplexHeatmap")
library("SingleCellExperiment")
# Import object
sce <- BachMammaryData()
colnames(sce) <- paste0("cell", 1:ncol(sce)) # create cell names
Coralysis
requires log-normalised data as input. scran
normalization can be performed, which is particularly beneficial if rare cell types exist (see the following vignette).
## Normalize the data
set.seed(123)
sce <- scater::logNormCounts(sce)
counts(sce) <- NULL
Highly variable genes (HVG) can be selected using the R package scran
.
# Feature selection with 'scran' package
nhvg <- 500
hvg <- scran::getTopHVGs(sce, n = nhvg)
hvg.idx <- which(row.names(sce) %in% hvg)
sce <- sce[hvg.idx, ]
row.names(sce) <- rowData(sce)$Symbol
Multi-level integration can be performed with Coralysis
by running the function RunParallelDivisiveICP()
. Since there is not a batch, the function RunParallelDivisiveICP()
is run with batch.label=NULL
. The RunParallelDivisiveICP()
function can be run in parallel by providing the number of threads
to reduce the run time. Consult the full documentation of this function with ?RunParallelDivisiveICP
. For this example, the algorithm was run 10 times (L = 25
), but generally, this number should be higher (with the default being L = 50
). A train set can be built to improve the modeling step in Coralysis
(by default, build.train.set=TRUE
), with the respective named list
of parameters provided to build.train.params
. In this case, no training set was built; instead, the entire dataset was downsampled to 1,000 cells before each ICP run (icp.batch.size=1000
).
# Perform multi-level integration
set.seed(123)
sce <- RunParallelDivisiveICP(object = sce, L = 25,
icp.batch.size = 1000,
build.train.set = FALSE,
threads = 2)
## WARNING: Setting 'divisive.method' to 'cluster' as 'batch.label=NULL'.
## If 'batch.label=NULL', 'divisive.method' can be one of: 'cluster', 'random'.
##
## Computing cluster seed.
##
## Initializing divisive ICP clustering...
##
|
| | 0%
|
|=== | 4%
|
|====== | 8%
|
|======== | 12%
|
|=========== | 16%
|
|============== | 20%
|
|================= | 24%
|
|==================== | 28%
|
|====================== | 32%
|
|========================= | 36%
|
|============================ | 40%
|
|=============================== | 44%
|
|================================== | 48%
|
|==================================== | 52%
|
|======================================= | 56%
|
|========================================== | 60%
|
|============================================= | 64%
|
|================================================ | 68%
|
|================================================== | 72%
|
|===================================================== | 76%
|
|======================================================== | 80%
|
|=========================================================== | 84%
|
|============================================================== | 88%
|
|================================================================ | 92%
|
|=================================================================== | 96%
|
|======================================================================| 100%
##
## Divisive ICP clustering completed successfully.
##
## Predicting cell cluster probabilities using ICP models...
## Prediction of cell cluster probabilities completed successfully.
##
## Multi-level integration completed successfully.
Coralysis
returns a list of cell cluster probabilities saved at metadata(sce)$coralysis$joint.probability
of length equal to the number of icp runs, i.e., L
(by default L=20
), times the number of icp rounds, i.e., log2(k
) (by default k=16
, thus 4 rounds of divisive icp).
Coralysis
returns a list of cell cluster probabilities saved at metadata(sce)$coralysis$joint.probability
of length equal to the number of icp runs, i.e., L
(by default L=20
), times the number of icp rounds, i.e., log2(k
) (by default k=16
, thus 4 rounds of divisive icp). The cell cluster probability can be concatenated to compute a PCA in order to obtain an integrated embedding. By default only the cell cluster probability corresponding to the last icp round is used. The Coralysis
integrated embedding can be used downstream for non-linear dimensional reduction, t-SNE or UMAP, and clustering. Set to TRUE
the return.model
argument to allow reference mapping later.
# Dimensional reduction - unintegrated
set.seed(123)
sce <- RunPCA(
object = sce, assay.name = "joint.probability",
return.model = TRUE
)
## Divisive ICP: selecting ICP tables multiple of 4
## Setting 'pca.method' to 'stats' as 'return.model' is TRUE.
## Set 'return.model' to FALSE to use 'method' 'irlba'.
# UMAP
set.seed(123)
sce <- RunUMAP(
object = sce, umap.method = "uwot",
dims = 1:30, n_neighbors = 15,
min_dist = 0.5, return.model = TRUE
)
Check the difference between basal (expressing Krt5) and luminal (expressing Krt18) epithelial cells.
# Detect basal versus luminal cells
markers <- c("Krt5", "Krt18")
plots <- lapply(markers, function(x) {
PlotExpression(sce,
color.by = x, point.size = 0.2,
point.stroke = 0.2, scale.values = TRUE
)
})
cowplot::plot_grid(plotlist = plots, ncol = 2)
The cell cluster probability aggregated by mean or median across the icp runs can be obtained with the function SummariseCellClusterProbability()
and plotted to infer transient and steady cell states.
# Summarise cell cluster probability
sce <- SummariseCellClusterProbability(object = sce, icp.round = 4) # save result in 'colData'
# colData(sce) # check the colData
It is clear that the largest basal cell population has the lowest probability compared with the luminal cell populations.
# Plot cell cluster probabilities - mean
# possible options: "mean_probs", "median_probs", "scaled_median_probs"
PlotExpression(
object = sce, color.by = "scaled_mean_probs",
color.scale = "viridis", point.size = 0.2,
point.stroke = 0.1, legend.title = "Mean prob.\n(min-max)"
)
The ICP clusters corresponding to the ICP probability table with the highest standard deviation (i.e., 14) for the last clustering round (i.e., 4) was obtained below ("icp_run_round_14_4_clusters"
) and the respective clusters highlighted in UMAP plot.
# ICP clusters: identify the ICP probability table with the highest standard deviation
probs <- GetCellClusterProbability(object = sce, icp.round = 4, concatenate = FALSE)
probs.sd <- lapply(X = probs, FUN = function(x) {
sd(x)
})
icp.run <- which.max(probs.sd) # 7
clt <- paste0("icp_run_round_", icp.run, "_4_clusters")
sce[[clt]] <- factor(sce[[clt]], levels = as.character(1:16))
PlotDimRed(sce,
color.by = clt, point.size = 0.1,
point.stroke = 0.1, legend.nrow = 5
)
The gene coefficients for the above ICP clusters corresponding to the ICP run no. 14 and clustering round 4 can be retrieved with the function GetFeatureCoefficients()
. The positive coefficients at least for one of the clusters were represented in the heatmap plot below.
# Get gene coefficients
gene.coeff <- GetFeatureCoefficients(sce, icp.run = icp.run, icp.round = 4)
row.names(gene.coeff$icp_56) <- gene.coeff$icp_56$feature
gene.coeff$icp_56 <- gene.coeff$icp_56[, -1]
# Plot top positive coefficients in heatmap
positive.coeff <- (rowSums(gene.coeff$icp_56 > 0) > 0)
heat <- ComplexHeatmap::Heatmap(
matrix = as.matrix(gene.coeff$icp_56)[positive.coeff, ],
name = "Gene coef.",
cluster_columns = FALSE, show_row_dend = FALSE, show_row_names = TRUE,
row_names_gp = grid::gpar(fontsize = 7)
)
print(heat)
One of the largest population of luminal cells is constituted by three ICP clusters (3, 11, 12), one could check below the main coefficients for each one of these clusters.
# Check top gene coefficients per basal cluster: 3, 11, 12
gene.coeff.basal <- gene.coeff$icp_56[positive.coeff, ] %>%
mutate("gene" = row.names(.))
top5.luminal <- top5.luminal.plts <- list()
for (i in as.character(c(3, 11, 12))) {
clt.no <- paste0("clt", i)
top5.luminal[[clt.no]] <- gene.coeff.basal %>%
dplyr::select(all_of(c("gene", paste0("coeff_", clt.no)))) %>%
arrange(desc(.data[[paste0("coeff_", clt.no)]])) %>%
head(5) %>%
pull(gene)
top5.luminal.plts[[clt.no]] <- cowplot::plot_grid(plotlist = lapply(top5.luminal[[clt.no]], function(x) {
PlotExpression(sce,
color.by = x, point.size = 0.1, point.stroke = 0.1,
scale.values = TRUE
) + ggplot2::ggtitle(gsub("clt", "cluster: ", clt.no))
}), ncol = 5, align = "vh")
}
cowplot::plot_grid(plotlist = top5.luminal.plts, nrow = 3, align = "vh")
Graph-based clustering can be obtained by running the scran
function clusterCells()
using the Coralysis
integrated embedding. The community detection algorithm used was Louvain with a resolution value of 0.8.
### Graph-based clustering with scran
## Coralysis integrated PCA embedding
bluster.params <- bluster::SNNGraphParam(
k = 30, cluster.fun = "louvain",
cluster.args = list(resolution = 0.8)
)
set.seed(1024)
sce$Cluster <- scran::clusterCells(sce,
use.dimred = "PCA",
BLUSPARAM = bluster.params
)
PlotDimRed(sce,
color.by = "Cluster", point.size = 0.2,
point.stroke = 0.2, legend.nrow = 4
)
Cell cluster probability bins per cluster label
can be obtained by running BinCellClusterProbability()
, which returns a SingleCellExperiment
object with logcounts
average per label
per cell probability bin. The number of probability bins can be provided to the parameter bins
. Below it was used as label
the graph-based clustering obtained above and bins
set to 30. In addition, the Coralysis
probability bins were projected onto single cell UMAP. See the plots below.
# cell states SCE object
sce.bins <- BinCellClusterProbability(sce, label = "Cluster", bins = 30)
# Project Coralysis bins onto single-cell UMAP
sce.bins <- ReferenceMapping(sce, sce.bins, ref.label = "Cluster", project.umap = TRUE)
umap.bins.labels <- PlotDimRed(sce.bins, color.by = "coral_labels")
umap.bins.probs <- PlotExpression(sce.bins,
color.by = "aggregated_probability_bins",
color.scale = "viridis",
legend.title = "Prob. bins"
)
cowplot::plot_grid(umap.bins.labels, umap.bins.probs, ncol = 2, align = "vh")
# Coralysis labels: single cells vs bins
cowplot::plot_grid(
PlotDimRed(sce,
color.by = "Cluster", point.size = 0.2,
point.stroke = 0.2, legend.nrow = 2
),
umap.bins.labels
)
# Coralysis probability: single cells vs bins
cowplot::plot_grid(
PlotExpression(
object = sce, color.by = "scaled_mean_probs",
color.scale = "viridis", point.size = 0.2,
point.stroke = 0.1, legend.title = "Mean prob.\n(min-max)"
),
umap.bins.probs
)
The correlation between cell cluster probability bins from cluster 10 and averaged gene expression can be obtained with the function CellBinsFeatureCorrelation()
. The 30 genes with the highest absolute Pearson correlation were represented in the heatmap below.
# Differential expression programs
corr.features <- CellBinsFeatureCorrelation(object = sce.bins, labels = "10")
top30.corr.clt10 <- corr.features %>%
arrange(desc(abs(`10`))) %>%
head(30)
# Heatmap of top 30 genes with the highest absolute Pearson correlation
pick.genes <- row.names(top30.corr.clt10)
mtx <- as.matrix(logcounts(sce.bins[pick.genes, paste0("10_bin", 1:30)]))
mtx <- t(scale(t(mtx)))
col_fun2 <- circlize::colorRamp2(c(0.8, 0.9, 1.0), scales::viridis_pal()(3))
heat.diff.exp <- Heatmap(
matrix = mtx, name = "Row Z-score\n(normalized data)",
cluster_rows = TRUE, cluster_columns = FALSE,
show_column_dend = FALSE, show_row_dend = FALSE,
use_raster = TRUE, show_column_names = TRUE,
top_annotation = HeatmapAnnotation(
"Probability" = colData(sce.bins[, paste0("10_bin", 1:30)])$aggregated_probability_bins,
simple_anno_size = unit(0.25, "cm"),
col = list("Probability" = col_fun2), show_annotation_name = FALSE,
show_legend = TRUE,
annotation_legend_param = list(Probability = list(
direction = "horizontal",
title_position = "topcenter"
))
)
)
print(heat.diff.exp)
For example, the expression of the gene Glycam1 (glycosylation-dependent cell adhesion molecule 1) or Fabp3 (fatty acid binding protein 3) are almost restricted to cluster 10, and they exhibit a gradient that correlates with both transient and steady-state cells.
# Look into Cited1 expression
genes <- c("Glycam1", "Fabp3")
cowplot::plot_grid(
PlotExpression(sce,
color.by = genes[1], point.size = 0.5,
point.stroke = 0.5, scale.values = TRUE
),
PlotExpression(sce.bins,
color.by = genes[1], point.size = 1,
point.stroke = 1, scale.values = TRUE
),
PlotExpression(sce,
color.by = genes[2], point.size = 0.5,
point.stroke = 0.5, scale.values = TRUE
),
PlotExpression(sce.bins,
color.by = genes[2], point.size = 1,
point.stroke = 1, scale.values = TRUE
),
ncol = 2
)
# R session
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ensembldb_2.33.0 AnnotationFilter_1.33.0
## [3] GenomicFeatures_1.61.0 AnnotationDbi_1.71.0
## [5] ComplexHeatmap_2.25.0 Coralysis_0.99.6
## [7] scRNAseq_2.23.0 scater_1.37.0
## [9] ggplot2_3.5.2 scuttle_1.19.0
## [11] SingleCellExperiment_1.31.0 SummarizedExperiment_1.39.0
## [13] Biobase_2.69.0 GenomicRanges_1.61.0
## [15] GenomeInfoDb_1.45.3 IRanges_2.43.0
## [17] S4Vectors_0.47.0 BiocGenerics_0.55.0
## [19] generics_0.1.3 MatrixGenerics_1.21.0
## [21] matrixStats_1.5.0 dplyr_1.1.4
## [23] BiocStyle_2.37.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 BiocIO_1.19.0 bitops_1.0-9
## [4] filelock_1.0.3 tibble_3.2.1 XML_3.99-0.18
## [7] lifecycle_1.0.4 httr2_1.1.2 aricode_1.0.3
## [10] edgeR_4.7.2 doParallel_1.0.17 lattice_0.22-7
## [13] alabaster.base_1.9.0 magrittr_2.0.3 limma_3.65.0
## [16] sass_0.4.10 rmarkdown_2.29 jquerylib_0.1.4
## [19] yaml_2.3.10 metapod_1.17.0 cowplot_1.1.3
## [22] LiblineaR_2.10-24 DBI_1.2.3 RColorBrewer_1.1-3
## [25] abind_1.4-8 purrr_1.0.4 RCurl_1.98-1.17
## [28] rappdirs_0.3.3 circlize_0.4.16 ggrepel_0.9.6
## [31] irlba_2.3.5.1 alabaster.sce_1.9.0 pheatmap_1.0.12
## [34] RSpectra_0.16-2 dqrng_0.4.1 codetools_0.2-20
## [37] DelayedArray_0.35.1 shape_1.4.6.1 tidyselect_1.2.1
## [40] UCSC.utils_1.5.0 farver_2.1.2 ScaledMatrix_1.17.0
## [43] viridis_0.6.5 BiocFileCache_2.99.0 GenomicAlignments_1.45.0
## [46] jsonlite_2.0.0 GetoptLong_1.0.5 BiocNeighbors_2.3.0
## [49] iterators_1.0.14 foreach_1.5.2 tools_4.5.0
## [52] Rcpp_1.0.14 glue_1.8.0 gridExtra_2.3
## [55] SparseArray_1.9.0 xfun_0.52 flexclust_1.5.0
## [58] HDF5Array_1.37.0 gypsum_1.5.0 withr_3.0.2
## [61] BiocManager_1.30.25 fastmap_1.2.0 rhdf5filters_1.21.0
## [64] bluster_1.19.0 SparseM_1.84-2 digest_0.6.37
## [67] rsvd_1.0.5 R6_2.6.1 colorspace_2.1-1
## [70] Cairo_1.6-2 dichromat_2.0-0.1 RSQLite_2.3.11
## [73] h5mread_1.1.0 rtracklayer_1.69.0 class_7.3-23
## [76] httr_1.4.7 S4Arrays_1.9.0 uwot_0.2.3
## [79] pkgconfig_2.0.3 gtable_0.3.6 modeltools_0.2-24
## [82] blob_1.2.4 XVector_0.49.0 htmltools_0.5.8.1
## [85] bookdown_0.43 clue_0.3-66 ProtGenerics_1.41.0
## [88] scales_1.4.0 alabaster.matrix_1.9.0 png_0.1-8
## [91] scran_1.37.0 knitr_1.50 reshape2_1.4.4
## [94] rjson_0.2.23 curl_6.2.2 GlobalOptions_0.1.2
## [97] cachem_1.1.0 rhdf5_2.53.0 stringr_1.5.1
## [100] BiocVersion_3.22.0 parallel_4.5.0 vipor_0.4.7
## [103] ggrastr_1.0.2 restfulr_0.0.15 pillar_1.10.2
## [106] alabaster.schemas_1.9.0 vctrs_0.6.5 RANN_2.6.2
## [109] BiocSingular_1.25.0 dbplyr_2.5.0 beachmat_2.25.0
## [112] cluster_2.1.8.1 beeswarm_0.4.0 evaluate_1.0.3
## [115] magick_2.8.6 tinytex_0.57 cli_3.6.5
## [118] locfit_1.5-9.12 compiler_4.5.0 Rsamtools_2.25.0
## [121] rlang_1.1.6 crayon_1.5.3 labeling_0.4.3
## [124] plyr_1.8.9 ggbeeswarm_0.7.2 stringi_1.8.7
## [127] viridisLite_0.4.2 alabaster.se_1.9.0 BiocParallel_1.43.0
## [130] Biostrings_2.77.0 lazyeval_0.2.2 Matrix_1.7-3
## [133] ExperimentHub_2.99.0 sparseMatrixStats_1.21.0 bit64_4.6.0-1
## [136] Rhdf5lib_1.31.0 KEGGREST_1.49.0 statmod_1.5.0
## [139] alabaster.ranges_1.9.0 AnnotationHub_3.99.1 igraph_2.1.4
## [142] memoise_2.0.1 bslib_0.9.0 bit_4.6.0
Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. https://www.nature.com/articles/s41592-019-0654-x.
Bach K, Pensa S, Grzelak M, Hadfield J, Adams DJ, Marioni JC, Khaled WT (2017). “Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing.” Nature communications, 8, 1:1-11. doi:10.1038/s41467-017-02001-5
Gu, Z. (2022) “Complex heatmap visualization.” iMeta, 1(3):e43. doi: 10.1002/imt2.43.
Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res., 5, 2122. doi:10.12688/f1000research.9501.2.
McCarthy DJ, Campbell KR, Lun ATL, Willis QF (2017). “Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R.” Bioinformatics, 33, 1179-1186. doi:10.1093/bioinformatics/btw777.
Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023
Wickham H (2016). “ggplot2: Elegant Graphics for Data Analysis.” Springer-Verlag New York.