The single-cell RNA sequencing dataset of mammary epithelial cells published by Bach et al., (2017) will be used to demonstrated how cell cluster probability can be used to identify cell states and associated differential gene expression programs.
This dataset was imported as a SingleCellExperiment
object via the scRNAseq R package using the
function BachMammaryData(). Additionally, the dataset was
normalized with the scater function
logNormCounts(), and then subset to 2,000 highly variable
genes with the scran function
getTopHVGs().
# Import packages
library("dplyr")
library("scater")
library("ggplot2")
library("scRNAseq")
library("Coralysis")
library("ComplexHeatmap")
library("SingleCellExperiment")Coralysis requires log-normalised data as input.
scran normalization can be performed, which is particularly
beneficial if rare cell types exist (see the following vignette).
Highly variable genes (HVG) can be selected using the R package
scran.
# Feature selection with 'scran' package
nhvg <- 500
hvg <- scran::getTopHVGs(sce, n = nhvg)
hvg.idx <- which(row.names(sce) %in% hvg)
sce <- sce[hvg.idx, ]
row.names(sce) <- rowData(sce)$SymbolMulti-level integration can be performed with Coralysis
by running the function RunParallelDivisiveICP(). Since
there is not a batch, the function RunParallelDivisiveICP()
is run with batch.label=NULL. The
RunParallelDivisiveICP() function can be run in parallel by
providing the number of threads to reduce the run time.
Consult the full documentation of this function with
?RunParallelDivisiveICP. For this example, the algorithm
was run 10 times (L = 25), but generally, this number
should be higher (with the default being L = 50). A train
set can be built to improve the modeling step in Coralysis
(by default, build.train.set=TRUE), with the respective
named list of parameters provided to
build.train.params. In this case, no training set was
built; instead, the entire dataset was downsampled to 1,000 cells before
each ICP run (icp.batch.size=1000).
# Perform multi-level integration
set.seed(123)
sce <- RunParallelDivisiveICP(object = sce, L = 25,
icp.batch.size = 1000,
build.train.set = FALSE,
threads = 2)## WARNING: Setting 'divisive.method' to 'cluster' as 'batch.label=NULL'.
## If 'batch.label=NULL', 'divisive.method' can be one of: 'cluster', 'random'.
##
## Computing cluster seed.
##
## Initializing divisive ICP clustering...
## | | | 0% | |=== | 4% | |====== | 8% | |======== | 12% | |=========== | 16% | |============== | 20% | |================= | 24% | |==================== | 28% | |====================== | 32% | |========================= | 36% | |============================ | 40% | |=============================== | 44% | |================================== | 48% | |==================================== | 52% | |======================================= | 56% | |========================================== | 60% | |============================================= | 64% | |================================================ | 68% | |================================================== | 72% | |===================================================== | 76% | |======================================================== | 80% | |=========================================================== | 84% | |============================================================== | 88% | |================================================================ | 92% | |=================================================================== | 96% | |======================================================================| 100%
##
## Divisive ICP clustering completed successfully.
##
## Predicting cell cluster probabilities using ICP models...
## Prediction of cell cluster probabilities completed successfully.
##
## Multi-level integration completed successfully.
Coralysis returns a list of cell cluster probabilities
saved at metadata(sce)$coralysis$joint.probability of
length equal to the number of icp runs, i.e., L (by default
L=20), times the number of icp rounds, i.e.,
log2(k) (by default k=16, thus 4 rounds of
divisive icp).
Coralysis returns a list of cell cluster probabilities
saved at metadata(sce)$coralysis$joint.probability of
length equal to the number of icp runs, i.e., L (by default
L=20), times the number of icp rounds, i.e.,
log2(k) (by default k=16, thus 4 rounds of
divisive icp). The cell cluster probability can be concatenated to
compute a PCA in order to obtain an integrated embedding. By default
only the cell cluster probability corresponding to the last icp round is
used. The Coralysis integrated embedding can be used
downstream for non-linear dimensional reduction, t-SNE or UMAP,
and clustering. Set to TRUE the return.model
argument to allow reference mapping later.
# Dimensional reduction - unintegrated
set.seed(123)
sce <- RunPCA(
object = sce, assay.name = "joint.probability",
return.model = TRUE
)## Divisive ICP: selecting ICP tables multiple of 4
## Setting 'pca.method' to 'stats' as 'return.model' is TRUE.
## Set 'return.model' to FALSE to use 'method' 'irlba'.
# UMAP
set.seed(123)
sce <- RunUMAP(
object = sce, umap.method = "uwot",
dims = 1:30, n_neighbors = 15,
min_dist = 0.5, return.model = TRUE
)Check the difference between basal (expressing Krt5) and luminal (expressing Krt18) epithelial cells.
# Detect basal versus luminal cells
markers <- c("Krt5", "Krt18")
plots <- lapply(markers, function(x) {
PlotExpression(sce,
color.by = x, point.size = 0.2,
point.stroke = 0.2, scale.values = TRUE
)
})
cowplot::plot_grid(plotlist = plots, ncol = 2)The cell cluster probability aggregated by mean or median across the
icp runs can be obtained with the function
SummariseCellClusterProbability() and plotted to infer
transient and steady cell states.
# Summarise cell cluster probability
sce <- SummariseCellClusterProbability(object = sce, icp.round = 4) # save result in 'colData'
# colData(sce) # check the colDataIt is clear that the largest basal cell population has the lowest probability compared with the luminal cell populations.
# Plot cell cluster probabilities - mean
# possible options: "mean_probs", "median_probs", "scaled_median_probs"
PlotExpression(
object = sce, color.by = "scaled_mean_probs",
color.scale = "viridis", point.size = 0.2,
point.stroke = 0.1, legend.title = "Mean prob.\n(min-max)"
)The ICP clusters corresponding to the ICP probability table with the
highest standard deviation (i.e., 14) for the last clustering round
(i.e., 4) was obtained below
("icp_run_round_14_4_clusters") and the respective clusters
highlighted in UMAP plot.
# ICP clusters: identify the ICP probability table with the highest standard deviation
probs <- GetCellClusterProbability(object = sce, icp.round = 4, concatenate = FALSE)
probs.sd <- lapply(X = probs, FUN = function(x) {
sd(x)
})
icp.run <- which.max(probs.sd) # 7
clt <- paste0("icp_run_round_", icp.run, "_4_clusters")
sce[[clt]] <- factor(sce[[clt]], levels = as.character(1:16))
PlotDimRed(sce,
color.by = clt, point.size = 0.1,
point.stroke = 0.1, legend.nrow = 5
)## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
The gene coefficients for the above ICP clusters corresponding to the
ICP run no. 14 and clustering round 4 can be retrieved with the function
GetFeatureCoefficients(). The positive coefficients at
least for one of the clusters were represented in the heatmap plot
below.
# Get gene coefficients
gene.coeff <- GetFeatureCoefficients(sce, icp.run = icp.run, icp.round = 4)
row.names(gene.coeff$icp_56) <- gene.coeff$icp_56$feature
gene.coeff$icp_56 <- gene.coeff$icp_56[, -1]
# Plot top positive coefficients in heatmap
positive.coeff <- (rowSums(gene.coeff$icp_56 > 0) > 0)
heat <- ComplexHeatmap::Heatmap(
matrix = as.matrix(gene.coeff$icp_56)[positive.coeff, ],
name = "Gene coef.",
cluster_columns = FALSE, show_row_dend = FALSE, show_row_names = TRUE,
row_names_gp = grid::gpar(fontsize = 7)
)
print(heat)One of the largest population of luminal cells is constituted by three ICP clusters (3, 11, 12), one could check below the main coefficients for each one of these clusters.
# Check top gene coefficients per basal cluster: 3, 11, 12
gene.coeff.basal <- gene.coeff$icp_56[positive.coeff, ] %>%
mutate("gene" = row.names(.))
top5.luminal <- top5.luminal.plts <- list()
for (i in as.character(c(3, 11, 12))) {
clt.no <- paste0("clt", i)
top5.luminal[[clt.no]] <- gene.coeff.basal %>%
dplyr::select(all_of(c("gene", paste0("coeff_", clt.no)))) %>%
arrange(desc(.data[[paste0("coeff_", clt.no)]])) %>%
head(5) %>%
pull(gene)
top5.luminal.plts[[clt.no]] <- cowplot::plot_grid(plotlist = lapply(top5.luminal[[clt.no]], function(x) {
PlotExpression(sce,
color.by = x, point.size = 0.1, point.stroke = 0.1,
scale.values = TRUE
) + ggplot2::ggtitle(gsub("clt", "cluster: ", clt.no))
}), ncol = 5, align = "vh")
}
cowplot::plot_grid(plotlist = top5.luminal.plts, nrow = 3, align = "vh")Graph-based clustering can be obtained by running the
scran function clusterCells() using the
Coralysis integrated embedding. The community detection
algorithm used was Louvain with a resolution value of 0.8.
### Graph-based clustering with scran
## Coralysis integrated PCA embedding
bluster.params <- bluster::SNNGraphParam(
k = 30, cluster.fun = "louvain",
cluster.args = list(resolution = 0.8)
)
set.seed(1024)
sce$Cluster <- scran::clusterCells(sce,
use.dimred = "PCA",
BLUSPARAM = bluster.params
)
PlotDimRed(sce,
color.by = "Cluster", point.size = 0.2,
point.stroke = 0.2, legend.nrow = 4
)## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
Cell cluster probability bins per cluster label can be
obtained by running BinCellClusterProbability(), which
returns a SingleCellExperiment object with
logcounts average per label per cell
probability bin. The number of probability bins can be provided to the
parameter bins. Below it was used as label the
graph-based clustering obtained above and bins set to 30.
In addition, the Coralysis probability bins were projected
onto single cell UMAP. See the plots below.
# cell states SCE object
sce.bins <- BinCellClusterProbability(sce, label = "Cluster", bins = 30)
# Project Coralysis bins onto single-cell UMAP
sce.bins <- ReferenceMapping(sce, sce.bins, ref.label = "Cluster", project.umap = TRUE)
umap.bins.labels <- PlotDimRed(sce.bins, color.by = "coral_labels")## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
umap.bins.probs <- PlotExpression(sce.bins,
color.by = "aggregated_probability_bins",
color.scale = "viridis",
legend.title = "Prob. bins"
)
cowplot::plot_grid(umap.bins.labels, umap.bins.probs, ncol = 2, align = "vh")# Coralysis labels: single cells vs bins
cowplot::plot_grid(
PlotDimRed(sce,
color.by = "Cluster", point.size = 0.2,
point.stroke = 0.2, legend.nrow = 2
),
umap.bins.labels
)## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
# Coralysis probability: single cells vs bins
cowplot::plot_grid(
PlotExpression(
object = sce, color.by = "scaled_mean_probs",
color.scale = "viridis", point.size = 0.2,
point.stroke = 0.1, legend.title = "Mean prob.\n(min-max)"
),
umap.bins.probs
)The correlation between cell cluster probability bins from cluster 10
and averaged gene expression can be obtained with the function
CellBinsFeatureCorrelation(). The 30 genes with the highest
absolute Pearson correlation were represented in the heatmap below.
# Differential expression programs
corr.features <- CellBinsFeatureCorrelation(object = sce.bins, labels = "10")
top30.corr.clt10 <- corr.features %>%
arrange(desc(abs(`10`))) %>%
head(30)
# Heatmap of top 30 genes with the highest absolute Pearson correlation
pick.genes <- row.names(top30.corr.clt10)
mtx <- as.matrix(logcounts(sce.bins[pick.genes, paste0("10_bin", 1:30)]))
mtx <- t(scale(t(mtx)))
col_fun2 <- circlize::colorRamp2(c(0.8, 0.9, 1.0), scales::viridis_pal()(3))
heat.diff.exp <- Heatmap(
matrix = mtx, name = "Row Z-score\n(normalized data)",
cluster_rows = TRUE, cluster_columns = FALSE,
show_column_dend = FALSE, show_row_dend = FALSE,
use_raster = TRUE, show_column_names = TRUE,
top_annotation = HeatmapAnnotation(
"Probability" = colData(sce.bins[, paste0("10_bin", 1:30)])$aggregated_probability_bins,
simple_anno_size = unit(0.25, "cm"),
col = list("Probability" = col_fun2), show_annotation_name = FALSE,
show_legend = TRUE,
annotation_legend_param = list(Probability = list(
direction = "horizontal",
title_position = "topcenter"
))
)
)
print(heat.diff.exp)For example, the expression of the gene Glycam1 (glycosylation-dependent cell adhesion molecule 1) or Fabp3 (fatty acid binding protein 3) are almost restricted to cluster 10, and they exhibit a gradient that correlates with both transient and steady-state cells.
# Look into Cited1 expression
genes <- c("Glycam1", "Fabp3")
cowplot::plot_grid(
PlotExpression(sce,
color.by = genes[1], point.size = 0.5,
point.stroke = 0.5, scale.values = TRUE
),
PlotExpression(sce.bins,
color.by = genes[1], point.size = 1,
point.stroke = 1, scale.values = TRUE
),
PlotExpression(sce,
color.by = genes[2], point.size = 0.5,
point.stroke = 0.5, scale.values = TRUE
),
PlotExpression(sce.bins,
color.by = genes[2], point.size = 1,
point.stroke = 1, scale.values = TRUE
),
ncol = 2
)## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ensembldb_2.35.0 AnnotationFilter_1.35.0
## [3] GenomicFeatures_1.63.1 AnnotationDbi_1.73.0
## [5] ComplexHeatmap_2.27.0 Coralysis_1.1.0
## [7] scRNAseq_2.25.0 scater_1.39.1
## [9] ggplot2_4.0.1 scuttle_1.21.0
## [11] SingleCellExperiment_1.33.0 SummarizedExperiment_1.41.0
## [13] Biobase_2.71.0 GenomicRanges_1.63.1
## [15] Seqinfo_1.1.0 IRanges_2.45.0
## [17] S4Vectors_0.49.0 BiocGenerics_0.57.0
## [19] generics_0.1.4 MatrixGenerics_1.23.0
## [21] matrixStats_1.5.0 dplyr_1.1.4
## [23] BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 BiocIO_1.21.0 bitops_1.0-9
## [4] filelock_1.0.3 tibble_3.3.0 XML_3.99-0.20
## [7] lifecycle_1.0.5 httr2_1.2.2 aricode_1.0.3
## [10] edgeR_4.9.2 doParallel_1.0.17 lattice_0.22-7
## [13] alabaster.base_1.11.1 magrittr_2.0.4 limma_3.67.0
## [16] sass_0.4.10 rmarkdown_2.30 jquerylib_0.1.4
## [19] yaml_2.3.12 metapod_1.19.1 cowplot_1.2.0
## [22] LiblineaR_2.10-24 DBI_1.2.3 buildtools_1.0.0
## [25] RColorBrewer_1.1-3 abind_1.4-8 purrr_1.2.1
## [28] RCurl_1.98-1.17 rappdirs_0.3.3 circlize_0.4.17
## [31] ggrepel_0.9.6 irlba_2.3.5.1 alabaster.sce_1.9.0
## [34] maketools_1.3.2 pheatmap_1.0.13 RSpectra_0.16-2
## [37] dqrng_0.4.1 codetools_0.2-20 DelayedArray_0.37.0
## [40] shape_1.4.6.1 tidyselect_1.2.1 UCSC.utils_1.7.1
## [43] farver_2.1.2 ScaledMatrix_1.19.0 viridis_0.6.5
## [46] BiocFileCache_3.1.0 GenomicAlignments_1.47.0 jsonlite_2.0.0
## [49] GetoptLong_1.1.0 BiocNeighbors_2.5.0 iterators_1.0.14
## [52] foreach_1.5.2 tools_4.5.2 Rcpp_1.1.0.8.2
## [55] glue_1.8.0 gridExtra_2.3 SparseArray_1.11.10
## [58] xfun_0.55 flexclust_1.5.0 GenomeInfoDb_1.47.2
## [61] HDF5Array_1.39.0 gypsum_1.7.0 withr_3.0.2
## [64] BiocManager_1.30.27 fastmap_1.2.0 rhdf5filters_1.23.3
## [67] bluster_1.21.0 SparseM_1.84-2 digest_0.6.39
## [70] rsvd_1.0.5 R6_2.6.1 colorspace_2.1-2
## [73] Cairo_1.7-0 RSQLite_2.4.5 cigarillo_1.1.0
## [76] h5mread_1.3.1 rtracklayer_1.71.3 class_7.3-23
## [79] httr_1.4.7 S4Arrays_1.11.1 uwot_0.2.4
## [82] pkgconfig_2.0.3 gtable_0.3.6 modeltools_0.2-24
## [85] blob_1.2.4 S7_0.2.1 XVector_0.51.0
## [88] sys_3.4.3 htmltools_0.5.9 clue_0.3-66
## [91] ProtGenerics_1.43.0 scales_1.4.0 alabaster.matrix_1.9.0
## [94] png_0.1-8 scran_1.39.0 knitr_1.51
## [97] reshape2_1.4.5 rjson_0.2.23 curl_7.0.0
## [100] GlobalOptions_0.1.3 cachem_1.1.0 rhdf5_2.55.12
## [103] stringr_1.6.0 BiocVersion_3.23.1 parallel_4.5.2
## [106] vipor_0.4.7 ggrastr_1.0.2 restfulr_0.0.16
## [109] pillar_1.11.1 alabaster.schemas_1.11.0 vctrs_0.6.5
## [112] RANN_2.6.2 BiocSingular_1.27.1 dbplyr_2.5.1
## [115] beachmat_2.27.1 cluster_2.1.8.1 beeswarm_0.4.0
## [118] evaluate_1.0.5 magick_2.9.0 cli_3.6.5
## [121] locfit_1.5-9.12 compiler_4.5.2 Rsamtools_2.27.0
## [124] rlang_1.1.7 crayon_1.5.3 labeling_0.4.3
## [127] plyr_1.8.9 ggbeeswarm_0.7.3 stringi_1.8.7
## [130] viridisLite_0.4.2 alabaster.se_1.9.0 BiocParallel_1.45.0
## [133] Biostrings_2.79.4 lazyeval_0.2.2 Matrix_1.7-4
## [136] ExperimentHub_3.1.0 sparseMatrixStats_1.23.0 bit64_4.6.0-1
## [139] Rhdf5lib_1.33.0 KEGGREST_1.51.1 statmod_1.5.1
## [142] alabaster.ranges_1.11.0 AnnotationHub_4.1.0 igraph_2.2.1
## [145] memoise_2.0.1 bslib_0.9.0 bit_4.6.0
Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. https://www.nature.com/articles/s41592-019-0654-x.
Bach K, Pensa S, Grzelak M, Hadfield J, Adams DJ, Marioni JC, Khaled WT (2017). “Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing.” Nature communications, 8, 1:1-11. doi:10.1038/s41467-017-02001-5
Gu, Z. (2022) “Complex heatmap visualization.” iMeta, 1(3):e43. doi: 10.1002/imt2.43.
Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res., 5, 2122. doi:10.12688/f1000research.9501.2.
McCarthy DJ, Campbell KR, Lun ATL, Willis QF (2017). “Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R.” Bioinformatics, 33, 1179-1186. doi:10.1093/bioinformatics/btw777.
Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023
Wickham H (2016). “ggplot2: Elegant Graphics for Data Analysis.” Springer-Verlag New York.