To illustrate the Coralysis
reference-mapping method, we will use in this vignette two pancreatic islet datasets distributed through the scRNAseq
R
package: MuraroPancreasData (Muraro et al., 2016) and GrunPancreasData (Grun et al., 2016). The former dataset will be the reference and latter the query, i.e., the dataset that will be mapped against the reference.
MuraroPancreasData (reference): 2,122 cells with cell type labels
GrunPancreasData (query): 1,064 cells without cell type labels
The datasets were imported below with the function MuraroPancreasData()
from GrunPancreasData()
package as a SingleCellExperiment
object, the object (class) required by most functions in Coralysis
(see Chapter 4 The SingleCellExperiment class - OSCA manual). The SCE
object provided comprises counts
(raw count data), cell colData
(cell metadata), among other data layers. Both datasets were normalised and subsetted to 500 variable genes with the scuttle
/scran
functions logNormCounts
and getTopHVGs
, respectively (other preprocessing steps, such as removing unlabeled cells or applying quality control, were carried out following Chapter 3 Using single-cell references - OSCA manual).
Run the code below to import the R
packages and data required to reproduce this vignette.
# Packages
library("scran")
library("SingleR")
library("ggplot2")
library("scRNAseq")
library("Coralysis")
library("SingleCellExperiment")
### Objects used and QC steps from SingleR book:
# https://bioconductor.org/books/release/SingleRBook/sc-mode.html
## Reference: muraro-pancreas-2016
# Import data
ref <- MuraroPancreasData()
# Remove unnecessary data & labels
altExps(ref) <- NULL
ref <- ref[, !is.na(ref$label) & ref$label != "unclear"]
# Normalise
ref <- logNormCounts(ref)
counts(ref) <- NULL
## Query: grun-pancreas-2016
# Import data
query <- GrunPancreasData()
# Remove unnecessary data & labels
query <- addPerCellQC(query)
qc <- quickPerCellQC(colData(query),
percent_subsets = "altexps_ERCC_percent",
batch = query$donor,
subset = query$donor %in% c("D17", "D7", "D2")
)
query <- query[, !qc$discard]
altExps(query) <- NULL
# Normalise
query <- logNormCounts(query)
counts(query) <- NULL
# Feature selection with 'scran' package for reference
nhvg <- 500
set.seed(123)
top.hvg <- getTopHVGs(ref, n = nhvg)
# Subset object
ref <- ref[top.hvg, ]
# Filter out genes for query:
# not required, only done to reduce query size object
query <- query[top.hvg, ]
The first step in performing reference mapping with Coralysis
is training a dataset that is representative of the biological system under study. In this example, ref
and query
correspond to the SingleCellExperiment
objects of the reference and query pancreatic islet datasets, MuraroPancreasData and GrunPancreasData, respectively. The reference ref
is trained through RunParallelDivisiveICP
function without providing a batch.label
. In case the reference requires integration, a batch.label
should be provided. An higher number of threads
can be provided to speed up computing time depending on the number of cores available. For this example, the algorithm was run 10 times (L = 10
), but generally, this number should be higher (with the default being L = 50
). A train set can be built to improve the modeling step in Coralysis
(by default, build.train.set=TRUE
), with the respective named list
of parameters provided to build.train.params
. In this case, the number of variable genes was set to 500, as this was the number selected above (nhvg = 500
). Next, run the function RunPCA
to obtain the main result required for cell type prediction later on. In addition to cell type prediction, the query dataset(s) can be projected onto UMAP. To allow this, the argument return.model
should be set to TRUE
in both functions RunPCA
and RunUMAP
.
# Train the reference
set.seed(123)
ref <- RunParallelDivisiveICP(
object = ref, L = 10, k = 8,
build.train.params = list(nhvg = 500), # default 2000
threads = 2
) # runs without 'batch.label'
## WARNING: Setting 'divisive.method' to 'cluster' as 'batch.label=NULL'.
## If 'batch.label=NULL', 'divisive.method' can be one of: 'cluster', 'random'.
##
## Building training set...
## Training set successfully built.
##
## Computing cluster seed.
##
## Initializing divisive ICP clustering...
##
|
| | 0%
|
|======= | 10%
|
|============== | 20%
|
|===================== | 30%
|
|============================ | 40%
|
|=================================== | 50%
|
|========================================== | 60%
|
|================================================= | 70%
|
|======================================================== | 80%
|
|=============================================================== | 90%
|
|======================================================================| 100%
##
## Divisive ICP clustering completed successfully.
##
## Predicting cell cluster probabilities using ICP models...
## Prediction of cell cluster probabilities completed successfully.
##
## Multi-level integration completed successfully.
# Compute reference PCA
ref <- RunPCA(ref, return.model = TRUE, pca.method = "stats")
## Divisive ICP: selecting ICP tables multiple of 3
# Compute reference UMAP
set.seed(123)
ref <- RunUMAP(ref,
return.model = TRUE,
umap.method = "uwot", dims = 1:15,
n_neighbors = 30, min_dist = 0.3
)
Below is the UMAP plot for the reference sample with cell type annotations. In this example, we are using the annotations provided in the object. Ideally, the sample should be annotated after training by performing clustering and manual cluster annotation. Only the resulting manual cell type annotations should be used for prediction. For simplicity, we will use the annotations provided in the object.
# Vizualize reference
ref.celltype.plot <- PlotDimRed(
object = ref,
color.by = "label",
dimred = "UMAP",
point.size = 0.01,
legend.nrow = 3,
seed.color = 17
) +
ggtitle("Reference: ground-truth labels")
ref.celltype.plot
Perform reference-mapping with Coralysis
by running the function ReferenceMapping
. This requires to provide the trained reference (ref
) with cell type annotations intended for the prediction (ref.label = "label"
) and the query dataset (query
). The label in the reference aimed to be used for prediction needs to be available on colData(ref)
. In this case, we are providing the cell type labels from the column label
available in the reference ref
. Since we want to project the query onto the reference UMAP, set project.umap
as TRUE
. The argument dimred.name.prefix
just sets the name given as prefix of the low dimensional embeddings stored in reducedDimNames(map)
. The SingleCellExperiment
object query
will contain the same information as query
, with the predictions and embeddings mapped onto the reference. The predictions consist in coral_labels
and coral_probability
stored in colData(query)
. The coral_labels
correspond to the cell type predictions obtained against the reference. The coral_probability
represents the proportion of K neighbors from the winning class (k.nn
was set to 5 - default 10); the higher the value, the better.
# Reference-mapping
set.seed(1024)
query <- ReferenceMapping(
ref = ref, query = query, ref.label = "label",
project.umap = TRUE, dimred.name.prefix = "ref",
k.nn = 5
)
Below, SingleR
is run with the ref
erence to predict the cell type labels for the query
, in order to compare with the Coralysis
predictions, since the query
dataset used does not provide cell type labels. The accuracy of Coralysis
reference-mapping method is presented below together with a confusion matrix between the predicted Coralysis
labels (rows) versus predicted SingleR
labels (columns).
# Compare against SingleR
preds.singler <- SingleR(test = query, ref = ref, labels = ref$label, de.method = "wilcox")
colData(query) <- cbind(colData(query), preds.singler)
# Confusion matrix
preds_x_truth <- table(query$coral_labels, query$labels)
# Accuracy
acc <- sum(diag(preds_x_truth)) / sum(preds_x_truth)
# Print confusion matrix
preds_x_truth
##
## acinar alpha beta delta duct endothelial epsilon mesenchymal pp
## acinar 249 0 0 0 0 0 0 0 0
## alpha 4 201 0 1 0 0 0 0 0
## beta 8 1 165 10 0 0 0 0 0
## delta 0 0 2 45 0 0 0 0 0
## duct 22 0 9 0 299 1 0 8 0
## endothelial 0 0 0 0 0 5 0 0 0
## epsilon 0 0 0 0 0 0 1 0 0
## mesenchymal 0 0 0 0 0 0 0 15 0
## pp 0 0 0 0 0 0 0 0 18
The accuracy of Coralysis
reference-mapping method was 93.8%.
Visualize below the query cells projected onto reference UMAP and how well the predictions match the query ground-truth. The coral_probability
is a prediction confidence score. Predictions with low scores (<0.5) should be carefully inspected.
# Plot query and reference UMAP side-by-side
# with ground-truth & predicted cell labels
use.color <- c("acinar" = "#FF7F00", "alpha" = "#CCCCCC", "beta" = "#FDCDAC",
"delta" = "#E6F5C9", "duct" = "#E31A1C", "endothelial" = "#377EB8",
"epsilon" = "#7FC97F", "mesenchymal" = "#F0027F", "pp" = "#B3CDE3")
query.singler.plot <- PlotDimRed(
object = query,
color.by = "labels",
dimred = "refUMAP",
point.size = 0.25,
point.stroke = 0.25,
legend.nrow = 3,
use.color = use.color
) +
ggtitle("Query: SingleR predictions")
query.predicted.plot <- PlotDimRed(
object = query,
color.by = "coral_labels",
dimred = "refUMAP",
point.size = 0.25,
point.stroke = 0.25,
legend.nrow = 3,
use.color = use.color
) +
ggtitle("Query: Coralysis predictions")
query.confidence.plot <- PlotExpression(
object = query,
color.by = "coral_probability",
dimred = "refUMAP",
point.size = 0.25,
point.stroke = 0.1,
color.scale = "viridis"
) +
ggtitle("Query: Coralysis confidence")
cowplot::plot_grid(ref.celltype.plot, query.singler.plot,
query.predicted.plot, query.confidence.plot,
ncol = 2, align = "vh"
)
# R session
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] SingleR_2.11.0 MouseGastrulationData_1.23.0
## [3] SpatialExperiment_1.19.0 scran_1.37.0
## [5] ensembldb_2.33.0 AnnotationFilter_1.33.0
## [7] GenomicFeatures_1.61.0 AnnotationDbi_1.71.0
## [9] ComplexHeatmap_2.25.0 Coralysis_0.99.6
## [11] scRNAseq_2.23.0 scater_1.37.0
## [13] ggplot2_3.5.2 scuttle_1.19.0
## [15] SingleCellExperiment_1.31.0 SummarizedExperiment_1.39.0
## [17] Biobase_2.69.0 GenomicRanges_1.61.0
## [19] GenomeInfoDb_1.45.3 IRanges_2.43.0
## [21] S4Vectors_0.47.0 BiocGenerics_0.55.0
## [23] generics_0.1.3 MatrixGenerics_1.21.0
## [25] matrixStats_1.5.0 dplyr_1.1.4
## [27] BiocStyle_2.37.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 BiocIO_1.19.0
## [3] bitops_1.0-9 filelock_1.0.3
## [5] tibble_3.2.1 XML_3.99-0.18
## [7] lifecycle_1.0.4 httr2_1.1.2
## [9] aricode_1.0.3 edgeR_4.7.2
## [11] doParallel_1.0.17 lattice_0.22-7
## [13] alabaster.base_1.9.0 magrittr_2.0.3
## [15] limma_3.65.0 sass_0.4.10
## [17] rmarkdown_2.29 jquerylib_0.1.4
## [19] yaml_2.3.10 metapod_1.17.0
## [21] askpass_1.2.1 reticulate_1.42.0
## [23] cowplot_1.1.3 LiblineaR_2.10-24
## [25] DBI_1.2.3 RColorBrewer_1.1-3
## [27] abind_1.4-8 Rtsne_0.17
## [29] purrr_1.0.4 BumpyMatrix_1.17.0
## [31] RCurl_1.98-1.17 rappdirs_0.3.3
## [33] circlize_0.4.16 ggrepel_0.9.6
## [35] irlba_2.3.5.1 alabaster.sce_1.9.0
## [37] pheatmap_1.0.12 umap_0.2.10.0
## [39] RSpectra_0.16-2 dqrng_0.4.1
## [41] DelayedMatrixStats_1.31.0 codetools_0.2-20
## [43] DelayedArray_0.35.1 shape_1.4.6.1
## [45] tidyselect_1.2.1 UCSC.utils_1.5.0
## [47] farver_2.1.2 ScaledMatrix_1.17.0
## [49] viridis_0.6.5 BiocFileCache_2.99.0
## [51] GenomicAlignments_1.45.0 jsonlite_2.0.0
## [53] GetoptLong_1.0.5 BiocNeighbors_2.3.0
## [55] iterators_1.0.14 foreach_1.5.2
## [57] tools_4.5.0 Rcpp_1.0.14
## [59] glue_1.8.0 gridExtra_2.3
## [61] SparseArray_1.9.0 xfun_0.52
## [63] flexclust_1.5.0 HDF5Array_1.37.0
## [65] gypsum_1.5.0 withr_3.0.2
## [67] BiocManager_1.30.25 fastmap_1.2.0
## [69] rhdf5filters_1.21.0 bluster_1.19.0
## [71] openssl_2.3.2 SparseM_1.84-2
## [73] digest_0.6.37 rsvd_1.0.5
## [75] R6_2.6.1 colorspace_2.1-1
## [77] Cairo_1.6-2 dichromat_2.0-0.1
## [79] RSQLite_2.3.11 h5mread_1.1.0
## [81] rtracklayer_1.69.0 class_7.3-23
## [83] httr_1.4.7 S4Arrays_1.9.0
## [85] uwot_0.2.3 pkgconfig_2.0.3
## [87] gtable_0.3.6 modeltools_0.2-24
## [89] blob_1.2.4 XVector_0.49.0
## [91] htmltools_0.5.8.1 bookdown_0.43
## [93] clue_0.3-66 ProtGenerics_1.41.0
## [95] scales_1.4.0 alabaster.matrix_1.9.0
## [97] png_0.1-8 knitr_1.50
## [99] reshape2_1.4.4 rjson_0.2.23
## [101] curl_6.2.2 GlobalOptions_0.1.2
## [103] cachem_1.1.0 rhdf5_2.53.0
## [105] stringr_1.5.1 BiocVersion_3.22.0
## [107] parallel_4.5.0 vipor_0.4.7
## [109] ggrastr_1.0.2 restfulr_0.0.15
## [111] pillar_1.10.2 alabaster.schemas_1.9.0
## [113] vctrs_0.6.5 RANN_2.6.2
## [115] BiocSingular_1.25.0 dbplyr_2.5.0
## [117] beachmat_2.25.0 cluster_2.1.8.1
## [119] beeswarm_0.4.0 evaluate_1.0.3
## [121] magick_2.8.6 tinytex_0.57
## [123] cli_3.6.5 locfit_1.5-9.12
## [125] compiler_4.5.0 Rsamtools_2.25.0
## [127] rlang_1.1.6 crayon_1.5.3
## [129] labeling_0.4.3 plyr_1.8.9
## [131] ggbeeswarm_0.7.2 stringi_1.8.7
## [133] viridisLite_0.4.2 alabaster.se_1.9.0
## [135] BiocParallel_1.43.0 Biostrings_2.77.0
## [137] lazyeval_0.2.2 scrapper_1.3.0
## [139] Matrix_1.7-3 ExperimentHub_2.99.0
## [141] sparseMatrixStats_1.21.0 bit64_4.6.0-1
## [143] Rhdf5lib_1.31.0 KEGGREST_1.49.0
## [145] statmod_1.5.0 alabaster.ranges_1.9.0
## [147] AnnotationHub_3.99.1 igraph_2.1.4
## [149] memoise_2.0.1 bslib_0.9.0
## [151] bit_4.6.0
Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. doi:10.1038/s41592-019-0654-x.
Aran D, Looney AP, Liu L, Wu E, Fong V, Hsu A, Chak S, Naikawadi RP, Wolters PJ, Abate AR, Butte AJ, Bhattacharya M (2019). “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nat. Immunol., 20:163-172. doi:10.1038/s41590-018-0276-y.
Grun D, Muraro MJ, Boisset JC, Wiebrands K, Lyubimova A, Dharmadhikari G, van den Born M, van Es J, Jansen E, Clevers H, de Koning EJP, van Oudenaarden EJP (2016). “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell, 19(2):266–77. doi.org/10.1016/j.stem.2016.05.010
Muraro MJ, Dharmadhikari G, Grun D, Groen N, Dielen T, Jansen E, van Gurp L, Engelse MA, Carlotti F, de Koning EJP, van Oudenaarden A (2016). “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Systems, 3(4):385–94. doi.org/10.1016/j.cels.2016.09.002
Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023
Wickham H (2016). “ggplot2: Elegant Graphics for Data Analysis.” Springer-Verlag New York.