Reference-mapping

0.1 Dataset
0.2 Train reference
0.3 Map query
0.4 Prediction accuracy
- 0.4.1 Confusion matrix
- 0.4.2 DimRed
0.5 R session
0.6 References

0.1 Dataset

To illustrate the Coralysis reference-mapping method, we will use in this vignette two pancreatic islet datasets distributed through the scRNAseq R package: MuraroPancreasData (Muraro et al., 2016) and GrunPancreasData (Grun et al., 2016). The former dataset will be the reference and latter the query, i.e., the dataset that will be mapped against the reference.

MuraroPancreasData (reference): 2,122 cells with cell type labels
GrunPancreasData (query): 1,064 cells without cell type labels

The datasets were imported below with the function MuraroPancreasData() from GrunPancreasData() package as a SingleCellExperiment object, the object (class) required by most functions in Coralysis (see Chapter 4 The SingleCellExperiment class - OSCA manual). The SCE object provided comprises counts (raw count data), cell colData (cell metadata), among other data layers. Both datasets were normalised and subsetted to 500 variable genes with the scuttle/scran functions logNormCounts and getTopHVGs, respectively (other preprocessing steps, such as removing unlabeled cells or applying quality control, were carried out following Chapter 3 Using single-cell references - OSCA manual).

Run the code below to import the R packages and data required to reproduce this vignette.

# Packages
library("scran")
library("SingleR")
library("ggplot2")
library("scRNAseq")
library("Coralysis")
library("SingleCellExperiment")

### Objects used and QC steps from SingleR book:
# https://bioconductor.org/books/release/SingleRBook/sc-mode.html

## Reference: muraro-pancreas-2016
# Import data
ref <- MuraroPancreasData()

# Remove unnecessary data & labels
altExps(ref) <- NULL
ref <- ref[, !is.na(ref$label) & ref$label != "unclear"]

# Normalise
ref <- logNormCounts(ref)
counts(ref) <- NULL


## Query: grun-pancreas-2016
# Import data
query <- GrunPancreasData()

# Remove unnecessary data & labels
query <- addPerCellQC(query)
qc <- quickPerCellQC(colData(query),
    percent_subsets = "altexps_ERCC_percent",
    batch = query$donor,
    subset = query$donor %in% c("D17", "D7", "D2")
)
query <- query[, !qc$discard]
altExps(query) <- NULL

# Normalise
query <- logNormCounts(query)
counts(query) <- NULL

# Feature selection with 'scran' package for reference
nhvg <- 500
set.seed(123)
top.hvg <- getTopHVGs(ref, n = nhvg)

# Subset object
ref <- ref[top.hvg, ]

# Filter out genes for query:
# not required, only done to reduce query size object
query <- query[top.hvg, ]

0.2 Train reference

The first step in performing reference mapping with Coralysis is training a dataset that is representative of the biological system under study. In this example, ref and query correspond to the SingleCellExperiment objects of the reference and query pancreatic islet datasets, MuraroPancreasData and GrunPancreasData, respectively. The reference ref is trained through RunParallelDivisiveICP function without providing a batch.label. In case the reference requires integration, a batch.label should be provided. An higher number of threads can be provided to speed up computing time depending on the number of cores available. For this example, the algorithm was run 10 times (L = 10), but generally, this number should be higher (with the default being L = 50). A train set can be built to improve the modeling step in Coralysis (by default, build.train.set=TRUE), with the respective named list of parameters provided to build.train.params. In this case, the number of variable genes was set to 500, as this was the number selected above (nhvg = 500). Next, run the function RunPCA to obtain the main result required for cell type prediction later on. In addition to cell type prediction, the query dataset(s) can be projected onto UMAP. To allow this, the argument return.model should be set to TRUE in both functions RunPCA and RunUMAP.

# Train the reference
set.seed(123)
ref <- RunParallelDivisiveICP(
    object = ref, L = 10, k = 8,
    build.train.params = list(nhvg = 500), # default 2000
    threads = 2
) # runs without 'batch.label'

## WARNING: Setting 'divisive.method' to 'cluster' as 'batch.label=NULL'. 
## If 'batch.label=NULL', 'divisive.method' can be one of: 'cluster', 'random'.

## 
## Building training set...

## Training set successfully built.

## 
## Computing cluster seed.

## 
## Initializing divisive ICP clustering...

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |======================================================================| 100%

## 
## Divisive ICP clustering completed successfully.

## 
## Predicting cell cluster probabilities using ICP models...

## Prediction of cell cluster probabilities completed successfully.

## 
## Multi-level integration completed successfully.

# Compute reference PCA
ref <- RunPCA(ref, return.model = TRUE, pca.method = "stats")

## Divisive ICP: selecting ICP tables multiple of 3

# Compute reference UMAP
set.seed(123)
ref <- RunUMAP(ref,
    return.model = TRUE,
    umap.method = "uwot", dims = 1:15,
    n_neighbors = 30, min_dist = 0.3
)

Below is the UMAP plot for the reference sample with cell type annotations. In this example, we are using the annotations provided in the object. Ideally, the sample should be annotated after training by performing clustering and manual cluster annotation. Only the resulting manual cell type annotations should be used for prediction. For simplicity, we will use the annotations provided in the object.

# Vizualize reference
ref.celltype.plot <- PlotDimRed(
    object = ref,
    color.by = "label",
    dimred = "UMAP",
    point.size = 0.01,
    legend.nrow = 3,
    seed.color = 17
) +
    ggtitle("Reference: ground-truth labels")
ref.celltype.plot

0.3 Map query

Perform reference-mapping with Coralysis by running the function ReferenceMapping. This requires to provide the trained reference (ref) with cell type annotations intended for the prediction (ref.label = "label") and the query dataset (query). The label in the reference aimed to be used for prediction needs to be available on colData(ref). In this case, we are providing the cell type labels from the column label available in the reference ref. Since we want to project the query onto the reference UMAP, set project.umap as TRUE. The argument dimred.name.prefix just sets the name given as prefix of the low dimensional embeddings stored in reducedDimNames(map). The SingleCellExperiment object query will contain the same information as query, with the predictions and embeddings mapped onto the reference. The predictions consist in coral_probability, coral_labels, and pruned_coral_labels stored in colData(query). The coral_labels correspond to the cell type predictions obtained against the reference. The coral_probability represents the proportion of K neighbors from the winning class (k.nn was set to 5 - default 10); the higher the value, the better. The pruned_coral_labels corresponds to the pruned low-confidence predicted cell labels based on the confidence probability scores, i.e., predicted cell labels with confidence scores less than or equal to 0.5 (label.prune.cutoff = 0.5) are considered unclassified and set to NA.

# Reference-mapping
set.seed(1024)
query <- ReferenceMapping(
    ref = ref, query = query, ref.label = "label",
    project.umap = TRUE, dimred.name.prefix = "ref",
    k.nn = 5, label.prune.cutoff = 0.5
)

0.4 Prediction accuracy

Below, SingleR is run with the reference to predict the cell type labels for the query, in order to compare with the Coralysis predictions, since the query dataset used does not provide cell type labels. The accuracy of Coralysis reference-mapping method is presented below together with a confusion matrix between the predicted Coralysis labels (rows) versus predicted SingleR labels (columns).

0.4.1 Confusion matrix

# Compare against SingleR
preds.singler <- SingleR(test = query, ref = ref, labels = ref$label, de.method = "wilcox")
colData(query) <- cbind(colData(query), preds.singler)

# Confusion matrix 'coral_labels'
preds_x_truth <- table(query$coral_labels, query$labels)

# Accuracy for 'coral_labels'
acc <- sum(diag(preds_x_truth)) / sum(preds_x_truth)

# Print confusion matrix
preds_x_truth

##              
##               acinar alpha beta delta duct endothelial epsilon mesenchymal  pp
##   acinar         249     0    0     0    0           0       0           0   0
##   alpha            4   201    0     1    0           0       0           0   0
##   beta             8     1  165    10    0           0       0           0   0
##   delta            0     0    2    45    0           0       0           0   0
##   duct            22     0    9     0  299           1       0           8   0
##   endothelial      0     0    0     0    0           5       0           0   0
##   epsilon          0     0    0     0    0           0       1           0   0
##   mesenchymal      0     0    0     0    0           0       0          15   0
##   pp               0     0    0     0    0           0       0           0  18

The accuracy of Coralysis reference-mapping method was 93.8%.

The accuracy is higher for the pruned Coralysis cell labels, as the low-confidence cell labels were removed. See below the confusion matrix between the pruned Coralysis labels (rows) versus predicted SingleR labels (columns) and respective accuracy.

# Confusion matrix for 'pruned_coral_labels'
pruned_preds_x_truth <- table(query$pruned_coral_labels, query$labels)


# Accuracy for 'pruned_coral_labels'
pruned_acc <- sum(diag(pruned_preds_x_truth)) / sum(pruned_preds_x_truth)

# Print confusion matrix
pruned_preds_x_truth

##              
##               acinar alpha beta delta duct endothelial epsilon mesenchymal  pp
##   acinar         245     0    0     0    0           0       0           0   0
##   alpha            2   201    0     0    0           0       0           0   0
##   beta             2     1  164    10    0           0       0           0   0
##   delta            0     0    2    44    0           0       0           0   0
##   duct            19     0    9     0  299           1       0           5   0
##   endothelial      0     0    0     0    0           5       0           0   0
##   epsilon          0     0    0     0    0           0       0           0   0
##   mesenchymal      0     0    0     0    0           0       0          10   0
##   pp               0     0    0     0    0           0       0           0  18

The accuracy of Coralysis reference-mapping method using the pruned cell labels was 95.1%. The accuracy is higher using the pruned labels as expected once the low-confidence cell predictions were removed.

0.4.2 DimRed

Visualize below the query cells projected onto reference UMAP and how well the predictions match the query ground-truth. The coral_probability is a prediction confidence score. Predictions with low scores (<0.5) should be carefully inspected.

# Plot query and reference UMAP side-by-side # fig.height=7.25
# with ground-truth & predicted cell labels
use.color <- c("acinar" = "#FF7F00", "alpha" = "#CCCCCC", "beta" = "#FDCDAC", 
               "delta" = "#E6F5C9", "duct" = "#E31A1C", "endothelial" = "#377EB8", 
               "epsilon" = "#7FC97F", "mesenchymal" = "#F0027F", "pp" = "#B3CDE3")
query.singler.plot <- PlotDimRed(
    object = query,
    color.by = "labels",
    dimred = "refUMAP",
    point.size = 0.25,
    point.stroke = 0.25,
    legend.nrow = 3,
    use.color = use.color
) +
    ggtitle("Query: SingleR predictions")
query.predicted.plot <- PlotDimRed(
    object = query,
    color.by = "coral_labels",
    dimred = "refUMAP",
    point.size = 0.25,
    point.stroke = 0.25,
    legend.nrow = 3,
    use.color = use.color
) +
    ggtitle("Query: Coralysis predictions")
query.pruned.plot <- PlotDimRed(
    object = query,
    color.by = "pruned_coral_labels",
    dimred = "refUMAP",
    point.size = 0.25,
    point.stroke = 0.25,
    legend.nrow = 3,
    use.color = use.color
) +
    ggtitle("Query: Coralysis predictions pruned")
query.confidence.plot <- PlotExpression(
    object = query,
    color.by = "coral_probability",
    dimred = "refUMAP",
    point.size = 0.25,
    point.stroke = 0.1,
    color.scale = "viridis"
) +
    ggtitle("Query: Coralysis confidence")
cowplot::plot_grid(ref.celltype.plot, query.singler.plot,
    query.predicted.plot, query.pruned.plot,
    query.confidence.plot,
    ncol = 2, align = "vh"
)

0.5 R session

# R session
sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] SingleR_2.11.3               MouseGastrulationData_1.23.0
##  [3] SpatialExperiment_1.19.1     scran_1.37.0                
##  [5] ensembldb_2.33.1             AnnotationFilter_1.33.0     
##  [7] GenomicFeatures_1.61.4       AnnotationDbi_1.71.0        
##  [9] ComplexHeatmap_2.25.2        Coralysis_0.99.10           
## [11] scRNAseq_2.23.0              scater_1.37.0               
## [13] ggplot2_3.5.2                scuttle_1.19.0              
## [15] SingleCellExperiment_1.31.1  SummarizedExperiment_1.39.1 
## [17] Biobase_2.69.0               GenomicRanges_1.61.1        
## [19] Seqinfo_0.99.1               IRanges_2.43.0              
## [21] S4Vectors_0.47.0             BiocGenerics_0.55.0         
## [23] generics_0.1.4               MatrixGenerics_1.21.0       
## [25] matrixStats_1.5.0            dplyr_1.1.4                 
## [27] BiocStyle_2.37.0            
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22          BiocIO_1.19.0            
##   [3] bitops_1.0-9              filelock_1.0.3           
##   [5] tibble_3.3.0              XML_3.99-0.18            
##   [7] lifecycle_1.0.4           httr2_1.2.0              
##   [9] aricode_1.0.3             edgeR_4.7.3              
##  [11] doParallel_1.0.17         lattice_0.22-7           
##  [13] alabaster.base_1.9.4      magrittr_2.0.3           
##  [15] limma_3.65.1              sass_0.4.10              
##  [17] rmarkdown_2.29            jquerylib_0.1.4          
##  [19] yaml_2.3.10               metapod_1.17.0           
##  [21] askpass_1.2.1             reticulate_1.42.0        
##  [23] cowplot_1.2.0             LiblineaR_2.10-24        
##  [25] DBI_1.2.3                 RColorBrewer_1.1-3       
##  [27] abind_1.4-8               Rtsne_0.17               
##  [29] purrr_1.1.0               BumpyMatrix_1.17.0       
##  [31] RCurl_1.98-1.17           rappdirs_0.3.3           
##  [33] circlize_0.4.16           ggrepel_0.9.6            
##  [35] irlba_2.3.5.1             alabaster.sce_1.9.0      
##  [37] pheatmap_1.0.13           umap_0.2.10.0            
##  [39] RSpectra_0.16-2           dqrng_0.4.1              
##  [41] DelayedMatrixStats_1.31.0 codetools_0.2-20         
##  [43] DelayedArray_0.35.2       shape_1.4.6.1            
##  [45] tidyselect_1.2.1          UCSC.utils_1.5.0         
##  [47] farver_2.1.2              ScaledMatrix_1.17.0      
##  [49] viridis_0.6.5             BiocFileCache_2.99.5     
##  [51] GenomicAlignments_1.45.1  jsonlite_2.0.0           
##  [53] GetoptLong_1.0.5          BiocNeighbors_2.3.1      
##  [55] iterators_1.0.14          foreach_1.5.2            
##  [57] tools_4.5.1               Rcpp_1.1.0               
##  [59] glue_1.8.0                gridExtra_2.3            
##  [61] SparseArray_1.9.0         xfun_0.52                
##  [63] flexclust_1.5.0           GenomeInfoDb_1.45.7      
##  [65] HDF5Array_1.37.0          gypsum_1.5.0             
##  [67] withr_3.0.2               BiocManager_1.30.26      
##  [69] fastmap_1.2.0             rhdf5filters_1.21.0      
##  [71] bluster_1.19.0            openssl_2.3.3            
##  [73] SparseM_1.84-2            digest_0.6.37            
##  [75] rsvd_1.0.5                R6_2.6.1                 
##  [77] colorspace_2.1-1          Cairo_1.6-2              
##  [79] dichromat_2.0-0.1         RSQLite_2.4.1            
##  [81] h5mread_1.1.1             rtracklayer_1.69.1       
##  [83] class_7.3-23              httr_1.4.7               
##  [85] S4Arrays_1.9.1            uwot_0.2.3               
##  [87] pkgconfig_2.0.3           gtable_0.3.6             
##  [89] modeltools_0.2-24         blob_1.2.4               
##  [91] XVector_0.49.0            htmltools_0.5.8.1        
##  [93] bookdown_0.43             clue_0.3-66              
##  [95] ProtGenerics_1.41.0       scales_1.4.0             
##  [97] alabaster.matrix_1.9.0    png_0.1-8                
##  [99] knitr_1.50                reshape2_1.4.4           
## [101] rjson_0.2.23              curl_6.4.0               
## [103] GlobalOptions_0.1.2       cachem_1.1.0             
## [105] rhdf5_2.53.1              stringr_1.5.1            
## [107] BiocVersion_3.22.0        parallel_4.5.1           
## [109] vipor_0.4.7               ggrastr_1.0.2            
## [111] restfulr_0.0.16           pillar_1.11.0            
## [113] alabaster.schemas_1.9.0   vctrs_0.6.5              
## [115] RANN_2.6.2                BiocSingular_1.25.0      
## [117] dbplyr_2.5.0              beachmat_2.25.2          
## [119] cluster_2.1.8.1           beeswarm_0.4.0           
## [121] evaluate_1.0.4            magick_2.8.7             
## [123] tinytex_0.57              cli_3.6.5                
## [125] locfit_1.5-9.12           compiler_4.5.1           
## [127] Rsamtools_2.25.1          rlang_1.1.6              
## [129] crayon_1.5.3              labeling_0.4.3           
## [131] plyr_1.8.9                ggbeeswarm_0.7.2         
## [133] stringi_1.8.7             viridisLite_0.4.2        
## [135] alabaster.se_1.9.0        BiocParallel_1.43.4      
## [137] Biostrings_2.77.2         lazyeval_0.2.2           
## [139] scrapper_1.3.5            Matrix_1.7-3             
## [141] ExperimentHub_2.99.5      sparseMatrixStats_1.21.0 
## [143] bit64_4.6.0-1             Rhdf5lib_1.31.0          
## [145] KEGGREST_1.49.1           statmod_1.5.0            
## [147] alabaster.ranges_1.9.0    AnnotationHub_3.99.6     
## [149] igraph_2.1.4              memoise_2.0.1            
## [151] bslib_0.9.0               bit_4.6.0

0.6 References

Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. doi:10.1038/s41592-019-0654-x.

Aran D, Looney AP, Liu L, Wu E, Fong V, Hsu A, Chak S, Naikawadi RP, Wolters PJ, Abate AR, Butte AJ, Bhattacharya M (2019). “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nat. Immunol., 20:163-172. doi:10.1038/s41590-018-0276-y.

Grun D, Muraro MJ, Boisset JC, Wiebrands K, Lyubimova A, Dharmadhikari G, van den Born M, van Es J, Jansen E, Clevers H, de Koning EJP, van Oudenaarden EJP (2016). “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell, 19(2):266–77. doi.org/10.1016/j.stem.2016.05.010

Muraro MJ, Dharmadhikari G, Grun D, Groen N, Dielen T, Jansen E, van Gurp L, Engelse MA, Carlotti F, de Koning EJP, van Oudenaarden A (2016). “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Systems, 3(4):385–94. doi.org/10.1016/j.cels.2016.09.002

Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023

Wickham H (2016). “ggplot2: Elegant Graphics for Data Analysis.” Springer-Verlag New York.