Integration

0.1 Dataset
0.2 Normalization
0.3 HVG selection
0.4 DimRed: pre-integration
0.5 Multi-level integration
0.6 DimRed: post-integration
0.7 Clustering
0.8 Cluster markers
0.9 DGE
0.10 R session
0.11 References

0.1 Dataset

To illustrate the multi-level integration algorithm, we will use in this vignette two samples from the MouseGastrulationData R package: 1 and 3. The samples correspond to different pool and embryonic stages:

1: pool index 1 and embryonic stage E6.5 with 360 cells
3: pool index 3 and embryonic stage E7.5 with 458 cells

The samples were imported below with the function EmbryoAtlasData() from MouseGastrulationData package as a SingleCellExperiment object, the object (class) required by most functions in Coralysis (see Chapter 4 The SingleCellExperiment class - OSCA manual). The SCE object provided comprises counts (raw count data), cell colData (which includes batch and cell labels, designated as pool and celltype, respectively), among other data layers.

Run the code below to import the R packages and data required to reproduce this vignette.

# Packages
library("scran")
library("ggplot2")
library("Coralysis")
library("SingleCellExperiment")
library("MouseGastrulationData")

# Import data
sce <- EmbryoAtlasData(samples = c(1, 3))

0.2 Normalization

Coralysis requires log-normalised data as input. Run the code below to apply the basic log-normalisation method implemented in Seurat (see NormalizeData).

## Normalize the data
# log1p normalization
SeuratNormalisation <- function(object) {
    # 'SeuratNormalisation()' function applies the basic Seurat normalization to
    # a SingleCellExperiment object with a 'counts' assay. Normalized data
    # is saved in the 'logcounts' assay.
    logcounts(object) <- apply(counts(object), 2, function(x) {
        log1p(x / sum(x) * 10000)
    }) # log1p normalization w/ 10K scaling factor
    logcounts(object) <- as(logcounts(object), "sparseMatrix")
    return(object)
}
sce <- SeuratNormalisation(object = sce)

# Remove 'counts' to reduce data size object
counts(sce) <- NULL

In alternative, scran normalization can be performed, which is particularly beneficial if rare cell types exist (see the following vignette). In addition, have a look into Chapter 7 Normalization - OSCA manual.

0.3 HVG selection

Highly variable genes (HVG) can be selected using the R package scran. The variable pool from colData is used as batch label. The SingleCellExperiment object allows alternative experiments to the main experiment. This is important to keep a backup of all genes in the same SingleCellExperiment object before selecting HVGs (see SingleCellExperiment vignette).

# Feature selection with 'scran' package
nhvg <- 500
batch.label <- "pool"
sce[[batch.label]] <- factor(sce[[batch.label]])
m.hvg <- scran::modelGeneVar(sce, block = sce[[batch.label]])
top.hvg <- getTopHVGs(m.hvg, n = nhvg)

# Subset object
sce <- sce[top.hvg, ]

# Rename genes: 'Ensembl ID' to SYMBOL
row.names(sce) <- rowData(sce)$SYMBOL
rowData(sce) <- NULL

0.4 DimRed: pre-integration

The batch effect between assays can be inspected below by projecting the data onto t-distributed Stochastic Neighbor Embedding (t-SNE). This can be achieved by running sequentially the Coralysis functions RunPCA and RunTSNE. Provide a seed before running each one of these functions to ensure reproducibility. The function RunPCA runs by default the PCA method implemented in the R package irlba (pca.method="irlba"), which requires a seed to ensure the same PCA result. In addition, the assay.name argument needs to be provided, otherwise uses by default the Coralysis probabilities which are obtained only after integration (after running RunParallelDivisiveICP). The assay logcounts, corresponding to the log-normalized data, and number of principal components to use p were provided. Any categorical variable available in colData(sce), such as pool or celltype, can be visualized in a low dimensional embedding stored in reducedDimNames(sce) with the Coralysis function PlotDimRed.

# Compute PCA & TSNE
set.seed(123)
sce <- RunPCA(
    object = sce,
    assay.name = "logcounts",
    p = 30, dimred.name = "unintPCA"
)
set.seed(123)
sce <- RunTSNE(sce,
    dimred.type = "unintPCA",
    dimred.name = "unintTSNE"
)

# Plot TSNE highlighting the batch & cell type
unint.batch.plot <- PlotDimRed(
    object = sce,
    color.by = "pool",
    dimred = "unintTSNE",
    point.size = 0.01,
    legend.nrow = 1,
    seed.color = 1024
)
unint.cell.plot <- PlotDimRed(
    object = sce,
    color.by = "celltype",
    dimred = "unintTSNE",
    point.size = 0.01,
    legend.nrow = 14,
    seed.color = 7
)
cowplot::plot_grid(unint.batch.plot, unint.cell.plot, ncol = 2, align = "vh")

0.5 Multi-level integration

Integrate assays with the multi-level integration algorithm implemented in Coralysis by running the function RunParallelDivisiveICP. The only arguments required by this function are object and batch.label. The object requires a SingleCellExperiment object with the assay logcounts. The matrix in logcounts should be sparse, i.e., is(logcounts(sce), "dgCMatrix") is TRUE, and it should not contain non-expressing genes. This is ensured by running PrepareData before. The batch.label argument requires a label column name in colData(sce) corresponding to the batch label that should be used for integration. In the absence of a batch, the same function, RunParallelDivisiveICP, can be run without providing batch.label (i.e., batch.label = NULL), in which case the data will be modeled through the algorithm to identify fine-grained populations that do not required batch correction. An higher number of threads can be provided to speed up computing time depending on the number of cores available. For this example, the algorithm was run 10 times (L = 10), but generally, this number should be higher (with the default being L = 50). A train set can be built to improve the modeling step in Coralysis. In this case, the option was set to FALSE (build.train.set = FALSE) as it is faster and requires less memory. Otherwise, using this option is recommended as it can make the modeling step faster and consume less memory for larger datasets.

# Prepare data for integration:
# remove non-expressing genes & logcounts is from `dgCMatrix` class
sce <- PrepareData(object = sce)

## Data in `logcounts` slot already of `dgCMatrix` class...

## 500/500 features remain after filtering features with only zero values.

# Perform integration with Coralysis
set.seed(1024)
sce <- RunParallelDivisiveICP(
    object = sce,
    batch.label = "pool",
    build.train.set = FALSE,
    L = 10, threads = 2
)

## 
## Computing cluster seed.

## 
## Initializing divisive ICP clustering...

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |======================================================================| 100%

## 
## Divisive ICP clustering completed successfully.

## 
## Predicting cell cluster probabilities using ICP models...

## Prediction of cell cluster probabilities completed successfully.

## 
## Multi-level integration completed successfully.

0.6 DimRed: post-integration

The integration result can be visually inspected by running sequentially the functions RunPCA and RunTSNE. The assay.name provided to RunPCA must be joint.probability (the default), the primary output of integration with Coralysis. The probability matrices from Coralysis (i.e., joint.probability) can be used to obtain an integrated embedding by running RunPCA(..., assay.name = "joint.probability"). This integrated PCA can, in turn, be used downstream for clustering or non-linear dimensional reduction techniques, such as RunTSNE. Below, the integrated PCA was named intPCA.

# Compute PCA with joint cluster probabilities & TSNE
set.seed(123)
sce <- RunPCA(sce,
    assay.name = "joint.probability",
    dimred.name = "intPCA"
)

## Divisive ICP: selecting ICP tables multiple of 4

set.seed(123)
sce <- RunTSNE(sce,
    dimred.type = "intPCA",
    dimred.name = "intTSNE"
)

# Plot TSNE highlighting the batch & cell type
int.batch.plot <- PlotDimRed(
    object = sce,
    color.by = "pool",
    dimred = "intTSNE",
    point.size = 0.01,
    legend.nrow = 1,
    seed.color = 1024
)
int.cell.plot <- PlotDimRed(
    object = sce,
    color.by = "celltype",
    dimred = "intTSNE",
    point.size = 0.01,
    legend.nrow = 14,
    seed.color = 7
)
cowplot::plot_grid(int.batch.plot, int.cell.plot,
    ncol = 2, align = "vh"
)

0.7 Clustering

Run graph-based clustering with the scran function clusterCells (see Chapter 5 Clustering - OSCA manual).

# Graph-based clustering on the integrated PCA w/ 'scran' package
blusparams <- bluster::SNNGraphParam(k = 15, cluster.fun = "louvain")
set.seed(123)
sce$cluster <- scran::clusterCells(sce,
    use.dimred = "intPCA",
    BLUSPARAM = blusparams
)

# Plot clustering
clt.plot <- PlotDimRed(
    object = sce,
    color.by = "cluster",
    dimred = "intTSNE",
    point.size = 0.01,
    legend.nrow = 3,
    seed.color = 65
)
cowplot::plot_grid(int.batch.plot, int.cell.plot,
    clt.plot,
    ncol = 3, align = "h"
)

0.8 Cluster markers

Identify the cluster markers by running the Coralysis function FindAllClusterMarkers. Provide the clustering.label, in this case, the label used above, i.e., cluster. The top three positive markers per cluster were retrieved and plotted below using the Coralysis function HeatmapFeatures.

# Cluster markers
cluster.markers <- FindAllClusterMarkers(object = sce, clustering.label = "cluster")

## -----------------------------------
## testing cluster 1
## 496 features left after min.pct filtering
## 496 features left after min.diff.pct filtering
## 191 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 2
## 497 features left after min.pct filtering
## 497 features left after min.diff.pct filtering
## 183 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 3
## 499 features left after min.pct filtering
## 499 features left after min.diff.pct filtering
## 189 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 4
## 498 features left after min.pct filtering
## 498 features left after min.diff.pct filtering
## 263 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 5
## 498 features left after min.pct filtering
## 498 features left after min.diff.pct filtering
## 277 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 6
## 498 features left after min.pct filtering
## 498 features left after min.diff.pct filtering
## 399 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 7
## 499 features left after min.pct filtering
## 499 features left after min.diff.pct filtering
## 226 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 8
## 497 features left after min.pct filtering
## 497 features left after min.diff.pct filtering
## 210 features left after log2fc.threshold filtering
## -----------------------------------

# Select the top 3 positive markers per cluster
top3.markers <- lapply(X = split(x = cluster.markers, f = cluster.markers$cluster), FUN = function(x) {
    head(x[order(x$log2FC, decreasing = TRUE), ], n = 3)
})
top3.markers <- do.call(rbind, top3.markers)
top3.markers <- top3.markers[order(as.numeric(top3.markers$cluster)), ]

# Heatmap of the top 3 positive markers per cluster
HeatmapFeatures(
    object = sce,
    clustering.label = "cluster",
    features = top3.markers$marker,
    seed.color = 65
)

0.9 DGE

Differential gene expression (DGE) between cluster 3 and 6 corresponding roughly to Visceral endoderm and ExE endoderm, respectively. The genes Ttr, Rbp4, Apoa1, Apom and Ctsl are upregulated in cluster 6, while Tmsb10 is upregulated in cluster 3.

# DGE analysis: cluster 3 vs 6
dge.clt3vs6 <- FindClusterMarkers(sce,
    clustering.label = "cluster",
    clusters.1 = "3",
    clusters.2 = "6"
)

## testing cluster group.1
## 473 features left after min.pct filtering
## 473 features left after min.diff.pct filtering
## 356 features left after log2fc.threshold filtering

head(dge.clt3vs6[order(abs(dge.clt3vs6$log2FC), decreasing = TRUE), ])

##             p.value  adj.p.value    log2FC     pct.1     pct.2  diff.pct marker
## Ttr    1.312605e-26 6.563025e-24 -3.066542 0.5500000 1.0000000 0.4500000    Ttr
## Apoa1  2.699740e-25 1.349870e-22 -2.963640 0.5833333 1.0000000 0.4166667  Apoa1
## Rbp4   7.497694e-27 3.748847e-24 -2.814879 0.5333333 1.0000000 0.4666667   Rbp4
## Apom   8.502979e-27 4.251489e-24 -2.400533 0.5666667 1.0000000 0.4333333   Apom
## Tmsb10 9.995506e-26 4.997753e-23  2.137930 1.0000000 0.6315789 0.3684211 Tmsb10
## Ctsl   4.817850e-23 2.408925e-20 -2.052262 0.8666667 1.0000000 0.1333333   Ctsl

top6.degs <- head(dge.clt3vs6[order(abs(dge.clt3vs6$log2FC),
    decreasing = TRUE
), "marker"])
exp.plots <- lapply(X = top6.degs, FUN = function(x) {
    PlotExpression(
        object = sce, color.by = x,
        scale.values = TRUE, point.size = 0.5, point.stroke = 0.5
    )
})
cowplot::plot_grid(plotlist = exp.plots, align = "vh", ncol = 3)

0.10 R session

# R session
sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] MouseGastrulationData_1.23.0 SpatialExperiment_1.19.1    
##  [3] scran_1.37.0                 ensembldb_2.33.1            
##  [5] AnnotationFilter_1.33.0      GenomicFeatures_1.61.4      
##  [7] AnnotationDbi_1.71.0         ComplexHeatmap_2.25.2       
##  [9] Coralysis_0.99.10            scRNAseq_2.23.0             
## [11] scater_1.37.0                ggplot2_3.5.2               
## [13] scuttle_1.19.0               SingleCellExperiment_1.31.1 
## [15] SummarizedExperiment_1.39.1  Biobase_2.69.0              
## [17] GenomicRanges_1.61.1         Seqinfo_0.99.1              
## [19] IRanges_2.43.0               S4Vectors_0.47.0            
## [21] BiocGenerics_0.55.0          generics_0.1.4              
## [23] MatrixGenerics_1.21.0        matrixStats_1.5.0           
## [25] dplyr_1.1.4                  BiocStyle_2.37.0            
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22         BiocIO_1.19.0            bitops_1.0-9            
##   [4] filelock_1.0.3           tibble_3.3.0             XML_3.99-0.18           
##   [7] lifecycle_1.0.4          httr2_1.2.0              aricode_1.0.3           
##  [10] edgeR_4.7.3              doParallel_1.0.17        lattice_0.22-7          
##  [13] alabaster.base_1.9.4     magrittr_2.0.3           limma_3.65.1            
##  [16] sass_0.4.10              rmarkdown_2.29           jquerylib_0.1.4         
##  [19] yaml_2.3.10              metapod_1.17.0           askpass_1.2.1           
##  [22] reticulate_1.42.0        cowplot_1.2.0            LiblineaR_2.10-24       
##  [25] DBI_1.2.3                RColorBrewer_1.1-3       abind_1.4-8             
##  [28] Rtsne_0.17               purrr_1.1.0              BumpyMatrix_1.17.0      
##  [31] RCurl_1.98-1.17          rappdirs_0.3.3           circlize_0.4.16         
##  [34] ggrepel_0.9.6            irlba_2.3.5.1            alabaster.sce_1.9.0     
##  [37] pheatmap_1.0.13          umap_0.2.10.0            RSpectra_0.16-2         
##  [40] dqrng_0.4.1              codetools_0.2-20         DelayedArray_0.35.2     
##  [43] shape_1.4.6.1            tidyselect_1.2.1         UCSC.utils_1.5.0        
##  [46] farver_2.1.2             ScaledMatrix_1.17.0      viridis_0.6.5           
##  [49] BiocFileCache_2.99.5     GenomicAlignments_1.45.1 jsonlite_2.0.0          
##  [52] GetoptLong_1.0.5         BiocNeighbors_2.3.1      iterators_1.0.14        
##  [55] foreach_1.5.2            tools_4.5.1              Rcpp_1.1.0              
##  [58] glue_1.8.0               gridExtra_2.3            SparseArray_1.9.0       
##  [61] xfun_0.52                flexclust_1.5.0          GenomeInfoDb_1.45.7     
##  [64] HDF5Array_1.37.0         gypsum_1.5.0             withr_3.0.2             
##  [67] BiocManager_1.30.26      fastmap_1.2.0            rhdf5filters_1.21.0     
##  [70] bluster_1.19.0           openssl_2.3.3            SparseM_1.84-2          
##  [73] digest_0.6.37            rsvd_1.0.5               R6_2.6.1                
##  [76] colorspace_2.1-1         Cairo_1.6-2              dichromat_2.0-0.1       
##  [79] RSQLite_2.4.1            h5mread_1.1.1            rtracklayer_1.69.1      
##  [82] class_7.3-23             httr_1.4.7               S4Arrays_1.9.1          
##  [85] uwot_0.2.3               pkgconfig_2.0.3          gtable_0.3.6            
##  [88] modeltools_0.2-24        blob_1.2.4               XVector_0.49.0          
##  [91] htmltools_0.5.8.1        bookdown_0.43            clue_0.3-66             
##  [94] ProtGenerics_1.41.0      scales_1.4.0             alabaster.matrix_1.9.0  
##  [97] png_0.1-8                knitr_1.50               reshape2_1.4.4          
## [100] rjson_0.2.23             curl_6.4.0               GlobalOptions_0.1.2     
## [103] cachem_1.1.0             rhdf5_2.53.1             stringr_1.5.1           
## [106] BiocVersion_3.22.0       parallel_4.5.1           vipor_0.4.7             
## [109] ggrastr_1.0.2            restfulr_0.0.16          pillar_1.11.0           
## [112] alabaster.schemas_1.9.0  vctrs_0.6.5              RANN_2.6.2              
## [115] BiocSingular_1.25.0      dbplyr_2.5.0             beachmat_2.25.2         
## [118] cluster_2.1.8.1          beeswarm_0.4.0           evaluate_1.0.4          
## [121] magick_2.8.7             tinytex_0.57             cli_3.6.5               
## [124] locfit_1.5-9.12          compiler_4.5.1           Rsamtools_2.25.1        
## [127] rlang_1.1.6              crayon_1.5.3             labeling_0.4.3          
## [130] plyr_1.8.9               ggbeeswarm_0.7.2         stringi_1.8.7           
## [133] viridisLite_0.4.2        alabaster.se_1.9.0       BiocParallel_1.43.4     
## [136] Biostrings_2.77.2        lazyeval_0.2.2           Matrix_1.7-3            
## [139] ExperimentHub_2.99.5     sparseMatrixStats_1.21.0 bit64_4.6.0-1           
## [142] Rhdf5lib_1.31.0          KEGGREST_1.49.1          statmod_1.5.0           
## [145] alabaster.ranges_1.9.0   AnnotationHub_3.99.6     igraph_2.1.4            
## [148] memoise_2.0.1            bslib_0.9.0              bit_4.6.0

0.11 References

Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. https://www.nature.com/articles/s41592-019-0654-x.

Griffiths J, Lun A (2024). “MouseGastrulationData: Single-Cell-omics Data across Mouse Gastrulation and Early Organogenesis”. 10.18129/B9.bioc.MouseGastrulationData. R package version 1.18.0.

Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res., 5, 2122. doi:10.12688/f1000research.9501.2.

Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023

Wickham H (2016). “ggplot2: Elegant Graphics for Data Analysis.” Springer-Verlag New York.