Contents

Here, we demonstrate a grid search of clustering parameters with a mouse hippocampus VeraFISH dataset. BANKSY currently provides four algorithms for clustering the BANKSY matrix with clusterBanksy: Leiden (default), Louvain, k-means, and model-based clustering. In this vignette, we run only Leiden clustering. See ?clusterBanksy for more details on the parameters for different clustering methods.

1 Loading the data

The dataset comprises gene expression for 10,944 cells and 120 genes in 2 spatial dimensions. See ?Banksy::hippocampus for more details.

# Load libs
library(Banksy)

library(SummarizedExperiment)
library(SpatialExperiment)
library(scuttle)

library(scater)
library(cowplot)
library(ggplot2)

# Load data
data(hippocampus)
gcm <- hippocampus$expression
locs <- as.matrix(hippocampus$locations)

Here, gcm is a gene by cell matrix, and locs is a matrix specifying the coordinates of the centroid for each cell.

head(gcm[,1:5])
#>         cell_1276 cell_8890 cell_691 cell_396 cell_9818
#> Sparcl1        45         0       11       22         0
#> Slc1a2         17         0        6        5         0
#> Map            10         0       12       16         0
#> Sqstm1         26         0        0        2         0
#> Atp1a2          0         0        4        3         0
#> Tnc             0         0        0        0         0
head(locs)
#>                 sdimx    sdimy
#> cell_1276  -13372.899 15776.37
#> cell_8890    8941.101 15866.37
#> cell_691   -14882.899 15896.37
#> cell_396   -15492.899 15835.37
#> cell_9818   11308.101 15846.37
#> cell_11310  14894.101 15810.37

Initialize a SpatialExperiment object and perform basic quality control. We keep cells with total transcript count within the 5th and 98th percentile:

se <- SpatialExperiment(assay = list(counts = gcm), spatialCoords = locs)
colData(se) <- cbind(colData(se), spatialCoords(se))

# QC based on total counts
qcstats <- perCellQCMetrics(se)
thres <- quantile(qcstats$total, c(0.05, 0.98))
keep <- (qcstats$total > thres[1]) & (qcstats$total < thres[2])
se <- se[, keep]

Next, perform normalization of the data.

# Normalization to mean library size
se <- computeLibraryFactors(se)
aname <- "normcounts"
assay(se, aname) <- normalizeCounts(se, log = FALSE)

2 Parameters

BANKSY has a few key parameters. We describe these below.

2.1 AGF usage

For characterising neighborhoods, BANKSY computes the weighted neighborhood mean (H_0) and the azimuthal Gabor filter (H_1), which estimates gene expression gradients. Setting compute_agf=TRUE computes both H_0 and H_1.

2.2 k-geometric

k_geom specifies the number of neighbors used to compute each H_m for m=0,1. If a single value is specified, the same k_geom will be used for each feature matrix. Alternatively, multiple values of k_geom can be provided for each feature matrix. Here, we use k_geom[1]=15 and k_geom[2]=30 for H_0 and H_1 respectively. More neighbors are used to compute gradients.

We compute the neighborhood feature matrices using normalized expression (normcounts in the se object).

k_geom <- c(15, 30)
se <- computeBanksy(se, assay_name = aname, compute_agf = TRUE, k_geom = k_geom)
#> Computing neighbors...
#> Spatial mode is kNN_median
#> Parameters: k_geom=15
#> Done
#> Computing neighbors...
#> Spatial mode is kNN_median
#> Parameters: k_geom=30
#> Done
#> Computing harmonic m = 0
#> Using 15 neighbors
#> Done
#> Computing harmonic m = 1
#> Using 30 neighbors
#> Centering
#> Done

computeBanksy populates the assays slot with H_0 and H_1 in this instance:

se
#> class: SpatialExperiment 
#> dim: 120 10205 
#> metadata(1): BANKSY_params
#> assays(4): counts normcounts H0 H1
#> rownames(120): Sparcl1 Slc1a2 ... Notch3 Egfr
#> rowData names(0):
#> colnames(10205): cell_1276 cell_691 ... cell_11635 cell_10849
#> colData names(4): sample_id sdimx sdimy sizeFactor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : sdimx sdimy
#> imgData names(1): sample_id

2.3 lambda

The lambda parameter is a mixing parameter in [0,1] which determines how much spatial information is incorporated for downstream analysis. With smaller values of lambda, BANKY operates in cell-typing mode, while at higher levels of lambda, BANKSY operates in domain-finding mode. As a starting point, we recommend lambda=0.2 for cell-typing and lambda=0.8 for zone-finding. Here, we run lambda=0 which corresponds to non-spatial clustering, and lambda=0.2 for spatially-informed cell-typing. We compute PCs with and without the AGF (H_1).

lambda <- c(0, 0.2)
se <- runBanksyPCA(se, use_agf = c(FALSE, TRUE), lambda = lambda, seed = 1000)
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000

runBanksyPCA populates the reducedDims slot, with each combination of use_agf and lambda provided.

reducedDimNames(se)
#> [1] "PCA_M0_lam0"   "PCA_M0_lam0.2" "PCA_M1_lam0"   "PCA_M1_lam0.2"

2.4 Clustering parameters

Next, we cluster the BANKSY embedding with Leiden graph-based clustering. This admits two parameters: k_neighbors and resolution. k_neighbors determines the number of k nearest neighbors used to construct the shared nearest neighbors graph. Leiden clustering is then performed on the resultant graph with resolution resolution. For reproducibiltiy we set a seed for each parameter combination.

k <- 50
res <- 1
se <- clusterBanksy(se, use_agf = c(FALSE, TRUE), lambda = lambda, k_neighbors = k, resolution = res, seed = 1000)
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000

clusterBanksy populates colData(se) with cluster labels:

colnames(colData(se))
#> [1] "sample_id"                "sdimx"                   
#> [3] "sdimy"                    "sizeFactor"              
#> [5] "clust_M0_lam0_k50_res1"   "clust_M0_lam0.2_k50_res1"
#> [7] "clust_M1_lam0_k50_res1"   "clust_M1_lam0.2_k50_res1"

3 Comparing cluster results

To compare clustering runs visually, different runs can be relabeled to minimise their differences with connectClusters:

se <- connectClusters(se)
#> clust_M1_lam0_k50_res1 --> clust_M0_lam0_k50_res1
#> clust_M0_lam0.2_k50_res1 --> clust_M1_lam0_k50_res1
#> clust_M1_lam0.2_k50_res1 --> clust_M0_lam0.2_k50_res1

Visualise spatial coordinates with cluster labels.

cnames <- colnames(colData(se))
cnames <- cnames[grep("^clust", cnames)]
cplots <- lapply(cnames, function(cnm) {
    plotColData(se, x = "sdimx", y = "sdimy", point_size = 0.1, colour_by = cnm) +
        coord_equal() +
        labs(title = cnm) +
        theme(legend.title = element_blank()) +
        guides(colour = guide_legend(override.aes = list(size = 2)))
})

plot_grid(plotlist = cplots, ncol = 2)

Compare all cluster outputs with compareClusters. This function computes pairwise cluster comparison metrics between the clusters in colData(se) based on adjusted Rand index (ARI):

compareClusters(se, func = "ARI")
#>                          clust_M0_lam0_k50_res1 clust_M0_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1                    1.000                    0.673
#> clust_M0_lam0.2_k50_res1                  0.673                    1.000
#> clust_M1_lam0_k50_res1                    1.000                    0.673
#> clust_M1_lam0.2_k50_res1                  0.687                    0.948
#>                          clust_M1_lam0_k50_res1 clust_M1_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1                    1.000                    0.687
#> clust_M0_lam0.2_k50_res1                  0.673                    0.948
#> clust_M1_lam0_k50_res1                    1.000                    0.687
#> clust_M1_lam0.2_k50_res1                  0.687                    1.000

or normalized mutual information (NMI):

compareClusters(se, func = "NMI")
#>                          clust_M0_lam0_k50_res1 clust_M0_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1                    1.000                    0.739
#> clust_M0_lam0.2_k50_res1                  0.739                    1.000
#> clust_M1_lam0_k50_res1                    1.000                    0.739
#> clust_M1_lam0.2_k50_res1                  0.760                    0.947
#>                          clust_M1_lam0_k50_res1 clust_M1_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1                    1.000                    0.760
#> clust_M0_lam0.2_k50_res1                  0.739                    0.947
#> clust_M1_lam0_k50_res1                    1.000                    0.760
#> clust_M1_lam0.2_k50_res1                  0.760                    1.000

See ?compareClusters for the full list of comparison measures.

4 Session information

Vignette runtime:

#> Time difference of 1.086242 mins
sessionInfo()
#> R Under development (unstable) (2024-10-26 r87273 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows Server 2022 x64 (build 20348)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] ExperimentHub_2.15.0        AnnotationHub_3.15.0       
#>  [3] BiocFileCache_2.15.0        dbplyr_2.5.0               
#>  [5] spatialLIBD_1.19.6          cowplot_1.1.3              
#>  [7] scater_1.35.0               ggplot2_3.5.1              
#>  [9] harmony_1.2.3               Rcpp_1.0.13-1              
#> [11] data.table_1.16.4           scran_1.35.0               
#> [13] scuttle_1.17.0              Seurat_5.1.0               
#> [15] SeuratObject_5.0.2          sp_2.1-4                   
#> [17] SpatialExperiment_1.17.0    SingleCellExperiment_1.29.1
#> [19] SummarizedExperiment_1.37.0 Biobase_2.67.0             
#> [21] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
#> [23] IRanges_2.41.2              S4Vectors_0.45.2           
#> [25] BiocGenerics_0.53.3         generics_0.1.3             
#> [27] MatrixGenerics_1.19.0       matrixStats_1.4.1          
#> [29] Banksy_1.3.0                BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] bitops_1.0-9             spatstat.sparse_3.1-0    doParallel_1.0.17       
#>   [4] httr_1.4.7               RColorBrewer_1.1-3       tools_4.5.0             
#>   [7] sctransform_0.4.1        DT_0.33                  R6_2.5.1                
#>  [10] lazyeval_0.2.2           uwot_0.2.2               GetoptLong_1.0.5        
#>  [13] withr_3.0.2              gridExtra_2.3            progressr_0.15.1        
#>  [16] cli_3.6.3                spatstat.explore_3.3-3   fastDummies_1.7.4       
#>  [19] labeling_0.4.3           sass_0.4.9               spatstat.data_3.1-4     
#>  [22] ggridges_0.5.6           pbapply_1.7-2            Rsamtools_2.23.1        
#>  [25] dbscan_1.2-0             aricode_1.0.3            dichromat_2.0-0.1       
#>  [28] sessioninfo_1.2.2        parallelly_1.41.0        attempt_0.3.1           
#>  [31] maps_3.4.2.1             limma_3.63.2             pals_1.9                
#>  [34] RSQLite_2.3.9            BiocIO_1.17.1            shape_1.4.6.1           
#>  [37] ica_1.0-3                spatstat.random_3.3-2    dplyr_1.1.4             
#>  [40] Matrix_1.7-1             ggbeeswarm_0.7.2         abind_1.4-8             
#>  [43] lifecycle_1.0.4          yaml_2.3.10              edgeR_4.5.1             
#>  [46] SparseArray_1.7.2        Rtsne_0.17               paletteer_1.6.0         
#>  [49] grid_4.5.0               blob_1.2.4               promises_1.3.2          
#>  [52] dqrng_0.4.1              crayon_1.5.3             miniUI_0.1.1.1          
#>  [55] lattice_0.22-6           beachmat_2.23.5          mapproj_1.2.11          
#>  [58] KEGGREST_1.47.0          magick_2.8.5             ComplexHeatmap_2.23.0   
#>  [61] pillar_1.10.0            knitr_1.49               metapod_1.15.0          
#>  [64] rjson_0.2.23             future.apply_1.11.3      codetools_0.2-20        
#>  [67] leiden_0.4.3.1           glue_1.8.0               spatstat.univar_3.1-1   
#>  [70] vctrs_0.6.5              png_0.1-8                spam_2.11-0             
#>  [73] gtable_0.3.6             rematch2_2.1.2           cachem_1.1.0            
#>  [76] xfun_0.49                S4Arrays_1.7.1           mime_0.12               
#>  [79] survival_3.8-3           RcppHungarian_0.3        iterators_1.0.14        
#>  [82] tinytex_0.54             statmod_1.5.0            bluster_1.17.0          
#>  [85] fitdistrplus_1.2-1       ROCR_1.0-11              nlme_3.1-166            
#>  [88] bit64_4.5.2              filelock_1.0.3           RcppAnnoy_0.0.22        
#>  [91] bslib_0.8.0              irlba_2.3.5.1            vipor_0.4.7             
#>  [94] KernSmooth_2.23-24       colorspace_2.1-1         DBI_1.2.3               
#>  [97] tidyselect_1.2.1         bit_4.5.0.1              compiler_4.5.0          
#> [100] curl_6.0.1               BiocNeighbors_2.1.2      DelayedArray_0.33.3     
#> [103] plotly_4.10.4            rtracklayer_1.67.0       bookdown_0.41           
#> [106] scales_1.3.0             lmtest_0.9-40            rappdirs_0.3.3          
#> [109] stringr_1.5.1            digest_0.6.37            goftest_1.2-3           
#> [112] spatstat.utils_3.1-1     rmarkdown_2.29           benchmarkmeData_1.0.4   
#> [115] RhpcBLASctl_0.23-42      XVector_0.47.0           htmltools_0.5.8.1       
#> [118] pkgconfig_2.0.3          fastmap_1.2.0            GlobalOptions_0.1.2     
#> [121] rlang_1.1.4              htmlwidgets_1.6.4        UCSC.utils_1.3.0        
#> [124] shiny_1.10.0             farver_2.1.2             jquerylib_0.1.4         
#> [127] zoo_1.8-12               jsonlite_1.8.9           BiocParallel_1.41.0     
#> [130] mclust_6.1.1             config_0.3.2             RCurl_1.98-1.16         
#> [133] BiocSingular_1.23.0      magrittr_2.0.3           GenomeInfoDbData_1.2.13 
#> [136] dotCall64_1.2            patchwork_1.3.0          munsell_0.5.1           
#> [139] viridis_0.6.5            reticulate_1.40.0        leidenAlg_1.1.4         
#> [142] stringi_1.8.4            zlibbioc_1.53.0          MASS_7.3-61             
#> [145] plyr_1.8.9               parallel_4.5.0           listenv_0.9.1           
#> [148] ggrepel_0.9.6            deldir_2.0-4             Biostrings_2.75.3       
#> [151] sccore_1.0.5             splines_4.5.0            tensor_1.5              
#> [154] circlize_0.4.16          locfit_1.5-9.10          igraph_2.1.2            
#> [157] spatstat.geom_3.3-4      RcppHNSW_0.6.0           reshape2_1.4.4          
#> [160] ScaledMatrix_1.15.0      XML_3.99-0.17            BiocVersion_3.21.1      
#> [163] evaluate_1.0.1           golem_0.5.1              BiocManager_1.30.25     
#> [166] foreach_1.5.2            httpuv_1.6.15            RANN_2.6.2              
#> [169] tidyr_1.3.1              purrr_1.0.2              polyclip_1.10-7         
#> [172] benchmarkme_1.0.8        clue_0.3-66              future_1.34.0           
#> [175] scattermore_1.2          rsvd_1.0.5               xtable_1.8-4            
#> [178] restfulr_0.0.15          RSpectra_0.16-2          later_1.4.1             
#> [181] viridisLite_0.4.2        tibble_3.2.1             GenomicAlignments_1.43.0
#> [184] memoise_2.0.1            beeswarm_0.4.0           AnnotationDbi_1.69.0    
#> [187] cluster_2.1.8            shinyWidgets_0.8.7       globals_0.16.3