CatsCradle SingleCellExperiment Quick Start

Anna Laddach and Michael Shapiro

2025-02-13

BiocStyle

Introduction

CatsCradle provides two types of functionality for analysing single cell data. It provides tools for the clustering and annotation of genes and it provides extensive tools for analysing spatial transcriptomics data. Here we exposit these capabilities using the class SingleCellExperiment.

Clustering and annotation of genes

A typical analysis of scRNA-seq data involves dimensionality reduction (PCA, UMAP, tSNE) and the clustering of cells using the Louvain algorithm. All of this is done based on the similarities and differences in the genes cells express. Here we see a UMAP where the points are cells, coloured by their clusters.

library(SingleCellExperiment,quietly=TRUE)
library(CatsCradle,quietly=TRUE)
library(ggplot2,quietly=TRUE)
getExample = make.getExample()
S_sce = getExample('S_sce')
#> Warning: Layer 'counts' is empty
#> Warning: Layer 'scale.data' is empty
df = as.data.frame(reducedDim(S_sce,'UMAP'))
df$seurat_clusters = S_sce$seurat_clusters
g = ggplot(df,aes(x=UMAP_1,y=UMAP_2,color=seurat_clusters)) +
    geom_point() +
    ggtitle('SingleCellExperiment of cells colored by seurat_cluster')
print(g)    

By transposing the expression matrix, we can use the same technology to produce an object where the samples are the genes and the features are cells. Genes now have their own UMAP and their own clusters. This is all done on the basis of the similarities and differences in the cells they are expressed in.

STranspose_sce = getExample('STranspose_sce')
print(STranspose_sce)
#> class: SingleCellExperiment 
#> dim: 540 2000 
#> metadata(0):
#> assays(3): counts logcounts scaledata
#> rownames(540): AAAGAACTCATCGCTC-1-1 AACAAAGTCGTAGGGA-1-1 ...
#>   TTCTTGATCTGAACGT-1-8 TTTCAGTAGCGTCTCG-1-8
#> rowData names(0):
#> colnames(2000): Ctsb H2-Ab1 ... Phkg1 Gm9530
#> colData names(6): orig.ident nCount_RNA ... seurat_clusters ident
#> reducedDimNames(2): PCA UMAP
#> mainExpName: RNA
#> altExpNames(0):
df = as.data.frame(reducedDim(STranspose_sce,'UMAP'))
df$seurat_clusters = STranspose_sce$seurat_clusters
h = ggplot(df,aes(x=umap_1,y=umap_2,color=seurat_clusters)) +
    geom_point() +
    ggtitle('SingleCellExperiment of genes colored by cluster')
print(h)  

Gene clusters vs. cell clusters

There are a number of ways of investigating the relationship between gene clusters and cell clusters. One is by computing the average expression of each gene cluster in each cell cluster; this can be displayed as a heat map or a Sankey graph - the cat’s cradle of our CatsCradle. (See getAverageExpressionMatrix() and sankeyFromMatrix() in the CatsCradle vignette CatsCradle.)

Spatial co-location of genes on gene UMAP

Gene sets such as Hallmark tend to localise on gene UMAP, though they are not necessarily confined to specific gene clusters. Here we show HALLMARK_OXIDATIVE_PHOSPHORYLATION (subset to the genes in our SingleCellExperiment object) superimposed in green and black on the gene cluster UMAP.

hallmark = getExample('hallmark')
h = 'HALLMARK_OXIDATIVE_PHOSPHORYLATION'

idx = colnames(STranspose_sce) %in% hallmark[[h]]

k = ggplot(data=df,aes(x=umap_1,y=umap_2,color=seurat_clusters)) +
    geom_point() + 
    geom_point(data=df[idx,],aes(x=umap_1,y=umap_2),color='black',size=2.7) +
    geom_point(data=df[idx,],aes(x=umap_1,y=umap_2),color='green') +
    ggtitle(paste(h,'\non gene clusters'))

print(k)

pValue = getObjectSubsetClusteringPValue(STranspose_sce,idx)
pValue
#> [1] 0.001

The p-value here is limited by the default number of randomisation trials in getSubsetClusteringPValue(). Of the 50 Hallmark gene sets, 31 have clustering p-values less than 0.05.

The computation of these p-values is ultimately carried out by medianComplementPValue(). We wish to determine whether a subset X of a set of points S is randomly scattered across S or is somehow localised. To do this we consider the complement C of X in S. If X is “spread out” randomly across S, every point of C will be close to some point in X. Accordingly, the median distance from points in C to their nearest point in X will be small. If on the contrary, X is localised, many points in C will be distant from X, thereby producing a larger median complement distance. We produce a p-value by comparing the median complement distance for X to those for randomised sets of the same size as X.

Gene annotation

This allows one to predict functions of a given gene by looking at annotations of its neighbouring genes. If a gene has its own annotation and also has neighbours which have annotations, we can compare the actual annotation to the predicted annotation. These predictions perform well above chance as described in the section Predicting gene function of the CatsCradle vignetteCatsCradle .

Analysis of spatial transcriptomic data

CatsCradle provides extensive tools for the analysis of spatial transcriptomics data. Here we see cell type plotted on the cell centroids of our example data.

library(SpatialExperiment)
X_sce = getExample('X_sce')
cellCoords = as.data.frame(spatialCoords(X_sce))
cellCoords$cluster = X_sce$cluster
ell = ggplot(cellCoords,aes(x=x,y=y,color=cluster)) +
      geom_point() +
      ggtitle('Spatial coordinates colored by cell type')
print(ell)      

With Moran’s I, we can see spatial autocorrelation of gene expression and compute p-values for this. See CatsCradle spatial vignette CatsCradleSpatial.

Neighbourhoods

The key concept in our analysis of this kind of data is the neighbourhood. In general, a neighbourhood is a contiguous set of cells. In all of our use cases, each neighbourhood is set of cells around a given cell. Accordingly, a neighbourhood can be referred to by the name of the cell at its centre.

The simplest type of neighbourhood arises from Delaunay triangulation where each neighbourhood consists of a cell and its immediate neighbours. This is an undirected graph in which each cell has an edge connecting it to each of these neighbours.

delaunayNeighbours = getExample('delaunayNeighbours')
head(delaunayNeighbours,10)
#>    nodeA nodeB
#> 1  16307 16316
#> 2  16314 16317
#> 3  16296 16299
#> 4  16299 16307
#> 5  16309 16316
#> 6  16309 16314
#> 7  12028 12032
#> 8  12032 16914
#> 9  12032 16257
#> 10  2975  3339

We can also make extended neighbourhoods, e.g., by finding the neighbours of each cell, their neighbours, and their neighbours. getExtendedNBHDs() produces a list characterising these neighbours by their combinatorial radius and collapseExtendedNBHDs() collapses these so that all the extended neighbours of each cell are now treated as neighbours.

In addition, neighbourhoods can be found on the basis of Euclidean distance.

For each cell type A, we can ask “What types of cells are found around cells of type A?” We can display the answer in a directed graph. Here we are using an extended neighbourhood. A fat arrow from type A to type B indicates that neighbourhoods around cells of type A have a high proportion of cells of type B. Here we only display an arrow from A to B when cells of type B compose at least 5 percent of the cells in neighbourhoods around type A.

NBHDByCTMatrixExtended = getExample('NBHDByCTMatrixExtended')
#> radius 2
#> radius 3
#> radius 4
clusters = getExample('clusters')
colours = getExample('colours')
cellTypesPerCellTypeMatrixExtended =
      computeCellTypesPerCellTypeMatrix(NBHDByCTMatrixExtended,clusters)

cellTypesPerCellTypeGraphFromCellMatrix(cellTypesPerCellTypeMatrixExtended,
                                        minWeight = 0.05, colours = colours)

#> IGRAPH ab39f20 DNW- 16 62 -- 
#> + attr: coords (g/n), name (v/c), color (v/c), weight (e/n), width
#> | (e/n)
#> + edges from ab39f20 (vertex names):
#>  [1] 0_Oligo->3_Endo       0_Oligo->10_Astro     0_Oligo->11_Oligo    
#>  [4] 0_Oligo->14_Ependymal 3_Endo ->0_Oligo      3_Endo ->5_Astro     
#>  [7] 3_Endo ->6_VLMC       3_Endo ->10_Astro     3_Endo ->11_Oligo    
#> [10] 3_Endo ->14_Ependymal 4_Astro->0_Oligo      4_Astro->5_Astro     
#> [13] 4_Astro->6_VLMC       4_Astro->7_Granule    4_Astro->10_Astro    
#> [16] 4_Astro->11_Oligo     4_Astro->14_Ependymal 4_Astro->20_VLMC     
#> [19] 5_Astro->6_VLMC       5_Astro->11_Oligo     6_VLMC ->10_Astro    
#> + ... omitted several edges

We can also test for statistical significance for the enrichment of a given cell type in the neighbourhoods around another cell type (see CatsCradleSpatial).

Neighourhood objects.

Each neighbourhood is the neighbourhood of a cell. Accordingly, neighbourhoods naturally want to be encoded into single cell objects. There are two ways to do this.

Neighbourhoods and cell types

Just as cells express genes, with poetic license we can say that neighbourhoods “express” cell types. A cell types by neighbourhoods matrix gives the counts of cells of each type in each neighbourhood. This can be taken as the counts matrix for a SingleCellExperiment object. We can then use the Louvain algorithm to cluster neighbourhoods into neighbourhood types. If we like, we can display these on a neighbourhood UMAP.

Here we attach these neighbourhood_clusters from a neighbourhood object (created by computeNBHDVsCTObject()) to our original spatial object and visualize the neighbourhood clusters on the tissue coordinates. (Again, we are using extended neighbourhoods.)

NBHDByCTSingleCellExtended_sce = getExample('NBHDByCTSingleCellExtended_sce')
#> Warning: Feature names cannot have underscores ('_'), replacing with dashes
#> ('-')
#> Warning: Data is of class matrix. Coercing to dgCMatrix.
#> Warning in svd.function(A = t(x = object), nv = npcs, ...): You're computing
#> too large a percentage of total singular values, use a standard svd instead.
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#> 
#> Number of nodes: 4261
#> Number of edges: 105168
#> 
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.9694
#> Number of communities: 8
#> Elapsed time: 0 seconds
cellCoords$neighbourhood_clusters =
  NBHDByCTSingleCellExtended_sce$neighbourhood_clusters
nbhdPlot = ggplot(cellCoords,aes(x=x,y=y,color=neighbourhood_clusters)) +
           geom_point() +
       ggtitle('Cell coordinates colored by neighbourhood_clusters')
print(nbhdPlot)