1 Installation and loading dependencies

To install this package, start R and enter (un-commented):

# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# 
# BiocManager::install("CytoMDS")

We now load the packages needed in the current vignette:

suppressPackageStartupMessages(library(HDCytoData))
library(CytoMDS)
library(ggplot2)
library(patchwork)

2 Introduction

The CytoMDS package implements low dimensional visualization of cytometry samples, in order to visually assess distances between them. This, in turn, can greatly help the user to identify quality issues like batch effects or outlier samples, and/or check the presence of potential sample clusters that might align with the experimental design.

The CytoMDS algorithm combines, on the one hand, the concept of Earth Mover’s Distance (EMD) (Orlova et al. 2016), a.k.a. Wasserstein metric and, on the other hand, the metric Multi Dimensional Scaling (MDS) algorithm for the low dimensional projection (Leeuw and Mair 2009).

Besides projection itself, the package also provides some diagnostic tools for both checking the quality of the MDS projection, as well as interpreting the axes of the projection (see below sections).

3 Illustrative dataset

The illustrative dataset that is used throughout this vignette is a mass cytometry (CyTOF) dataset from (Bodenmiller et al. 2012), and provided in the Bioconductor HDCytoData data package (Weber and Soneson (2019)).

This dataset consists of 16 paired samples (8 times 2) of peripheral blood cells from healthy individuals. Among each sample pair, one sample - the reference - was left un-stimulated, while the other sample was stimulated with B cell receptor / Fc receptor cross-linker (BCR-XL).

This public dataset is known to contain a strong differential expression signal between the two conditions (stimulated vs un-stimulated) and as been used in recent work to benchmark differential analysis algorithms ((Weber et al. 2019)) and to design mass cytometry data analysis pipelines ((Nowicka et al. 2017)).

In the CytoMDSpackage, as in the current vignette, matrices of cytometry events intensities, corresponding to one sample, are stored as flowCore::flowFrame (Ellis et al. 2023) objects. Samples of a particular cytometry dataset are then stored as a flowCore::flowSet object, which is a collection of flowFrame’s, i.e. one flowFrame per sample. Therefore, we load the flowSet version of the BodenMiller2012 dataset, obtained from the HDCytoData package.

BCRXL_fs <- HDCytoData::Bodenmiller_BCR_XL_flowSet()
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## loading from cache
## Warning in updateObjectFromSlots(object, ..., verbose = verbose): dropping
## slot(s) 'colnames' from object = 'flowSet'
BCRXL_fs
## A flowSet with 16 experiments.
## 
## column names(39): Time Cell_length ... sample_id population_id

In regular flowSet’s, the experimental design information is typically
stored in the phenoData slot, and this is also the way CytoMDS expects to get its input. However, HDCytoData has chosen to store the experimental design information in a slightly different way, hence the need to convert the data as follows:

phenoData <- flowCore::pData(BCRXL_fs)
additionalPhenoData <- 
    keyword(BCRXL_fs[[1]], "EXPERIMENT_INFO")$EXPERIMENT_INFO
phenoData <- cbind(phenoData, additionalPhenoData)

flowCore::pData(BCRXL_fs) <- phenoData

We also select channels/markers that are biologically relevant, i.e. both the cell type and cell state markers, and store them for further use. We discard the typical housekeeping markers that are founds in flowFrames like time and Cell_length, etc. In total, these mass cytometry samples contain intensities for 24 biologically relevant markers.

markerInfo <- keyword(BCRXL_fs[[1]], "MARKER_INFO")$MARKER_INFO
chClass <- markerInfo$marker_class

table(chClass)
## chClass
##  none  type state 
##    11    10    14
chLabels <- markerInfo$channel_name[chClass != "none"]
(chMarkers <- markerInfo$marker_name[chClass != "none"])
##  [1] "CD3"    "CD45"   "pNFkB"  "pp38"   "CD4"    "CD20"   "CD33"   "pStat5"
##  [9] "CD123"  "pAkt"   "pStat1" "pSHP2"  "pZap70" "pStat3" "CD14"   "pSlp76"
## [17] "pBtk"   "pPlcg2" "pErk"   "pLat"   "IgM"    "pS6"    "HLA-DR" "CD7"

The first step consists in scale transforming the raw data. Indeed, distances between samples make more sense with scaled transformed signal, in which distributional differences are more readable and usable for downstream analysis.

Here, since we are dealing with mass cytometry samples, we use the classical arcsinh() transformation with 5 as co-factor, as described elsewhere ((Nowicka et al. 2017)).

trans <- arcsinhTransform(
    transformationId="ArcsinhTransform", 
    a = 0, 
    b = 1/5, 
    c = 0)

BCRXL_fs_trans <- transform(
    BCRXL_fs,
    transformList(chLabels, trans))

4 Pairwise sample Earth Mover’s Distances

4.1 Calculating distances between samples

We can now calculate pairwise Earth Mover’s Distances (EMD) between all samples of our dataset.

This is done by calling the pairwiseEMDDist() method The simplest way to
use this method is by providing directly a flowCore::flowSet, containing all samples, as input parameter. Note that, for heavy datasets that include a lot of samples, this can create memory issues. To handle this case, CytoMDS provides other ways to call the pairwiseEMDDist() function (see ‘Handling heavy datasets’ section).

Using the channels argument, it is possible to restrict the EMD calculation to some of the channels. Here we simply pass as input the biologically relevant markers selected in the previous section.

pwDist <- pairwiseEMDDist(
    BCRXL_fs_trans,
    channels = chMarkers,
    verbose = FALSE
)
pwDistMatrix <- as.matrix(pwDist)

The return value of the pairwiseEMDDist function is a DistSum object. We can use the as.matrix() method to to convert this object into a matrix, here a symmetric square matrix, with as many rows (columns) as input samples (extract shown here below for the scale-transformed Bodenmiller2012 dataset).

round(pwDistMatrix[1:10, 1:10], 2)
##        1     2     3     4     5     6     7     8     9    10
## 1   0.00 10.46  4.30 11.18  6.39 12.69  7.11 11.85  5.88 10.13
## 2  10.46  0.00 10.91  3.16 11.45  6.06 13.17 10.61  9.61  6.08
## 3   4.30 10.91  0.00 10.72  7.44 13.17  7.47 12.84  5.84 10.00
## 4  11.18  3.16 10.72  0.00 12.10  5.97 12.66  9.63 10.35  6.72
## 5   6.39 11.45  7.44 12.10  0.00  9.19  4.45 10.19  5.28  9.96
## 6  12.69  6.06 13.17  5.97  9.19  0.00 10.56  7.23 11.61  7.74
## 7   7.11 13.17  7.47 12.66  4.45 10.56  0.00  7.24  8.89 11.87
## 8  11.85 10.61 12.84  9.63 10.19  7.23  7.24  0.00 14.50 11.68
## 9   5.88  9.61  5.84 10.35  5.28 11.61  8.89 14.50  0.00  7.10
## 10 10.13  6.08 10.00  6.72  9.96  7.74 11.87 11.68  7.10  0.00

One relevant way to visualize this distance matrix is to draw the histogram of pairwise distances, as shown in the below plot.

distVec <- pwDistMatrix[upper.tri(pwDist)]
distVecDF <- data.frame(dist = distVec)
pHist <- ggplot(distVecDF, mapping = aes(x=dist)) + 
    geom_histogram(fill = "darkgrey", col = "black", bins = 15) + 
    theme_bw() + ggtitle("EMD distances for Bodenmiller2012 dataset")
pHist

4.2 Individual marker contribution in the distance matrix

Since in CytoMDS, the EMD is calculated as an approximation, summing over the one-dimensional marginal marker unidimensional distributions, it is possible to obtain an individual contribution of each marker to the distance matrix. This can be done using the distByFeature() method:

DF <- distByFeature(pwDist)
DF[order(DF$percentage, decreasing = TRUE),]
##        featureName distanceContrib percentage
## 3   pNFkB(Nd142)Dd      0.64393915   7.712532
## 10   pAkt(Sm152)Dd      0.62152388   7.444062
## 5     CD4(Nd145)Dd      0.58168675   6.966928
## 24    CD7(Yb176)Dd      0.55242877   6.616502
## 6    CD20(Sm147)Dd      0.51833588   6.208167
## 4    pp38(Nd144)Dd      0.50751539   6.078569
## 11 pStat1(Eu153)Dd      0.42774647   5.123168
## 23 HLA-DR(Yb174)Dd      0.39561819   4.738364
## 15   CD14(Gd160)Dd      0.39493170   4.730142
## 18 pPlcg2(Er167)Dd      0.38095496   4.562741
## 19   pErk(Er168)Dd      0.38044973   4.556689
## 22    pS6(Yb172)Dd      0.36274288   4.344613
## 20   pLat(Er170)Dd      0.34287354   4.106635
## 1   CD3(110:114)Dd      0.33383030   3.998323
## 17   pBtk(Er166)Dd      0.31021456   3.715475
## 16 pSlp76(Dy164)Dd      0.29994305   3.592452
## 14 pStat3(Gd158)Dd      0.29659401   3.552340
## 7    CD33(Nd148)Dd      0.20340106   2.436158
## 9   CD123(Eu151)Dd      0.17773632   2.128768
## 13 pZap70(Gd156)Dd      0.16365740   1.960143
## 8  pStat5(Nd150)Dd      0.13870620   1.661300
## 21    IgM(Yb171)Dd      0.13134337   1.573114
## 12  pSHP2(Sm154)Dd      0.09837732   1.178276
## 2    CD45(In115)Dd      0.08470643   1.014538

These individual marker contributions can also be displayed visually, using the ggplotDistFeatureImportance() method:

pBar <- ggplotDistFeatureImportance(pwDist)
pBar