This vignette describes the functionality implemented in the CytoMDS
package. CytoMDS
provides support for low dimensional projection of a set of cytometry samples, using concepts such as Earth Mover’s (EMD) distance, and Multi Dimensional Scaling (MDS). This vignette is distributed under a CC BY-SA 4.0 license.
CytoMDS 1.3.4
To install this package, start R and enter (un-commented):
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#
# BiocManager::install("CytoMDS")
We now load the packages needed in the current vignette:
suppressPackageStartupMessages(library(HDCytoData))
library(CytoMDS)
library(ggplot2)
library(patchwork)
The CytoMDS
package implements low dimensional visualization of cytometry
samples, in order to visually assess distances between them. This, in turn,
can greatly help the user to identify quality issues like batch effects
or outlier samples, and/or check the presence of potential sample clusters
that might align with the experimental design.
The CytoMDS
algorithm combines, on the one hand, the concept of
Earth Mover’s Distance (EMD) (Orlova et al. 2016), a.k.a. Wasserstein metric
and, on the other hand, the metric Multi Dimensional Scaling (MDS) algorithm
for the low dimensional projection (Leeuw and Mair 2009).
Besides projection itself, the package also provides some diagnostic tools for both checking the quality of the MDS projection, as well as interpreting the axes of the projection (see below sections).
The illustrative dataset that is used throughout this vignette is a mass cytometry (CyTOF) dataset from (Bodenmiller et al. 2012), and provided in the Bioconductor HDCytoData data package (Weber and Soneson (2019)).
This dataset consists of 16 paired samples (8 times 2) of peripheral blood cells from healthy individuals. Among each sample pair, one sample - the reference - was left un-stimulated, while the other sample was stimulated with B cell receptor / Fc receptor cross-linker (BCR-XL).
This public dataset is known to contain a strong differential expression signal between the two conditions (stimulated vs un-stimulated) and as been used in recent work to benchmark differential analysis algorithms ((Weber et al. 2019)) and to design mass cytometry data analysis pipelines ((Nowicka et al. 2017)).
In the CytoMDS
package, as in the current vignette,
matrices of cytometry events intensities, corresponding to one sample,
are stored as flowCore::flowFrame
(Ellis et al. 2023) objects.
Samples of a particular cytometry dataset are then stored
as a flowCore::flowSet
object, which is a collection of flowFrame’s,
i.e. one flowFrame per sample. Therefore, we load the flowSet version
of the BodenMiller2012 dataset, obtained from the HDCytoData
package.
BCRXL_fs <- HDCytoData::Bodenmiller_BCR_XL_flowSet()
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## loading from cache
## Warning in updateObjectFromSlots(object, ..., verbose = verbose): dropping
## slot(s) 'colnames' from object = 'flowSet'
BCRXL_fs
## A flowSet with 16 experiments.
##
## column names(39): Time Cell_length ... sample_id population_id
In regular flowSet’s, the experimental design information is typically
stored in the phenoData
slot, and this is also the way CytoMDS
expects to
get its input. However, HDCytoData
has chosen to store the experimental
design information in a slightly different way, hence the need to convert
the data as follows:
phenoData <- flowCore::pData(BCRXL_fs)
additionalPhenoData <-
keyword(BCRXL_fs[[1]], "EXPERIMENT_INFO")$EXPERIMENT_INFO
phenoData <- cbind(phenoData, additionalPhenoData)
flowCore::pData(BCRXL_fs) <- phenoData
We also select channels/markers that are biologically relevant, i.e. both the cell type and cell state markers, and store them for further use. We discard the typical housekeeping markers that are founds in flowFrames like time and Cell_length, etc. In total, these mass cytometry samples contain intensities for 24 biologically relevant markers.
markerInfo <- keyword(BCRXL_fs[[1]], "MARKER_INFO")$MARKER_INFO
chClass <- markerInfo$marker_class
table(chClass)
## chClass
## none type state
## 11 10 14
chLabels <- markerInfo$channel_name[chClass != "none"]
(chMarkers <- markerInfo$marker_name[chClass != "none"])
## [1] "CD3" "CD45" "pNFkB" "pp38" "CD4" "CD20" "CD33" "pStat5"
## [9] "CD123" "pAkt" "pStat1" "pSHP2" "pZap70" "pStat3" "CD14" "pSlp76"
## [17] "pBtk" "pPlcg2" "pErk" "pLat" "IgM" "pS6" "HLA-DR" "CD7"
The first step consists in scale transforming the raw data. Indeed, distances between samples make more sense with scaled transformed signal, in which distributional differences are more readable and usable for downstream analysis.
Here, since we are dealing with mass cytometry samples, we use the classical
arcsinh()
transformation with 5 as co-factor, as described elsewhere
((Nowicka et al. 2017)).
trans <- arcsinhTransform(
transformationId="ArcsinhTransform",
a = 0,
b = 1/5,
c = 0)
BCRXL_fs_trans <- transform(
BCRXL_fs,
transformList(chLabels, trans))
We can now calculate pairwise Earth Mover’s Distances (EMD) between all samples of our dataset.
This is done by calling the pairwiseEMDDist()
method The simplest way to
use this method is by providing directly a flowCore::flowSet
, containing all
samples, as input parameter. Note that, for heavy datasets that include
a lot of samples, this can create memory issues. To handle this case, CytoMDS
provides other ways to call the pairwiseEMDDist()
function
(see ‘Handling heavy datasets’ section).
Using the channels
argument, it is possible to restrict the EMD calculation
to some of the channels. Here we simply pass as input the biologically
relevant markers selected in the previous section.
pwDist <- pairwiseEMDDist(
BCRXL_fs_trans,
channels = chMarkers,
verbose = FALSE
)
pwDistMatrix <- as.matrix(pwDist)
The return value of the pairwiseEMDDist
function is a DistSum
object.
We can use the as.matrix()
method to to convert this object into a matrix,
here a symmetric square matrix, with as many rows (columns) as input samples
(extract shown here below for the scale-transformed Bodenmiller2012 dataset).
round(pwDistMatrix[1:10, 1:10], 2)
## 1 2 3 4 5 6 7 8 9 10
## 1 0.00 10.46 4.30 11.18 6.39 12.69 7.11 11.85 5.88 10.13
## 2 10.46 0.00 10.91 3.16 11.45 6.06 13.17 10.61 9.61 6.08
## 3 4.30 10.91 0.00 10.72 7.44 13.17 7.47 12.84 5.84 10.00
## 4 11.18 3.16 10.72 0.00 12.10 5.97 12.66 9.63 10.35 6.72
## 5 6.39 11.45 7.44 12.10 0.00 9.19 4.45 10.19 5.28 9.96
## 6 12.69 6.06 13.17 5.97 9.19 0.00 10.56 7.23 11.61 7.74
## 7 7.11 13.17 7.47 12.66 4.45 10.56 0.00 7.24 8.89 11.87
## 8 11.85 10.61 12.84 9.63 10.19 7.23 7.24 0.00 14.50 11.68
## 9 5.88 9.61 5.84 10.35 5.28 11.61 8.89 14.50 0.00 7.10
## 10 10.13 6.08 10.00 6.72 9.96 7.74 11.87 11.68 7.10 0.00
One relevant way to visualize this distance matrix is to draw the histogram of pairwise distances, as shown in the below plot.
distVec <- pwDistMatrix[upper.tri(pwDist)]
distVecDF <- data.frame(dist = distVec)
pHist <- ggplot(distVecDF, mapping = aes(x=dist)) +
geom_histogram(fill = "darkgrey", col = "black", bins = 15) +
theme_bw() + ggtitle("EMD distances for Bodenmiller2012 dataset")
pHist
Since in CytoMDS, the EMD is calculated as an approximation, summing over
the one-dimensional marginal marker unidimensional distributions,
it is possible to obtain an individual contribution of each marker
to the distance matrix. This can be done using the distByFeature()
method:
DF <- distByFeature(pwDist)
DF[order(DF$percentage, decreasing = TRUE),]
## featureName distanceContrib percentage
## 3 pNFkB(Nd142)Dd 0.64393915 7.712532
## 10 pAkt(Sm152)Dd 0.62152388 7.444062
## 5 CD4(Nd145)Dd 0.58168675 6.966928
## 24 CD7(Yb176)Dd 0.55242877 6.616502
## 6 CD20(Sm147)Dd 0.51833588 6.208167
## 4 pp38(Nd144)Dd 0.50751539 6.078569
## 11 pStat1(Eu153)Dd 0.42774647 5.123168
## 23 HLA-DR(Yb174)Dd 0.39561819 4.738364
## 15 CD14(Gd160)Dd 0.39493170 4.730142
## 18 pPlcg2(Er167)Dd 0.38095496 4.562741
## 19 pErk(Er168)Dd 0.38044973 4.556689
## 22 pS6(Yb172)Dd 0.36274288 4.344613
## 20 pLat(Er170)Dd 0.34287354 4.106635
## 1 CD3(110:114)Dd 0.33383030 3.998323
## 17 pBtk(Er166)Dd 0.31021456 3.715475
## 16 pSlp76(Dy164)Dd 0.29994305 3.592452
## 14 pStat3(Gd158)Dd 0.29659401 3.552340
## 7 CD33(Nd148)Dd 0.20340106 2.436158
## 9 CD123(Eu151)Dd 0.17773632 2.128768
## 13 pZap70(Gd156)Dd 0.16365740 1.960143
## 8 pStat5(Nd150)Dd 0.13870620 1.661300
## 21 IgM(Yb171)Dd 0.13134337 1.573114
## 12 pSHP2(Sm154)Dd 0.09837732 1.178276
## 2 CD45(In115)Dd 0.08470643 1.014538
These individual marker contributions can also be displayed visually, using
the ggplotDistFeatureImportance()
method:
pBar <- ggplotDistFeatureImportance(pwDist)
pBar